[erlang-bugs] net_kernel hang, perhaps blocked by busy_dist_port race?

Scott Lystig Fritchie fritchie@REDACTED
Sun May 16 04:51:59 CEST 2010


New update: recipe to duplicate.

-Scott

This recipe works for:

    Erlang/OTP R13B04 on Linux kernel 2.6.27.41-170.2.117.fc10.i686
    Erlang/OTP R13B03 on same
    Erlang/OTP R13B02 on same
    Erlang/OTP R13B01 on same
    Erlang/OTP R12B-5 on same
    Erlang/OTP R11B-5 on same

The recipe requires a bit of luck and human intervention (pressing
Control-z at the right moment).  But I can get the error to happen
within a few minute's worth of trying.

Step #1, in terminal #1:

Run the following command:

    erl -sname foo1 -kernel net_ticktime 20 -eval 'register(foo, self()), [net_adm:ping(bar1@REDACTED) || _ <- lists:seq(1,100000)], erlang:display(done).'

Step #2, in terminal #2:

Run the following command:

    erl -sname bar1 -kernel net_ticktime 20 -eval 'F = fun(Ff) -> receive X -> io:format("Got msg ~p\n", [X]), Ff(Ff) end end, spawn(fun() -> io:format("I am: ~p\n", [self()]), erlang:system_monitor(self(), [busy_port, busy_dist_port]), net_kernel:monitor_nodes(true, [{node_type, visible}, nodedown_reason]), F(F) end), L1m = lists:seq(1,1000000), [{foo, foo1@REDACTED} ! {bar, L1m} || _ <- lists:seq(1,555000)], erlang:display(done).'

As soon as you start seeting these messages in terminal #2:

    Got msg {monitor,<0.2.0>,busy_dist_port,#Port<0.98>}
    Got msg {monitor,<0.2.0>,busy_dist_port,#Port<0.98>}
    Got msg {monitor,<0.21.0>,busy_dist_port,#Port<0.98>}

... then you're ready for step #3.

NOTE: On my machine, pid <0.2.0> is the process that is executing the
code in the "-eval" flag, and pid <0.21.0> is the 'net_kernel' process.

NOTE: Using different releases of Erlang/OTP, the 'net_kernel' pid may
vary slightly, but it's in the 20's.

Step #3, in terminal #1:

When you see the {monitor,<0.21.0>,...} message in terminal #2, press
Control-Z in terminal #1.  If that message is the most recent/last
message, wait for 20 seconds or more.  You probably will not see a
'{nodedown,...}' message in terminal #2.

If you weren't fast or lucky enough, type "fg" here in terminal #1 and
then press Control-z again when you're feeling fast or lucky.  You're
sending big messages over to the terminal #1 node, so if you let this
run for too long, you'll run out of memory over there and crash.

Step #4, in terminal #3:

    erl -sname goofus1 -kernel net_ticktime 20 -remsh bar1@REDACTED

If you get an error in 10 seconds or less, congratulations!

    Erlang R13B04 (erts-5.7.5) [source] [smp:2:2] [rq:2] [async-threads:0] [hipe] [kernel-poll:false]
    
    *** ERROR: Shell process terminated! (^G to start new job) ***

Step #5, in terminal #2:

Press Control-g, then type the following at the prompts:

    --> s
    --> c 2
    (bar1@REDACTED)1> process_info(whereis(net_kernel)).
    (bar1@REDACTED)3> io:format("~s\n", [element(2,process_info(whereis(net_kernel), backtrace))]). 

In one of my attempts, I managed to get lucky and also managed to
block the 'global_name_server' process, <0.12.0>.

    (bar1@REDACTED)2> process_info(pid(0,12,0)).        
    [{registered_name,global_name_server},
     {current_function,{erlang,bif_return_trap,1}},
     {initial_call,{proc_lib,init_p,5}},
     {status,suspended},
     {message_queue_len,0},
     {messages,[]},
     {links,[<0.13.0>,<0.14.0>,<0.15.0>,<0.10.0>]},
     {dictionary,[{{prot_vsn,foo1@REDACTED},5},
                  {{sync_tag_my,foo1@REDACTED},{1273,970221,319805}},
                  {'$ancestors',[kernel_sup,<0.9.0>]},
                  {{sync_tag_his,foo1@REDACTED},{1273,970221,329161}},
                  {'$initial_call',{global,init,1}}]},
     {trap_exit,true},
     {error_handler,error_handler},
     {priority,normal},
     {group_leader,<0.8.0>},
     {total_heap_size,987},
     {heap_size,610},
     {stack_size,24},
     {reductions,789},
     {garbage_collection,[{min_bin_vheap_size,46368},
                          {min_heap_size,233},
                          {fullsweep_after,65535},
                          {minor_gcs,5}]},
     {suspending,[]}]

    (bar1@REDACTED)4> io:format("~s\n", [element(2,process_info(pid(0,12,0), backtrace))]).        
    Program counter: 0x08243388 (unknown function)
    CP: 0xb7704738 (global:do_monitor/1 + 172)
    arity = 1
       #Ref<0.0.0.265>
    
    0xb748ab40 Return addr 0xb76fe8ec (global:insert_lock/4 + 100)
    y(0)     []
    y(1)     <4514.13.0>
    
    0xb748ab4c Return addr 0xb76fe704 (global:handle_set_lock/3 + 236)
    y(0)     []
    y(1)     []
    y(2)     {state,true,[],[],[{'foo1@REDACTED',{1273,970221,319805},<0.61.0>}],[],'nonode@REDACTED',<0.13.0>,<0.14.0>,<0.15.0>,no_trace,false}
    y(3)     [{<0.13.0>,<0.13.0>,#Ref<0.0.0.259>}]
    y(4)     <4514.13.0>
    y(5)     {global,[<0.13.0>,<4514.13.0>]}
    y(6)     [<0.13.0>,<4514.13.0>]
    y(7)     global
    
    0xb748ab70 Return addr 0xb76fa488 (global:handle_call/3 + 184)
    
    0xb748ab74 Return addr 0xb769f8e4 (gen_server:handle_msg/5 + 152)
    
    0xb748ab78 Return addr 0xb7675258 (proc_lib:init_p_do_apply/3 + 28)
    y(0)     global
    y(1)     {state,true,[],[],[{'foo1@REDACTED',{1273,970221,319805},<0.61.0>}],[],'nonode@REDACTED',<0.13.0>,<0.14.0>,<0.15.0>,no_trace,false}
    y(2)     global_name_server
    y(3)     <0.10.0>
    y(4)     {set_lock,{global,[<0.13.0>,<4514.13.0>]}}
    y(5)     {<4514.13.0>,{#Ref<4514.0.0.20846>,'bar1@REDACTED'}}
    y(6)     Catch 0xb769f8e4 (gen_server:handle_msg/5 + 152)
    
    0xb748ab98 Return addr 0x0824852c (<terminate process normally>)
    y(0)     Catch 0xb7675268 (proc_lib:init_p_do_apply/3 + 44)
    
    ok


More information about the erlang-bugs mailing list