[erlang-bugs] net_kernel hang, perhaps blocked by busy_dist_port race?
Scott Lystig Fritchie
fritchie@REDACTED
Sun May 16 04:51:59 CEST 2010
New update: recipe to duplicate.
-Scott
This recipe works for:
Erlang/OTP R13B04 on Linux kernel 2.6.27.41-170.2.117.fc10.i686
Erlang/OTP R13B03 on same
Erlang/OTP R13B02 on same
Erlang/OTP R13B01 on same
Erlang/OTP R12B-5 on same
Erlang/OTP R11B-5 on same
The recipe requires a bit of luck and human intervention (pressing
Control-z at the right moment). But I can get the error to happen
within a few minute's worth of trying.
Step #1, in terminal #1:
Run the following command:
erl -sname foo1 -kernel net_ticktime 20 -eval 'register(foo, self()), [net_adm:ping(bar1@REDACTED) || _ <- lists:seq(1,100000)], erlang:display(done).'
Step #2, in terminal #2:
Run the following command:
erl -sname bar1 -kernel net_ticktime 20 -eval 'F = fun(Ff) -> receive X -> io:format("Got msg ~p\n", [X]), Ff(Ff) end end, spawn(fun() -> io:format("I am: ~p\n", [self()]), erlang:system_monitor(self(), [busy_port, busy_dist_port]), net_kernel:monitor_nodes(true, [{node_type, visible}, nodedown_reason]), F(F) end), L1m = lists:seq(1,1000000), [{foo, foo1@REDACTED} ! {bar, L1m} || _ <- lists:seq(1,555000)], erlang:display(done).'
As soon as you start seeting these messages in terminal #2:
Got msg {monitor,<0.2.0>,busy_dist_port,#Port<0.98>}
Got msg {monitor,<0.2.0>,busy_dist_port,#Port<0.98>}
Got msg {monitor,<0.21.0>,busy_dist_port,#Port<0.98>}
... then you're ready for step #3.
NOTE: On my machine, pid <0.2.0> is the process that is executing the
code in the "-eval" flag, and pid <0.21.0> is the 'net_kernel' process.
NOTE: Using different releases of Erlang/OTP, the 'net_kernel' pid may
vary slightly, but it's in the 20's.
Step #3, in terminal #1:
When you see the {monitor,<0.21.0>,...} message in terminal #2, press
Control-Z in terminal #1. If that message is the most recent/last
message, wait for 20 seconds or more. You probably will not see a
'{nodedown,...}' message in terminal #2.
If you weren't fast or lucky enough, type "fg" here in terminal #1 and
then press Control-z again when you're feeling fast or lucky. You're
sending big messages over to the terminal #1 node, so if you let this
run for too long, you'll run out of memory over there and crash.
Step #4, in terminal #3:
erl -sname goofus1 -kernel net_ticktime 20 -remsh bar1@REDACTED
If you get an error in 10 seconds or less, congratulations!
Erlang R13B04 (erts-5.7.5) [source] [smp:2:2] [rq:2] [async-threads:0] [hipe] [kernel-poll:false]
*** ERROR: Shell process terminated! (^G to start new job) ***
Step #5, in terminal #2:
Press Control-g, then type the following at the prompts:
--> s
--> c 2
(bar1@REDACTED)1> process_info(whereis(net_kernel)).
(bar1@REDACTED)3> io:format("~s\n", [element(2,process_info(whereis(net_kernel), backtrace))]).
In one of my attempts, I managed to get lucky and also managed to
block the 'global_name_server' process, <0.12.0>.
(bar1@REDACTED)2> process_info(pid(0,12,0)).
[{registered_name,global_name_server},
{current_function,{erlang,bif_return_trap,1}},
{initial_call,{proc_lib,init_p,5}},
{status,suspended},
{message_queue_len,0},
{messages,[]},
{links,[<0.13.0>,<0.14.0>,<0.15.0>,<0.10.0>]},
{dictionary,[{{prot_vsn,foo1@REDACTED},5},
{{sync_tag_my,foo1@REDACTED},{1273,970221,319805}},
{'$ancestors',[kernel_sup,<0.9.0>]},
{{sync_tag_his,foo1@REDACTED},{1273,970221,329161}},
{'$initial_call',{global,init,1}}]},
{trap_exit,true},
{error_handler,error_handler},
{priority,normal},
{group_leader,<0.8.0>},
{total_heap_size,987},
{heap_size,610},
{stack_size,24},
{reductions,789},
{garbage_collection,[{min_bin_vheap_size,46368},
{min_heap_size,233},
{fullsweep_after,65535},
{minor_gcs,5}]},
{suspending,[]}]
(bar1@REDACTED)4> io:format("~s\n", [element(2,process_info(pid(0,12,0), backtrace))]).
Program counter: 0x08243388 (unknown function)
CP: 0xb7704738 (global:do_monitor/1 + 172)
arity = 1
#Ref<0.0.0.265>
0xb748ab40 Return addr 0xb76fe8ec (global:insert_lock/4 + 100)
y(0) []
y(1) <4514.13.0>
0xb748ab4c Return addr 0xb76fe704 (global:handle_set_lock/3 + 236)
y(0) []
y(1) []
y(2) {state,true,[],[],[{'foo1@REDACTED',{1273,970221,319805},<0.61.0>}],[],'nonode@REDACTED',<0.13.0>,<0.14.0>,<0.15.0>,no_trace,false}
y(3) [{<0.13.0>,<0.13.0>,#Ref<0.0.0.259>}]
y(4) <4514.13.0>
y(5) {global,[<0.13.0>,<4514.13.0>]}
y(6) [<0.13.0>,<4514.13.0>]
y(7) global
0xb748ab70 Return addr 0xb76fa488 (global:handle_call/3 + 184)
0xb748ab74 Return addr 0xb769f8e4 (gen_server:handle_msg/5 + 152)
0xb748ab78 Return addr 0xb7675258 (proc_lib:init_p_do_apply/3 + 28)
y(0) global
y(1) {state,true,[],[],[{'foo1@REDACTED',{1273,970221,319805},<0.61.0>}],[],'nonode@REDACTED',<0.13.0>,<0.14.0>,<0.15.0>,no_trace,false}
y(2) global_name_server
y(3) <0.10.0>
y(4) {set_lock,{global,[<0.13.0>,<4514.13.0>]}}
y(5) {<4514.13.0>,{#Ref<4514.0.0.20846>,'bar1@REDACTED'}}
y(6) Catch 0xb769f8e4 (gen_server:handle_msg/5 + 152)
0xb748ab98 Return addr 0x0824852c (<terminate process normally>)
y(0) Catch 0xb7675268 (proc_lib:init_p_do_apply/3 + 44)
ok
More information about the erlang-bugs
mailing list