[erlang-bugs] R12B-3/64bit/smp Stuck disk_log_server
Geoff Cant
geoff.cant@REDACTED
Thu Jan 8 15:36:16 CET 2009
Hi all,

Mats Cronqvist suggested I take this one up on erlang-bugs. What follows
is a rough transcript of a debugging session in which we came to suspect
that an ejabberd node could not dump its mnesia logs because the
disk_log_server process was stuck, seemingly impossibly, in
gen_server:loop/6.

It would be good if someone could confirm that this reasoning is correct
(or at least plausible): that the disk_log_server is stuck, that this is
why mnesia cannot dump its logs, and that the way in which it is stuck
should not be possible.
The client on whose cluster this occurred has seen this problem before,
so we may get another chance at live debugging sometime in the near
future.

I would greatly appreciate any suggestions as to additional debugging
techniques I could try if this problem recurs.
Thank you,
--Geoff Cant
The Erlang version information is "Erlang (BEAM) emulator version 5.6.3
[source] [64-bit] [smp:8] [async-threads:0] [hipe] [kernel-poll:false]"
- the stock Debian erlang-hipe-base from lenny on amd64 hardware.

(In the transcript, node names and file paths have been slightly but
consistently rewritten to obscure some private network information.)
We tried mnesia:dump_log(), which hung, so we tried to figure out why.
mnesia_controller:get_workers(2000) => {workers,[],[],<0.22676.260>}
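As an aside: if this recurs, it might help to kick off the dump from a
throwaway process so that the shell stays free for inspection. A minimal
sketch, untried in this session:

  %% Run the dump in a separate process and print its result when (if)
  %% it ever returns, leaving the shell free for process_info/2 etc.
  spawn(fun() -> io:format("dump_log -> ~p~n", [mnesia:dump_log()]) end).

We then inspected the worker pid returned above: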
process_info(<0.22676.260>) =>
[{current_function,{gen,wait_resp_mon,3}},
{initial_call,{mnesia_controller,dump_and_reply,2}},
{status,waiting},
{message_queue_len,0},
{messages,[]},
{links,[<0.116.0>]},
{dictionary,[]},
{trap_exit,false},
{error_handler,error_handler},
{priority,normal},
{group_leader,<0.57.0>},
{total_heap_size,233},
{heap_size,233},
{stack_size,21},
{reductions,4311},
{garbage_collection,[{fullsweep_after,0},{minor_gcs,0}]},
{suspending,[]}]
Backtrace <0.22676.260>:
Program counter: 0x00007f61c0e0c2a8 (gen:wait_resp_mon/3 + 64)
CP: 0x00007f61c43645d8 (gen_server:call/3 + 160)
arity = 0
0x00007f60f844c108 Return addr 0x00007f61c43645d8 (gen_server:call/3 + 160)
y(0) infinity
y(1) #Ref<0.0.992.227032>
y(2) 'ejabberd@REDACTED'
0x00007f60f844c128 Return addr 0x00007f61c049f1d0 (mnesia_log:save_decision_tab/1 + 248)
y(0) infinity
y(1) {close_log,decision_tab}
y(2) <0.62.0>
y(3) Catch 0x00007f61c43645d8 (gen_server:call/3 + 160)
0x00007f60f844c150 Return addr 0x00007f61c03c6ec8 (mnesia_dumper:perform_dump/2 + 1648)
y(0) "/fake/path/ejabberd/DECISION_TAB.TMP"
y(1) []
0x00007f60f844c168 Return addr 0x00007f61c056bdd0 (mnesia_controller:dump_and_reply/2 + 152)
y(0) []
y(1) []
y(2) []
y(3) 15
y(4) []
y(5) []
0x00007f60f844c1a0 Return addr 0x000000000084bd18 (<terminate process normally>)
y(0) <0.116.0>
Here the log-dumping process appears to be blocked in
gen_server:call(mnesia_monitor, {close_log,decision_tab}).
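For reference, the per-process backtraces quoted in this message can be
produced from a live shell along these lines (the pid below is the dump
worker from this session):

  %% Print a stack backtrace for the stuck dumper to the node's
  %% standard error:
  erlang:process_display(pid(0,22676,260), backtrace).

  %% Or capture the same information as a binary in the shell:
  {backtrace, B} = erlang:process_info(pid(0,22676,260), backtrace),
  io:format("~s~n", [B]).

We then looked at mnesia_monitor (<0.62.0>) itself: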
process_info(<0.62.0>) =>
[{registered_name,mnesia_monitor},
{current_function,{disk_log,monitor_request,2}},
{initial_call,{proc_lib,init_p,5}},
{status,waiting},
{message_queue_len,34},
{messages,[{nodeup,'gc@REDACTED'},
{nodedown,'gc@REDACTED'},
{nodeup,'fakenode1-16-26-06@REDACTED'},
{nodedown,'fakenode1-16-26-06@REDACTED'},
{nodeup,'fakenode1-16-27-20@REDACTED'},
{nodedown,'fakenode1-16-27-20@REDACTED'},
{nodeup,'fakenode1-16-29-25@REDACTED'},
{nodedown,'fakenode1-16-29-25@REDACTED'},
{nodeup,'gc@REDACTED'},
{nodedown,'gc@REDACTED'},
{nodeup,'fakenode2-16-36-53@REDACTED'},
{nodeup,'gc@REDACTED'},
{nodedown,'gc@REDACTED'},
{nodeup,'gc@REDACTED'},
{nodedown,'gc@REDACTED'},
{nodeup,'gc@REDACTED'},
{nodedown,'gc@REDACTED'},
{nodeup,'gc@REDACTED'},
{nodedown,'gc@REDACTED'},
{nodeup,...},
{...}|...]},
{links,[<6749.62.0>,<6753.62.0>,<0.111.0>,<0.22677.260>,
<6752.104.0>,<6747.62.0>,<6748.62.0>,<0.61.0>,<6751.62.0>,
<6750.62.0>,<0.52.0>]},
{dictionary,[{'$ancestors',[mnesia_kernel_sup,mnesia_sup,
<0.58.0>]},
{'$initial_call',{gen,init_it,
[gen_server,<0.61.0>,<0.61.0>,
{local,mnesia_monitor},
mnesia_monitor,
[<0.61.0>],
[{timeout,infinity}]]}}]},
{trap_exit,true},
{error_handler,error_handler},
{priority,normal},
{group_leader,<0.57.0>},
{total_heap_size,377},
{heap_size,377},
{stack_size,20},
{reductions,2326000},
{garbage_collection,[{fullsweep_after,0},{minor_gcs,0}]},
{suspending,[]}]
We didn't take a backtrace of mnesia_monitor, but
{current_function,{disk_log,monitor_request,2}} led us to think that it
was trying to close the decision_tab log file, so we tried to find out
which process owned that log. At this point disk_log:info(decision_tab)
also hung, so we tried disk_log_server:get_log_pids(decision_tab), which
gave us {local,<0.22681.260>}.
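If we get another chance at this, mnesia_monitor's gen_server state could
also be fetched with a finite timeout, so that the shell cannot hang on
it; a sketch (untried here):

  %% Ask for mnesia_monitor's gen_server status, but give up after 5
  %% seconds if it is wedged and not servicing system messages.
  sys:get_status(mnesia_monitor, 5000).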
Backtrace <0.22681.260>:
Program counter: 0x00007f61c0e0c2a8 (gen:wait_resp_mon/3 + 64)
CP: 0x00007f61c43645d8 (gen_server:call/3 + 160)
arity = 0
0x00007f60f75818a0 Return addr 0x00007f61c43645d8 (gen_server:call/3 + 160)
y(0) infinity
y(1) #Ref<0.0.992.227035>
y(2) 'ejabberd@REDACTED'
0x00007f60f75818c0 Return addr 0x00007f61c03b1610 (disk_log:do_exit/4 + 440)
y(0) infinity
y(1) {close,<0.22681.260>}
y(2) disk_log_server
y(3) Catch 0x00007f61c43645d8 (gen_server:call/3 + 160)
0x00007f60f75818e8 Return addr 0x00007f61c0e23fe8 (proc_lib:init_p/5 + 400)
y(0) normal
y(1) []
y(2) <0.62.0>
y(3) ok
0x00007f60f7581910 Return addr 0x000000000084bd18 (<terminate process normally>)
y(0) Catch 0x00007f61c0e24008 (proc_lib:init_p/5 + 432)
y(1) disk_log
y(2) init
y(3) [<0.70.0>,<0.71.0>]
So the disk_log process for 'decision_tab' was itself waiting for a reply
from the disk_log_server to gen_server:call(disk_log_server,
{close, self()}).
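At this point the chain of blocked calls was: the mnesia dump worker
waits on mnesia_monitor, mnesia_monitor appears to be waiting on the
decision_tab disk_log process (<0.22681.260>), and that process waits on
the disk_log_server. Since a caller blocked in gen:wait_resp_mon/3 holds
a monitor on its callee, a chain like this can be followed mechanically;
a rough sketch (the self-passing fun is only there to allow recursion in
the shell, and the walk stops at the first process blocked somewhere
other than a plain gen_server call - mnesia_monitor, sitting in
disk_log:monitor_request/2, needs a manual look):

  %% Follow a chain of blocked gen_server calls: each caller stuck in
  %% gen:wait_resp_mon/3 monitors the process it is waiting for.
  Walk = fun(W, Pid, Acc) ->
             case erlang:process_info(Pid, [current_function, monitors]) of
                 [{current_function, {gen, wait_resp_mon, 3}},
                  {monitors, [{process, Next} | _]}] when is_pid(Next) ->
                     W(W, Next, [Pid | Acc]);
                 _ ->
                     lists:reverse([Pid | Acc])
             end
         end,
  Walk(Walk, pid(0,22676,260), []).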
Backtrace disk_log_server:
Program counter: 0x00007f61c4365af8 (gen_server:loop/6 + 288)
CP: 0x00007f61c0e23fe8 (proc_lib:init_p/5 + 400)
arity = 0
0x00007f60fb043f78 Return addr 0x00007f61c0e23fe8 (proc_lib:init_p/5 + 400)
y(0) []
y(1) infinity
y(2) disk_log_server
y(3) {state,[]}
y(4) disk_log_server
y(5) <0.30.0>
0x00007f60fb043fb0 Return addr 0x000000000084bd18 (<terminate process normally>)
y(0) Catch 0x00007f61c0e24008 (proc_lib:init_p/5 + 432)
y(1) gen
y(2) init_it
y(3) [gen_server,<0.30.0>,<0.30.0>,{local,disk_log_server},disk_log_server,[],[]]
process_info(whereis(disk_log_server)) =>
[{registered_name,disk_log_server},
{current_function,{gen_server,loop,6}},
{initial_call,{proc_lib,init_p,5}},
{status,waiting},
{message_queue_len,1},
{messages,[{'$gen_call',{<0.22681.260>,#Ref<0.0.992.227035>},
{close,<0.22681.260>}}]},
{links,[<0.111.0>,<0.22677.260>,<0.22681.260>,<0.30.0>]},
{dictionary,[{<0.111.0>,latest_log},
{<0.22677.260>,previous_log},
{'$ancestors',[kernel_safe_sup,kernel_sup,<0.8.0>]},
{<0.22681.260>,decision_tab},
{'$initial_call',{gen,init_it,
[gen_server,<0.30.0>,<0.30.0>,
{local,disk_log_server},
disk_log_server,[],[]]}}]},
{trap_exit,true},
{error_handler,error_handler},
{priority,normal},
{group_leader,<0.7.0>},
{total_heap_size,246},
{heap_size,233},
{stack_size,12},
{reductions,2366165},
{garbage_collection,[{fullsweep_after,0},{minor_gcs,0}]},
{suspending,[]}]
The disk_log_server therefore appears to be doing something impossible:
it is blocked in the receive in gen_server:loop/6 while a valid
'$gen_call' message sits in its queue. We used process_info to check its
reductions a couple of times, and they stayed at 2366165 over a period of
at least a minute, so the process does not appear to have been scheduled
at all.
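If this happens again, the "not being scheduled" check can be made
repeatable with something like the following (a sketch; the 60-second
sleep is arbitrary):

  %% Sample the reduction count twice; a truly stuck process reports
  %% the same count (2366165 in this session) after the sleep.
  Stuck = fun(Pid, Ms) ->
              {reductions, R1} = erlang:process_info(Pid, reductions),
              timer:sleep(Ms),
              {reductions, R2} = erlang:process_info(Pid, reductions),
              {R1, R2, R1 =:= R2}
          end,
  Stuck(whereis(disk_log_server), 60000).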
This line of investigation is all we have, as the server has since been
restarted.