[erlang-bugs] R12B-3/64bit/smp Stuck disk_log_server
Dan Gudmundsson
dgud@REDACTED
Thu Jan 8 15:55:58 CET 2009
I have seen it on erl-q (most of us devs read that list too).
Your reasoning seems valid, and currently I don't have any more ideas.
I have asked the emulator guys to take a look.
/Dan "Mnesia" G
Geoff Cant wrote:
> Hi all, Mats Cronqvist suggested I take this one up on erlang-bugs. What
> follows is a rough transcript of a debugging session in which we suspect
> that the reason an ejabberd node cannot dump mnesia logs is due to the
> disk_log_server process being impossibly stuck in gen_server:loop/6.
>
> It would be good if someone could confirm that my reasoning is correct (or
> at least plausible): that the disk_log_server is stuck, that this is why
> mnesia cannot dump its logs, and that the way in which it is stuck appears
> to be impossible.
>
> The client on whose cluster this occurred has seen this problem before,
> so we may get another chance at live debugging sometime in the near
> future.
>
> I would greatly appreciate any suggestions as to additional debugging
> techniques I could try if this problem recurs.
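>
> The only extra idea we have so far is to put a message and scheduling trace
> on the suspect process the next time this happens; a rough, untested sketch:
>
>     dbg:tracer(),                                   %% start a trace collector
>     dbg:p(whereis(disk_log_server), [m, running]).  %% trace messages and scheduling in/out
>
> Better ideas are very welcome.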
>
> Thank you,
> --Geoff Cant
>
>
> The erlang version information is "Erlang (BEAM) emulator version 5.6.3
> [source] [64-bit] [smp:8] [async-threads:0] [hipe] [kernel-poll:false]"
> - the stock Debian erlang-hipe-base from lenny on amd64 hardware.
>
> (in the transcript nodenames and file paths have been slightly but
> consistently rewritten to obscure some private network information)
>
> We tried mnesia:dump_log(), which hung, so we set about figuring out why.
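>
> As an aside, next time we will probably wrap such calls so a hang does not
> also tie up the shell; a rough, untested sketch:
>
>     %% run the dump in a throwaway process so a hang doesn't block the shell
>     {Pid, Ref} = spawn_monitor(fun() -> exit(mnesia:dump_log()) end),
>     receive {'DOWN', Ref, process, Pid, Result} -> Result
>     after 30000 -> {still_running, Pid}
>     end.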
>
> mnesia_controller:get_workers(2000) => {workers,[],[],<0.22676.260>}
>
> process_info(<0.22676.260>) =>
> [{current_function,{gen,wait_resp_mon,3}},
> {initial_call,{mnesia_controller,dump_and_reply,2}},
> {status,waiting},
> {message_queue_len,0},
> {messages,[]},
> {links,[<0.116.0>]},
> {dictionary,[]},
> {trap_exit,false},
> {error_handler,error_handler},
> {priority,normal},
> {group_leader,<0.57.0>},
> {total_heap_size,233},
> {heap_size,233},
> {stack_size,21},
> {reductions,4311},
> {garbage_collection,[{fullsweep_after,0},{minor_gcs,0}]},
> {suspending,[]}]
>
> Backtrace <0.22676.260>:
> Program counter: 0x00007f61c0e0c2a8 (gen:wait_resp_mon/3 + 64)
> CP: 0x00007f61c43645d8 (gen_server:call/3 + 160)
> arity = 0
>
> 0x00007f60f844c108 Return addr 0x00007f61c43645d8 (gen_server:call/3 + 160)
> y(0) infinity
> y(1) #Ref<0.0.992.227032>
> y(2) 'ejabberd@REDACTED'
>
> 0x00007f60f844c128 Return addr 0x00007f61c049f1d0 (mnesia_log:save_decision_tab/1 + 248)
> y(0) infinity
> y(1) {close_log,decision_tab}
> y(2) <0.62.0>
> y(3) Catch 0x00007f61c43645d8 (gen_server:call/3 + 160)
>
> 0x00007f60f844c150 Return addr 0x00007f61c03c6ec8 (mnesia_dumper:perform_dump/2 + 1648)
> y(0) "/fake/path/ejabberd/DECISION_TAB.TMP"
> y(1) []
>
> 0x00007f60f844c168 Return addr 0x00007f61c056bdd0 (mnesia_controller:dump_and_reply/2 + 152)
> y(0) []
> y(1) []
> y(2) []
> y(3) 15
> y(4) []
> y(5) []
>
> 0x00007f60f844c1a0 Return addr 0x000000000084bd18 (<terminate process normally>)
> y(0) <0.116.0>
>
> Here the log dumping process appears to be waiting on
> gen_server:call(mnesia_monitor, {close_log,decision_tab}).
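>
> Spelled out from the frames above (a reconstruction, not the actual mnesia
> source; the registered name stands in for the pid <0.62.0> seen in y(2)),
> the blocked call is effectively:
>
>     %% y(0) is the timeout, so this can only return when mnesia_monitor replies
>     gen_server:call(mnesia_monitor, {close_log, decision_tab}, infinity).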
>
> process_info(<0.62.0>) =>
> [{registered_name,mnesia_monitor},
> {current_function,{disk_log,monitor_request,2}},
> {initial_call,{proc_lib,init_p,5}},
> {status,waiting},
> {message_queue_len,34},
> {messages,[{nodeup,'gc@REDACTED'},
> {nodedown,'gc@REDACTED'},
> {nodeup,'fakenode1-16-26-06@REDACTED'},
> {nodedown,'fakenode1-16-26-06@REDACTED'},
> {nodeup,'fakenode1-16-27-20@REDACTED'},
> {nodedown,'fakenode1-16-27-20@REDACTED'},
> {nodeup,'fakenode1-16-29-25@REDACTED'},
> {nodedown,'fakenode1-16-29-25@REDACTED'},
> {nodeup,'gc@REDACTED'},
> {nodedown,'gc@REDACTED'},
> {nodeup,'fakenode2-16-36-53@REDACTED'},
> {nodeup,'gc@REDACTED'},
> {nodedown,'gc@REDACTED'},
> {nodeup,'gc@REDACTED'},
> {nodedown,'gc@REDACTED'},
> {nodeup,'gc@REDACTED'},
> {nodedown,'gc@REDACTED'},
> {nodeup,'gc@REDACTED'},
> {nodedown,'gc@REDACTED'},
> {nodeup,...},
> {...}|...]},
> {links,[<6749.62.0>,<6753.62.0>,<0.111.0>,<0.22677.260>,
> <6752.104.0>,<6747.62.0>,<6748.62.0>,<0.61.0>,<6751.62.0>,
> <6750.62.0>,<0.52.0>]},
> {dictionary,[{'$ancestors',[mnesia_kernel_sup,mnesia_sup,
> <0.58.0>]},
> {'$initial_call',{gen,init_it,
> [gen_server,<0.61.0>,<0.61.0>,
> {local,mnesia_monitor},
> mnesia_monitor,
> [<0.61.0>],
> [{timeout,infinity}]]}}]},
> {trap_exit,true},
> {error_handler,error_handler},
> {priority,normal},
> {group_leader,<0.57.0>},
> {total_heap_size,377},
> {heap_size,377},
> {stack_size,20},
> {reductions,2326000},
> {garbage_collection,[{fullsweep_after,0},{minor_gcs,0}]},
> {suspending,[]}]
>
> We didn't take a backtrace of mnesia_monitor, but
> {current_function,{disk_log,monitor_request,2}} led us to think that
> mnesia_monitor was trying to close the decision_tab log file, so we tried
> to find the disk_log process handling that log. At this point,
> disk_log:info(decision_tab) hung, so we tried
> disk_log_server:get_log_pids(decision_tab), which gave us
> {local,<0.22681.260>}.
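>
> For reference, a backtrace in the format quoted below can also be pulled
> from a live process with erlang:process_display/2, e.g.:
>
>     %% writes a backtrace of the decision_tab disk_log process to standard error
>     erlang:process_display(list_to_pid("<0.22681.260>"), backtrace).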
>
> Backtrace <0.22681.260>:
> Program counter: 0x00007f61c0e0c2a8 (gen:wait_resp_mon/3 + 64)
> CP: 0x00007f61c43645d8 (gen_server:call/3 + 160)
> arity = 0
>
> 0x00007f60f75818a0 Return addr 0x00007f61c43645d8 (gen_server:call/3 + 160)
> y(0) infinity
> y(1) #Ref<0.0.992.227035>
> y(2) 'ejabberd@REDACTED'
>
> 0x00007f60f75818c0 Return addr 0x00007f61c03b1610 (disk_log:do_exit/4 + 440)
> y(0) infinity
> y(1) {close,<0.22681.260>}
> y(2) disk_log_server
> y(3) Catch 0x00007f61c43645d8 (gen_server:call/3 + 160)
>
> 0x00007f60f75818e8 Return addr 0x00007f61c0e23fe8 (proc_lib:init_p/5 + 400)
> y(0) normal
> y(1) []
> y(2) <0.62.0>
> y(3) ok
>
> 0x00007f60f7581910 Return addr 0x000000000084bd18 (<terminate process normally>)
> y(0) Catch 0x00007f61c0e24008 (proc_lib:init_p/5 + 432)
> y(1) disk_log
> y(2) init
> y(3) [<0.70.0>,<0.71.0>]
>
> The disk_log process for 'decision_tab' was waiting for a reply from the
> disk_log_server to gen_server:call(disk_log_server, {close, self()}).
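>
> One way to cross-check that this request actually reached the server is to
> look for the matching '$gen_call' in its message queue (a sketch; the pid
> literal is the decision_tab disk_log process from this session):
>
>     From = list_to_pid("<0.22681.260>"),
>     {messages, Msgs} = process_info(whereis(disk_log_server), messages),
>     [M || {'$gen_call', {P, _Ref}, {close, P}} = M <- Msgs, P =:= From].
>
> As the process_info dump further down shows, the request is indeed sitting
> there unanswered.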
>
> Backtrace disk_log_server:
> Program counter: 0x00007f61c4365af8 (gen_server:loop/6 + 288)
> CP: 0x00007f61c0e23fe8 (proc_lib:init_p/5 + 400)
> arity = 0
>
> 0x00007f60fb043f78 Return addr 0x00007f61c0e23fe8 (proc_lib:init_p/5 + 400)
> y(0) []
> y(1) infinity
> y(2) disk_log_server
> y(3) {state,[]}
> y(4) disk_log_server
> y(5) <0.30.0>
>
> 0x00007f60fb043fb0 Return addr 0x000000000084bd18 (<terminate process normally>)
> y(0) Catch 0x00007f61c0e24008 (proc_lib:init_p/5 + 432)
> y(1) gen
> y(2) init_it
> y(3)
> [gen_server,<0.30.0>,<0.30.0>,{local,disk_log_server},disk_log_server,[],[]]
>
> process_info(whereis(disk_log_server)) =>
> [{registered_name,disk_log_server},
> {current_function,{gen_server,loop,6}},
> {initial_call,{proc_lib,init_p,5}},
> {status,waiting},
> {message_queue_len,1},
> {messages,[{'$gen_call',{<0.22681.260>,#Ref<0.0.992.227035>},
> {close,<0.22681.260>}}]},
> {links,[<0.111.0>,<0.22677.260>,<0.22681.260>,<0.30.0>]},
> {dictionary,[{<0.111.0>,latest_log},
> {<0.22677.260>,previous_log},
> {'$ancestors',[kernel_safe_sup,kernel_sup,<0.8.0>]},
> {<0.22681.260>,decision_tab},
> {'$initial_call',{gen,init_it,
> [gen_server,<0.30.0>,<0.30.0>,
> {local,disk_log_server},
> disk_log_server,[],[]]}}]},
> {trap_exit,true},
> {error_handler,error_handler},
> {priority,normal},
> {group_leader,<0.7.0>},
> {total_heap_size,246},
> {heap_size,233},
> {stack_size,12},
> {reductions,2366165},
> {garbage_collection,[{fullsweep_after,0},{minor_gcs,0}]},
> {suspending,[]}]
>
> The disk_log_server therefore appears to be doing something impossible: it
> is blocked in the receive in gen_server:loop/6 while a valid message sits
> in its queue. We used process_info to check its reductions a couple of
> times, but they stayed at 2366165 over a period of at least a minute.
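>
> The check was essentially the following, with a pause in between (a sketch
> of its shape rather than a verbatim transcript):
>
>     Pid = whereis(disk_log_server),
>     Before = process_info(Pid, [reductions, message_queue_len, current_function]),
>     timer:sleep(60000),
>     After = process_info(Pid, [reductions, message_queue_len, current_function]),
>     {Before, After}.
>
> Identical reduction counts after a minute, with the message still queued, is
> what makes us think the process is genuinely stuck rather than merely slow.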
>
> This line of investigation is all we have, as the server has now been
> restarted.
>
> _______________________________________________
> erlang-bugs mailing list
> erlang-bugs@REDACTED
> http://www.erlang.org/mailman/listinfo/erlang-bugs
>