[erlang-bugs] R12B-3/64bit/smp Stuck disk_log_server

Dan Gudmundsson dgud@REDACTED
Thu Jan 8 15:55:58 CET 2009


I have seen it on erl-q (most of us devs read that list too);
your reasoning seems valid, and currently I don't have any more ideas.

I have asked the emulator guys to take a look.

/Dan "Mnesia" G

Geoff Cant wrote:
> Hi all, Mats Cronqvist suggested I take this one up on erlang-bugs. What
> follows is a rough transcript of a debugging session in which we came to
> suspect that an ejabberd node cannot dump its mnesia logs because the
> disk_log_server process is stuck, seemingly impossibly, in gen_server:loop/6.
> 
> It would be good if someone could confirm that my reasoning is correct
> (or at least plausible): that the disk_log_server is stuck, that this is
> why mnesia can't dump its logs, and that the way it is stuck appears to
> be impossible.
> 
> The client on whose cluster this occurred has seen this problem before,
> so we may get another chance at live debugging sometime in the near
> future.
> 
> I would greatly appreciate any suggestions as to additional debugging
> techniques I could try if this problem recurs.
> 
> Thank you,
> --Geoff Cant
> 
> 
> The erlang version information is "Erlang (BEAM) emulator version 5.6.3
> [source] [64-bit] [smp:8] [async-threads:0] [hipe] [kernel-poll:false]"
>  - the stock Debian erlang-hipe-base from lenny on amd64 hardware.
> 
> (in the transcript nodenames and file paths have been slightly but
> consistently rewritten to obscure some private network information)
> 
> We tried mnesia:dump_log(), which hung, so we tried to figure out why.
> 
> mnesia_controller:get_workers(2000) => {workers,[],[],<0.22676.260>}
> 
> process_info(<0.22676.260>) =>
> [{current_function,{gen,wait_resp_mon,3}},
>  {initial_call,{mnesia_controller,dump_and_reply,2}},
>  {status,waiting},
>  {message_queue_len,0},
>  {messages,[]},
>  {links,[<0.116.0>]},
>  {dictionary,[]},
>  {trap_exit,false},
>  {error_handler,error_handler},
>  {priority,normal},
>  {group_leader,<0.57.0>},
>  {total_heap_size,233},
>  {heap_size,233},
>  {stack_size,21},
>  {reductions,4311},
>  {garbage_collection,[{fullsweep_after,0},{minor_gcs,0}]},
>  {suspending,[]}]
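> 
> For reference, per-process backtraces like the ones in this message can
> be obtained with erlang:process_display(Pid, backtrace), which writes
> the stack (program counter, CP, return addresses and y-registers) to
> the emulator's standard error. Roughly, for the dump worker found
> above (matching the tuple shape returned by get_workers earlier):
> 
>   {workers, _, _, Dumper} = mnesia_controller:get_workers(2000),
>   erlang:process_display(Dumper, backtrace).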
> 
> Backtrace <0.22676.260>:
> Program counter: 0x00007f61c0e0c2a8 (gen:wait_resp_mon/3 + 64)
> CP: 0x00007f61c43645d8 (gen_server:call/3 + 160)
> arity = 0
> 
> 0x00007f60f844c108 Return addr 0x00007f61c43645d8 (gen_server:call/3 + 160)
> y(0)     infinity
> y(1)     #Ref<0.0.992.227032>
> y(2)     'ejabberd@REDACTED'
> 
> 0x00007f60f844c128 Return addr 0x00007f61c049f1d0 (mnesia_log:save_decision_tab/1 + 248)
> y(0)     infinity
> y(1)     {close_log,decision_tab}
> y(2)     <0.62.0>
> y(3)     Catch 0x00007f61c43645d8 (gen_server:call/3 + 160)
> 
> 0x00007f60f844c150 Return addr 0x00007f61c03c6ec8 (mnesia_dumper:perform_dump/2 + 1648)
> y(0)     "/fake/path/ejabberd/DECISION_TAB.TMP"
> y(1)     []
> 
> 0x00007f60f844c168 Return addr 0x00007f61c056bdd0 (mnesia_controller:dump_and_reply/2 + 152)
> y(0)     []
> y(1)     []
> y(2)     []
> y(3)     15
> y(4)     []
> y(5)     []
> 
> 0x00007f60f844c1a0 Return addr 0x000000000084bd18 (<terminate process normally>)
> y(0)     <0.116.0>
> 
> Here the log dumping process appears to be waiting on
> gen_server:call(mnesia_monitor, {close_log,decision_tab}).
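> 
> As a sketch of how to double-check the callee: the pid <0.62.0> in y(2)
> of the gen_server:call/3 frame should be the registered mnesia_monitor,
> and sys:get_status/2 with a short timeout is a cheap way to see whether
> that server is still answering anything at all (it exits with a timeout
> if the server is wedged inside a handle_call):
> 
>   whereis(mnesia_monitor),                    %% expected to be <0.62.0>
>   catch sys:get_status(mnesia_monitor, 1000).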
> 
> process_info(<0.62.0>) =>
> [{registered_name,mnesia_monitor},
>  {current_function,{disk_log,monitor_request,2}},
>  {initial_call,{proc_lib,init_p,5}},
>  {status,waiting},
>  {message_queue_len,34},
>  {messages,[{nodeup,'gc@REDACTED'},
>             {nodedown,'gc@REDACTED'},
>             {nodeup,'fakenode1-16-26-06@REDACTED'},
>             {nodedown,'fakenode1-16-26-06@REDACTED'},
>             {nodeup,'fakenode1-16-27-20@REDACTED'},
>             {nodedown,'fakenode1-16-27-20@REDACTED'},
>             {nodeup,'fakenode1-16-29-25@REDACTED'},
>             {nodedown,'fakenode1-16-29-25@REDACTED'},
>             {nodeup,'gc@REDACTED'},
>             {nodedown,'gc@REDACTED'},
>             {nodeup,'fakenode2-16-36-53@REDACTED'},
>             {nodeup,'gc@REDACTED'},
>             {nodedown,'gc@REDACTED'},
>             {nodeup,'gc@REDACTED'},
>             {nodedown,'gc@REDACTED'},
>             {nodeup,'gc@REDACTED'},
>             {nodedown,'gc@REDACTED'},
>             {nodeup,'gc@REDACTED'},
>             {nodedown,'gc@REDACTED'},
>             {nodeup,...},
>             {...}|...]},
>  {links,[<6749.62.0>,<6753.62.0>,<0.111.0>,<0.22677.260>,
>          <6752.104.0>,<6747.62.0>,<6748.62.0>,<0.61.0>,<6751.62.0>,
>          <6750.62.0>,<0.52.0>]},
>  {dictionary,[{'$ancestors',[mnesia_kernel_sup,mnesia_sup,
>                              <0.58.0>]},
>               {'$initial_call',{gen,init_it,
>                                     [gen_server,<0.61.0>,<0.61.0>,
>                                      {local,mnesia_monitor},
>                                      mnesia_monitor,
>                                      [<0.61.0>],
>                                      [{timeout,infinity}]]}}]},
>  {trap_exit,true},
>  {error_handler,error_handler},
>  {priority,normal},
>  {group_leader,<0.57.0>},
>  {total_heap_size,377},
>  {heap_size,377},
>  {stack_size,20},
>  {reductions,2326000},
>  {garbage_collection,[{fullsweep_after,0},{minor_gcs,0}]},
>  {suspending,[]}]
> 
> We didn't take a backtrace of mnesia_monitor, but
> {current_function,{disk_log,monitor_request,2}} led us to think that
> mnesia_monitor was trying to close the decision_tab log file, so we
> tried to find out which process owned that log. At this point,
> disk_log:info(decision_tab) hung, so we tried
> disk_log_server:get_log_pids(decision_tab), which gave us
> {local,<0.22681.260>}.
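> 
> In shell terms, that lookup plus the backtrace below amount to roughly
> this sketch (get_log_pids bypasses the disk_log call interface, which
> was hanging):
> 
>   {local, LogPid} = disk_log_server:get_log_pids(decision_tab),
>   erlang:process_display(LogPid, backtrace).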
> 
> Backtrace <0.22681.260>:
> Program counter: 0x00007f61c0e0c2a8 (gen:wait_resp_mon/3 + 64)
> CP: 0x00007f61c43645d8 (gen_server:call/3 + 160)
> arity = 0
> 
> 0x00007f60f75818a0 Return addr 0x00007f61c43645d8 (gen_server:call/3 + 160)
> y(0)     infinity
> y(1)     #Ref<0.0.992.227035>
> y(2)     'ejabberd@REDACTED'
> 
> 0x00007f60f75818c0 Return addr 0x00007f61c03b1610 (disk_log:do_exit/4 + 440)
> y(0)     infinity
> y(1)     {close,<0.22681.260>}
> y(2)     disk_log_server
> y(3)     Catch 0x00007f61c43645d8 (gen_server:call/3 + 160)
> 
> 0x00007f60f75818e8 Return addr 0x00007f61c0e23fe8 (proc_lib:init_p/5 + 400)
> y(0)     normal
> y(1)     []
> y(2)     <0.62.0>
> y(3)     ok
> 
> 0x00007f60f7581910 Return addr 0x000000000084bd18 (<terminate process normally>)
> y(0)     Catch 0x00007f61c0e24008 (proc_lib:init_p/5 + 432)
> y(1)     disk_log
> y(2)     init
> y(3)     [<0.70.0>,<0.71.0>]
> 
> The disk_log process for 'decision_tab' was waiting for a reply from the
> disk_log_server to gen_server:call(disk_log_server, {close, self()}).
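> 
> To look at the other end of that call, process_info/2 with an explicit
> item list keeps the output manageable on a registered process; roughly
> how the process_info dump further below was gathered:
> 
>   process_info(whereis(disk_log_server),
>                [registered_name, current_function, status,
>                 message_queue_len, messages, reductions]).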
> 
> Backtrace disk_log_server:
> Program counter: 0x00007f61c4365af8 (gen_server:loop/6 + 288)
> CP: 0x00007f61c0e23fe8 (proc_lib:init_p/5 + 400)
> arity = 0
> 
> 0x00007f60fb043f78 Return addr 0x00007f61c0e23fe8 (proc_lib:init_p/5 + 400)
> y(0)     []
> y(1)     infinity
> y(2)     disk_log_server
> y(3)     {state,[]}
> y(4)     disk_log_server
> y(5)     <0.30.0>
> 
> 0x00007f60fb043fb0 Return addr 0x000000000084bd18 (<terminate process normally>)
> y(0)     Catch 0x00007f61c0e24008 (proc_lib:init_p/5 + 432)
> y(1)     gen
> y(2)     init_it
> y(3)
> [gen_server,<0.30.0>,<0.30.0>,{local,disk_log_server},disk_log_server,[],[]]
> 
> process_info(whereis(disk_log_server)) =>
> [{registered_name,disk_log_server},
>  {current_function,{gen_server,loop,6}},
>  {initial_call,{proc_lib,init_p,5}},
>  {status,waiting},
>  {message_queue_len,1},
>  {messages,[{'$gen_call',{<0.22681.260>,#Ref<0.0.992.227035>},
>                          {close,<0.22681.260>}}]},
>  {links,[<0.111.0>,<0.22677.260>,<0.22681.260>,<0.30.0>]},
>  {dictionary,[{<0.111.0>,latest_log},
>               {<0.22677.260>,previous_log},
>               {'$ancestors',[kernel_safe_sup,kernel_sup,<0.8.0>]},
>               {<0.22681.260>,decision_tab},
>               {'$initial_call',{gen,init_it,
>                                     [gen_server,<0.30.0>,<0.30.0>,
>                                      {local,disk_log_server},
>                                      disk_log_server,[],[]]}}]},
>  {trap_exit,true},
>  {error_handler,error_handler},
>  {priority,normal},
>  {group_leader,<0.7.0>},
>  {total_heap_size,246},
>  {heap_size,233},
>  {stack_size,12},
>  {reductions,2366165},
>  {garbage_collection,[{fullsweep_after,0},{minor_gcs,0}]},
>  {suspending,[]}]
> 
> The disk_log_server appears to be doing something impossible: it is
> blocked in the receive statement in gen_server:loop/6 while a valid
> message sits in its queue. We used process_info to check the reductions
> a couple of times, and they stayed at 2366165 over a period of at least
> a minute.
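> 
> The reduction check was roughly the following sketch; an unchanged
> reduction count together with status 'waiting' and a non-empty message
> queue is what makes us think the process is genuinely parked in the
> receive rather than merely busy:
> 
>   Sample = fun() ->
>                process_info(whereis(disk_log_server),
>                             [status, reductions, message_queue_len])
>            end,
>   [begin timer:sleep(30000), Sample() end || _ <- lists:seq(1, 3)].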
> 
> This line of investigation is all we have, as the server has since been
> restarted.
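> 
> If it does recur, capturing something like the following before the
> node is restarted should give more to go on (a sketch; the pid list is
> whatever the same digging as above turns up, illustrated here with the
> two registered processes involved):
> 
>   Dump = fun(P) ->
>              io:format("~p~n~p~n", [P, process_info(P)]),
>              erlang:process_display(P, backtrace)
>          end,
>   lists:foreach(Dump, [whereis(disk_log_server), whereis(mnesia_monitor)]).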
> 


