[erlang-bugs] R12B-3/64bit/smp Stuck disk_log_server

Geoff Cant geoff.cant@REDACTED
Thu Jan 8 15:36:16 CET 2009


Hi all, Mats Cronqvist suggested I take this one up on erlang-bugs. What
follows is a rough transcript of a debugging session in which we came to
suspect that an ejabberd node cannot dump its mnesia logs because the
disk_log_server process is stuck, seemingly impossibly, in
gen_server:loop/6.

It would be good if someone could confirm that my reasoning is correct
(or at least plausible): that the disk_log_server is stuck, that this is
why mnesia cannot dump its logs, and that the way it is stuck appears to
be impossible.

The client on whose cluster this occurred has seen this problem before,
so we may get another chance at live debugging sometime in the near
future.

I would greatly appreciate any suggestions as to additional debugging
techniques I could try if this problem recurs.

Thank you,
--Geoff Cant


The erlang version information is "Erlang (BEAM) emulator version 5.6.3
[source] [64-bit] [smp:8] [async-threads:0] [hipe] [kernel-poll:false]"
 - the stock Debian erlang-hipe-base from lenny on amd64 hardware.

(In the transcript, nodenames and file paths have been slightly but
consistently rewritten to obscure some private network information.)

We tried mnesia:dump_log(), which hung, so we tried to figure out why.

mnesia_controller:get_workers(2000) => {workers,[],[],<0.22676.260>}

process_info(<0.22676.260>) =>
[{current_function,{gen,wait_resp_mon,3}},
 {initial_call,{mnesia_controller,dump_and_reply,2}},
 {status,waiting},
 {message_queue_len,0},
 {messages,[]},
 {links,[<0.116.0>]},
 {dictionary,[]},
 {trap_exit,false},
 {error_handler,error_handler},
 {priority,normal},
 {group_leader,<0.57.0>},
 {total_heap_size,233},
 {heap_size,233},
 {stack_size,21},
 {reductions,4311},
 {garbage_collection,[{fullsweep_after,0},{minor_gcs,0}]},
 {suspending,[]}]

Backtrace <0.22676.260>:
Program counter: 0x00007f61c0e0c2a8 (gen:wait_resp_mon/3 + 64)
CP: 0x00007f61c43645d8 (gen_server:call/3 + 160)
arity = 0

0x00007f60f844c108 Return addr 0x00007f61c43645d8 (gen_server:call/3 + 160)
y(0)     infinity
y(1)     #Ref<0.0.992.227032>
y(2)     'ejabberd@REDACTED'

0x00007f60f844c128 Return addr 0x00007f61c049f1d0 (mnesia_log:save_decision_tab/1 + 248)
y(0)     infinity
y(1)     {close_log,decision_tab}
y(2)     <0.62.0>
y(3)     Catch 0x00007f61c43645d8 (gen_server:call/3 + 160)

0x00007f60f844c150 Return addr 0x00007f61c03c6ec8 (mnesia_dumper:perform_dump/2 + 1648)
y(0)     "/fake/path/ejabberd/DECISION_TAB.TMP"
y(1)     []

0x00007f60f844c168 Return addr 0x00007f61c056bdd0 (mnesia_controller:dump_and_reply/2 + 152)
y(0)     []
y(1)     []
y(2)     []
y(3)     15
y(4)     []
y(5)     []

0x00007f60f844c1a0 Return addr 0x000000000084bd18 (<terminate process normally>)
y(0)     <0.116.0>

Here the log-dumping process appears to be blocked in
gen_server:call(mnesia_monitor, {close_log,decision_tab}).
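
(As an aside, backtraces like the ones in this transcript can be
obtained with standard BIFs, roughly as below; the pid is the dumper
process from above, and this is a sketch rather than exactly what we
typed at the time.)

Pid = list_to_pid("<0.22676.260>"),
erlang:process_display(Pid, backtrace),         %% prints a backtrace like the ones shown here
{backtrace, Bin} = erlang:process_info(Pid, backtrace),
io:format("~s~n", [Bin]).                       %% the same information as a binary, for logging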

process_info(<0.62.0>) =>
[{registered_name,mnesia_monitor},
 {current_function,{disk_log,monitor_request,2}},
 {initial_call,{proc_lib,init_p,5}},
 {status,waiting},
 {message_queue_len,34},
 {messages,[{nodeup,'gc@REDACTED'},
            {nodedown,'gc@REDACTED'},
            {nodeup,'fakenode1-16-26-06@REDACTED'},
            {nodedown,'fakenode1-16-26-06@REDACTED'},
            {nodeup,'fakenode1-16-27-20@REDACTED'},
            {nodedown,'fakenode1-16-27-20@REDACTED'},
            {nodeup,'fakenode1-16-29-25@REDACTED'},
            {nodedown,'fakenode1-16-29-25@REDACTED'},
            {nodeup,'gc@REDACTED'},
            {nodedown,'gc@REDACTED'},
            {nodeup,'fakenode2-16-36-53@REDACTED'},
            {nodeup,'gc@REDACTED'},
            {nodedown,'gc@REDACTED'},
            {nodeup,'gc@REDACTED'},
            {nodedown,'gc@REDACTED'},
            {nodeup,'gc@REDACTED'},
            {nodedown,'gc@REDACTED'},
            {nodeup,'gc@REDACTED'},
            {nodedown,'gc@REDACTED'},
            {nodeup,...},
            {...}|...]},
 {links,[<6749.62.0>,<6753.62.0>,<0.111.0>,<0.22677.260>,
         <6752.104.0>,<6747.62.0>,<6748.62.0>,<0.61.0>,<6751.62.0>,
         <6750.62.0>,<0.52.0>]},
 {dictionary,[{'$ancestors',[mnesia_kernel_sup,mnesia_sup,
                             <0.58.0>]},
              {'$initial_call',{gen,init_it,
                                    [gen_server,<0.61.0>,<0.61.0>,
                                     {local,mnesia_monitor},
                                     mnesia_monitor,
                                     [<0.61.0>],
                                     [{timeout,infinity}]]}}]},
 {trap_exit,true},
 {error_handler,error_handler},
 {priority,normal},
 {group_leader,<0.57.0>},
 {total_heap_size,377},
 {heap_size,377},
 {stack_size,20},
 {reductions,2326000},
 {garbage_collection,[{fullsweep_after,0},{minor_gcs,0}]},
 {suspending,[]}]

We didn't take a backtrace of mnesia_monitor, but
{current_function,{disk_log,monitor_request,2}} led us to think that
mnesia_monitor was trying to close the decision_tab log, so we tried to
find out which process owned that log. At this point
disk_log:info(decision_tab) also hung, so we tried
disk_log_server:get_log_pids(decision_tab), which gave us
{local,<0.22681.260>}.
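
(That is, we bypassed the disk_log call interface and asked the
disk_log_server registry directly, then inspected the owner process
ourselves; roughly the following, with the {local, Pid} return shape
taken from the output above.)

case disk_log_server:get_log_pids(decision_tab) of
    {local, LogPid} ->
        %% inspect the owner directly instead of going through the
        %% (hung) disk_log API
        {erlang:process_info(LogPid, current_function),
         erlang:process_info(LogPid, message_queue_len)};
    Other ->
        Other
end.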

Backtrace <0.22681.260>:
Program counter: 0x00007f61c0e0c2a8 (gen:wait_resp_mon/3 + 64)
CP: 0x00007f61c43645d8 (gen_server:call/3 + 160)
arity = 0

0x00007f60f75818a0 Return addr 0x00007f61c43645d8 (gen_server:call/3 + 160)
y(0)     infinity
y(1)     #Ref<0.0.992.227035>
y(2)     'ejabberd@REDACTED'

0x00007f60f75818c0 Return addr 0x00007f61c03b1610 (disk_log:do_exit/4 + 440)
y(0)     infinity
y(1)     {close,<0.22681.260>}
y(2)     disk_log_server
y(3)     Catch 0x00007f61c43645d8 (gen_server:call/3 + 160)

0x00007f60f75818e8 Return addr 0x00007f61c0e23fe8 (proc_lib:init_p/5 + 400)
y(0)     normal
y(1)     []
y(2)     <0.62.0>
y(3)     ok

0x00007f60f7581910 Return addr 0x000000000084bd18 (<terminate process normally>)
y(0)     Catch 0x00007f61c0e24008 (proc_lib:init_p/5 + 432)
y(1)     disk_log
y(2)     init
y(3)     [<0.70.0>,<0.71.0>]

The disk_log process for 'decision_tab' was waiting for a reply from the
disk_log_server to gen_server:call(disk_log_server, {close, self()}).
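
(The caller's monitor reference, #Ref<0.0.992.227035> in the backtrace
above, should appear inside a '$gen_call' message in the server's
mailbox if the request was delivered but never handled. A quick way to
list such messages:)

{messages, Msgs} = erlang:process_info(whereis(disk_log_server), messages),
[Msg || {'$gen_call', {_From, _Ref}, _Request} = Msg <- Msgs].

As the process_info output further below shows, that message is indeed
sitting in the queue.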

Backtrace disk_log_server:
Program counter: 0x00007f61c4365af8 (gen_server:loop/6 + 288)
CP: 0x00007f61c0e23fe8 (proc_lib:init_p/5 + 400)
arity = 0

0x00007f60fb043f78 Return addr 0x00007f61c0e23fe8 (proc_lib:init_p/5 + 400)
y(0)     []
y(1)     infinity
y(2)     disk_log_server
y(3)     {state,[]}
y(4)     disk_log_server
y(5)     <0.30.0>

0x00007f60fb043fb0 Return addr 0x000000000084bd18 (<terminate process normally>)
y(0)     Catch 0x00007f61c0e24008 (proc_lib:init_p/5 + 432)
y(1)     gen
y(2)     init_it
y(3)     [gen_server,<0.30.0>,<0.30.0>,{local,disk_log_server},disk_log_server,[],[]]

process_info(whereis(disk_log_server)) =>
[{registered_name,disk_log_server},
 {current_function,{gen_server,loop,6}},
 {initial_call,{proc_lib,init_p,5}},
 {status,waiting},
 {message_queue_len,1},
 {messages,[{'$gen_call',{<0.22681.260>,#Ref<0.0.992.227035>},
                         {close,<0.22681.260>}}]},
 {links,[<0.111.0>,<0.22677.260>,<0.22681.260>,<0.30.0>]},
 {dictionary,[{<0.111.0>,latest_log},
              {<0.22677.260>,previous_log},
              {'$ancestors',[kernel_safe_sup,kernel_sup,<0.8.0>]},
              {<0.22681.260>,decision_tab},
              {'$initial_call',{gen,init_it,
                                    [gen_server,<0.30.0>,<0.30.0>,
                                     {local,disk_log_server},
                                     disk_log_server,[],[]]}}]},
 {trap_exit,true},
 {error_handler,error_handler},
 {priority,normal},
 {group_leader,<0.7.0>},
 {total_heap_size,246},
 {heap_size,233},
 {stack_size,12},
 {reductions,2366165},
 {garbage_collection,[{fullsweep_after,0},{minor_gcs,0}]},
 {suspending,[]}]

The disk_log_server appears to be doing something impossible: it is
blocked in the receive in gen_server:loop/6 even though there is a
valid message in its queue. We used process_info to check the
reductions a couple of times, and they stayed at 2366165 over a period
of at least a minute, so the process was not making any progress.
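
(The check we did by hand amounts to something like the following; the
sampling interval is arbitrary. In this state it would return true,
since the queue is non-empty and the reduction count is not advancing.)

Stuck = fun(Pid, IntervalMs) ->
            {reductions, R1} = erlang:process_info(Pid, reductions),
            timer:sleep(IntervalMs),
            {reductions, R2} = erlang:process_info(Pid, reductions),
            {message_queue_len, Q} = erlang:process_info(Pid, message_queue_len),
            R2 =:= R1 andalso Q > 0
        end,
Stuck(whereis(disk_log_server), 60000).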

This line of investigation is all we have, as the server has since been
restarted.
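
If it does recur, one thing we may leave running in the meantime is a
small watchdog along these lines (just a sketch; the module name,
polling interval and use of error_logger are our own choices, not
anything from the transcript):

-module(dls_watchdog).
-export([start/0]).

%% Every five seconds, if disk_log_server has queued messages but its
%% reduction count has not moved since the previous sample, log its
%% backtrace.
start() ->
    spawn(fun() -> loop(undefined) end).

loop(PrevReds) ->
    timer:sleep(5000),
    case whereis(disk_log_server) of
        undefined ->
            loop(undefined);
        Pid ->
            case {erlang:process_info(Pid, reductions),
                  erlang:process_info(Pid, message_queue_len)} of
                {{reductions, Reds}, {message_queue_len, Q}}
                  when Q > 0, Reds =:= PrevReds ->
                    {backtrace, Bin} = erlang:process_info(Pid, backtrace),
                    error_logger:error_msg("disk_log_server stalled:~n~s~n",
                                           [Bin]),
                    loop(Reds);
                {{reductions, Reds}, _} ->
                    loop(Reds);
                _ ->
                    loop(undefined)
            end
    end.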



