[erlang-questions] Stuck disk_log_server

Wed Jan 7 18:47:09 CET 2009

Hi all, I have discovered a weird problem on a customer cluster: mnesia
is experiencing overload/dump_log problems that appear to be due to a
broken disk_log_server process. The system involved is R12B-3 running on
64-bit Debian Linux.

The backtrace and process_info for the disk_log_server process are:

Program counter: 0x00007f61c4365af8 (gen_server:loop/6 + 288)
CP: 0x00007f61c0e23fe8 (proc_lib:init_p/5 + 400)
arity = 0

0x00007f60fb043f78 Return addr 0x00007f61c0e23fe8 (proc_lib:init_p/5 + 400)
y(0)     []
y(1)     infinity
y(2)     disk_log_server
y(3)     {state,[]}
y(4)     disk_log_server
y(5)     <0.30.0>

0x00007f60fb043fb0 Return addr 0x000000000084bd18 (<terminate process normally>)
y(0)     Catch 0x00007f61c0e24008 (proc_lib:init_p/5 + 432)
y(1)     gen
y(2)     init_it
y(3)     [gen_server,<0.30.0>,<0.30.0>,{local,disk_log_server},disk_log_server,[],[]]

process_info(whereis(disk_log_server)).            
[{registered_name,disk_log_server},
 {current_function,{gen_server,loop,6}},
 {initial_call,{proc_lib,init_p,5}},
 {status,waiting},
 {message_queue_len,1},
 {messages,[{'$gen_call',{<0.22681.260>,#Ref<0.0.992.227035>},
                         {close,<0.22681.260>}}]},
 {links,[<0.111.0>,<0.22677.260>,<0.22681.260>,<0.30.0>]},
 {dictionary,[{<0.111.0>,latest_log},
              {<0.22677.260>,previous_log},
              {'$ancestors',[kernel_safe_sup,kernel_sup,<0.8.0>]},
              {<0.22681.260>,decision_tab},
              {'$initial_call',{gen,init_it,
                                    [gen_server,<0.30.0>,<0.30.0>,
                                     {local,disk_log_server},
                                     disk_log_server,[],[]]}}]},
 {trap_exit,true},
 {error_handler,error_handler},
 {priority,normal},
 {group_leader,<0.7.0>},
 {total_heap_size,246},
 {heap_size,233},
 {stack_size,12},
 {reductions,2366165},
 {garbage_collection,[{fullsweep_after,0},{minor_gcs,0}]},
 {suspending,[]}]
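
For anyone wanting to collect the same kind of data on their own node,
here is a minimal sketch. The module name proc_dump and the function
dump/1 are hypothetical helpers, not part of OTP; whereis/1 and
erlang:process_info/1,2 are standard BIFs. It assumes the process is
local and that the `backtrace` item is supported by the release in use.

```erlang
-module(proc_dump).
-export([dump/1]).

%% Print a backtrace and the full process_info of a locally registered
%% process. Returns the process_info list, or 'undefined' if nothing is
%% registered under Name (or the process died in the meantime).
dump(Name) when is_atom(Name) ->
    case whereis(Name) of
        undefined ->
            undefined;
        Pid ->
            print_backtrace(Pid),
            Info = erlang:process_info(Pid),
            io:format("~p~n", [Info]),
            Info
    end.

%% process_info(Pid, backtrace) returns the trace as a binary.
print_backtrace(Pid) ->
    case erlang:process_info(Pid, backtrace) of
        {backtrace, Bin} -> io:format("~s~n", [Bin]);
        undefined -> ok
    end.
```

Usage from the shell would be something like
proc_dump:dump(disk_log_server).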

We think it's this process because mnesia_controller:get_workers(2000)
shows a dumper pid of <0.22676.260>, whose backtrace shows it waiting
for a gen_call response in mnesia_log:save_decision_tab/1 (doing
{close_log, decision_tab}) from <0.62.0> (mnesia_monitor).
mnesia_monitor is itself waiting for a reply in
disk_log:monitor_request/2 from the decision_tab disk_log process
(<0.22681.260>). That process in turn is waiting for an answer to
gen_server:call(disk_log_server, {close, <0.22681.260>}), which seems to
be stuck in disk_log_server's message queue.

*phew*
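
A chain of blocked calls like this one can be summarised in one go
rather than inspecting each pid by hand. The following is a rough
sketch; the module wait_chain and function summary/1 are hypothetical
names. It only works for pids local to the node it runs on.

```erlang
-module(wait_chain).
-export([summary/1]).

%% For each pid in the list, return {Pid, Info}, where Info holds the
%% items most useful for spotting where a chain of calls is blocked.
%% Info is 'undefined' for dead processes and for non-pid entries
%% (e.g. the result of whereis/1 on an unregistered name).
summary(Pids) ->
    Items = [registered_name, current_function, status, message_queue_len],
    [{P, info(P, Items)} || P <- Pids].

info(Pid, Items) when is_pid(Pid) ->
    erlang:process_info(Pid, Items);
info(_NotAPid, _Items) ->
    undefined.
```

With the pids from this report, the call from the shell might look like
wait_chain:summary([c:pid(0,22676,260), c:pid(0,62,0),
c:pid(0,22681,260), whereis(disk_log_server)]).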

The status of disk_log_server is 'waiting', but it has a message in its
queue and it appears to be sitting in gen_server:loop/6 - why wouldn't
it be making progress?
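
One way to double-check that a process really is not being scheduled is
to sample its reduction count twice. This is only a sketch under that
assumption; progress_probe and reductions_delta/2 are hypothetical
names, and a delta at or near zero merely suggests that the process did
not run during the interval.

```erlang
-module(progress_probe).
-export([reductions_delta/2]).

%% Sample the reduction count of Pid twice, SleepMs milliseconds apart.
%% Returns {ok, Delta}, or 'undefined' if the process is not alive at
%% either sample.
reductions_delta(Pid, SleepMs)
  when is_pid(Pid), is_integer(SleepMs), SleepMs >= 0 ->
    case erlang:process_info(Pid, reductions) of
        undefined ->
            undefined;
        {reductions, R1} ->
            timer:sleep(SleepMs),
            case erlang:process_info(Pid, reductions) of
                undefined -> undefined;
                {reductions, R2} -> {ok, R2 - R1}
            end
    end.
```

For example, progress_probe:reductions_delta(whereis(disk_log_server),
5000) would compare the count over five seconds.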

Any ideas would be appreciated here.

The server has been scheduled for a restart and we're hoping mnesia can
limp on until then without any log dumps (or probably any disk_log
activity on the node at all).

I investigated doing some kind of evil manual restart of the
disk_log_server process + manual repopulation of its ets tables and a
fake reply to the decision_tab disk_log process, but disk_log_server
lives under kernel_safe_sup, which suggests that fooling with it is
likely to have dire consequences. For future reference, is it worth
trying something like that to save a node restart?
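
To make the "fake reply" idea concrete, here is a hedged sketch of what
it could look like. The module fake_reply and both functions are
hypothetical; gen_server:reply/2 is the standard call for answering a
gen_server:call from outside the callback. The reply value the blocked
caller expects is an assumption that would have to be checked against
the disk_log_server source, and nothing here removes the request from
the server's queue or repairs its state, so this is an illustration of
the mechanism rather than a recommended fix.

```erlang
-module(fake_reply).
-export([pending_calls/1, reply_to/2]).

%% List {From, Request} for every gen_server call still sitting in the
%% message queue of Pid. Returns [] if the process is not alive.
pending_calls(Pid) when is_pid(Pid) ->
    case erlang:process_info(Pid, messages) of
        undefined ->
            [];
        {messages, Msgs} ->
            [{From, Req} || {'$gen_call', From, Req} <- Msgs]
    end.

%% Answer a queued call on the server's behalf. The original request
%% stays in the server's queue, so a server that later resumes would
%% still handle it and send a second (ignored) reply.
reply_to(From, Reply) ->
    gen_server:reply(From, Reply).
```

In the situation described here, pending_calls(whereis(disk_log_server))
would be expected to return the single {close, <0.22681.260>} request
shown in the process_info output.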

Cheers,
-- 
Geoff Cant