[erlang-questions] Stuck disk_log_server

Geoff Cant geoff.cant@REDACTED
Wed Jan 7 18:47:09 CET 2009

Hi all, I have discovered a weird problem on a customer cluster: mnesia
is experiencing overload/dump_log problems that appear to be caused by a
broken disk_log_server process. The system involved is R12B-3 running on
64-bit Debian Linux.

The backtrace and process_info for the disk_log_server process are:

Program counter: 0x00007f61c4365af8 (gen_server:loop/6 + 288)
CP: 0x00007f61c0e23fe8 (proc_lib:init_p/5 + 400)
arity = 0

0x00007f60fb043f78 Return addr 0x00007f61c0e23fe8 (proc_lib:init_p/5 + 400)
y(0)     []
y(1)     infinity
y(2)     disk_log_server
y(3)     {state,[]}
y(4)     disk_log_server
y(5)     <0.30.0>

0x00007f60fb043fb0 Return addr 0x000000000084bd18 (<terminate process normally>)
y(0)     Catch 0x00007f61c0e24008 (proc_lib:init_p/5 + 432)
y(1)     gen
y(2)     init_it
y(3)     [gen_server,<0.30.0>,<0.30.0>,{local,disk_log_server},disk_log_server,[],[]]


We think it's this process because mnesia_controller:get_workers(2000)
shows a dumper pid of <0.22676.260>, whose backtrace shows it waiting for
a gen_call response in mnesia_log:save_decision_tab/1 (doing {close_log,
decision_tab}) from <0.62.0> (mnesia_monitor). That process is itself
waiting for a reply in disk_log:monitor_request/2 from the decision_tab
disk_log process (<0.22681.260>), which in turn is waiting for an answer
to gen_server:call(disk_log_server, {close, <0.22681.260>}). That call
seems to be stuck in disk_log_server's message queue.
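
A chain of blocked calls like this one can be walked by printing the current function and backtrace of each process in turn. The following is only a minimal sketch: the module name wait_chain and the function dump/1 are made up for illustration, and it works only for pids local to the node it runs on.

```erlang
-module(wait_chain).
-export([dump/1]).

%% Print the current function and backtrace of each process in the list,
%% in the order given. Entries may be pids or 'undefined' (for example the
%% result of whereis/1 on a name that is not registered).
dump(Pids) ->
    lists:foreach(fun dump_one/1, Pids).

dump_one(undefined) ->
    io:format("undefined: no such registered process~n");
dump_one(Pid) when is_pid(Pid) ->
    case erlang:process_info(Pid, current_function) of
        undefined ->
            io:format("~p: not alive~n", [Pid]);
        {current_function, MFA} ->
            io:format("~p in ~p~n", [Pid, MFA]),
            case erlang:process_info(Pid, backtrace) of
                {backtrace, Bin} -> io:format("~s~n", [Bin]);
                undefined -> ok   % process exited between the two calls
            end
    end.
```

For the chain described above, a call might look like wait_chain:dump([c:pid(0,22676,260), whereis(mnesia_monitor), c:pid(0,22681,260), whereis(disk_log_server)]), where c:pid/3 builds a pid from the three numbers shown in its printed form.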


The status of disk_log_server is 'waiting', but it has a message in its
queue and it appears to be sitting in gen_server:loop/6 - why wouldn't
it be making progress?
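
For reference, the status and queue contents mentioned here can be read with erlang:process_info/2. This is a hedged sketch rather than a diagnosis tool: the module name stuck_check and the function summary/1 are invented for the example, and asking for messages copies the whole queue, which may be expensive on a badly backed-up process.

```erlang
-module(stuck_check).
-export([summary/1]).

-define(ITEMS, [status, message_queue_len, current_function, messages]).

%% Return selected process_info items for a registered name or a local pid.
%% Returns 'undefined' if the name is not registered.
summary(Name) when is_atom(Name) ->
    case whereis(Name) of
        undefined -> undefined;
        Pid -> summary(Pid)
    end;
summary(Pid) when is_pid(Pid) ->
    %% One item per call; an entry is 'undefined' if the process has exited.
    [erlang:process_info(Pid, Item) || Item <- ?ITEMS].
```

Calling stuck_check:summary(disk_log_server) on the affected node would be expected to show the combination described above: a status of waiting together with a non-zero message_queue_len.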

Any ideas would be appreciated here.

The server has been scheduled for a restart and we're hoping mnesia can
limp on until then without any log dumps (or probably any disk_log
activity on the node at all).

I investigated doing some kind of evil manual restart of the
disk_log_server process + manual repopulation of its ets tables and a
fake reply to the decision_tab disk_log process, but disk_log_server lives under
kernel_safe_sup, which suggests that fooling with it is likely to have dire
consequences. For future reference, is it worth trying something like
that to save a node restart?

Geoff Cant
