[erlang-questions] Stuck disk_log_server
Geoff Cant
geoff.cant@REDACTED
Wed Jan 7 18:47:09 CET 2009
Hi all, I have discovered a weird problem on a customer cluster: mnesia
is experiencing overload/dump_log problems that appear to be due to a
broken disk_log_server process. The system involved is R12B-3 running on
64-bit Debian Linux.
The backtrace and process_info for the disk_log_server process are:
Program counter: 0x00007f61c4365af8 (gen_server:loop/6 + 288)
CP: 0x00007f61c0e23fe8 (proc_lib:init_p/5 + 400)
arity = 0

0x00007f60fb043f78 Return addr 0x00007f61c0e23fe8 (proc_lib:init_p/5 + 400)
    y(0) []
    y(1) infinity
    y(2) disk_log_server
    y(3) {state,[]}
    y(4) disk_log_server
    y(5) <0.30.0>

0x00007f60fb043fb0 Return addr 0x000000000084bd18 (<terminate process normally>)
    y(0) Catch 0x00007f61c0e24008 (proc_lib:init_p/5 + 432)
    y(1) gen
    y(2) init_it
    y(3) [gen_server,<0.30.0>,<0.30.0>,{local,disk_log_server},disk_log_server,[],[]]

process_info(whereis(disk_log_server)).
[{registered_name,disk_log_server},
 {current_function,{gen_server,loop,6}},
 {initial_call,{proc_lib,init_p,5}},
 {status,waiting},
 {message_queue_len,1},
 {messages,[{'$gen_call',{<0.22681.260>,#Ref<0.0.992.227035>},
                         {close,<0.22681.260>}}]},
 {links,[<0.111.0>,<0.22677.260>,<0.22681.260>,<0.30.0>]},
 {dictionary,[{<0.111.0>,latest_log},
              {<0.22677.260>,previous_log},
              {'$ancestors',[kernel_safe_sup,kernel_sup,<0.8.0>]},
              {<0.22681.260>,decision_tab},
              {'$initial_call',{gen,init_it,
                                    [gen_server,<0.30.0>,<0.30.0>,
                                     {local,disk_log_server},
                                     disk_log_server,[],[]]}}]},
 {trap_exit,true},
 {error_handler,error_handler},
 {priority,normal},
 {group_leader,<0.7.0>},
 {total_heap_size,246},
 {heap_size,233},
 {stack_size,12},
 {reductions,2366165},
 {garbage_collection,[{fullsweep_after,0},{minor_gcs,0}]},
 {suspending,[]}]

We think it's this process because mnesia_controller:get_workers(2000)
shows a dumper pid of <0.22676.260>, whose backtrace showed it waiting
for a gen_call response in mnesia_log:save_decision_tab/1 (doing
{close_log, decision_tab}) from <0.62.0> (mnesia_monitor). mnesia_monitor
was itself waiting for a reply in disk_log:monitor_request/2 from the
decision_tab disk_log process (<0.22681.260>). That process in turn was
waiting for an answer to
gen_server:call(disk_log_server, {close, <0.22681.260>}), which seems to
be stuck in disk_log_server's message queue.
*phew*
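
Following a chain of blocked calls like the one above means asking each
process in turn what it is waiting on. A minimal sketch of a helper for
that, assuming an OTP release where erlang:process_info/2 accepts a list
of items. The module name stuck_probe and the function probe/1 are
hypothetical, not part of OTP:

```erlang
-module(stuck_probe).
-export([probe/1]).

%% Hypothetical helper (not part of OTP): print and return a summary of
%% what a local process is doing. Accepts a pid or a registered name.
probe(Name) when is_atom(Name) ->
    case whereis(Name) of
        undefined -> {error, not_registered};
        Pid -> probe(Pid)
    end;
probe(Pid) when is_pid(Pid) ->
    %% Ask for everything in one call so the items describe the same
    %% moment and a process exit between calls cannot cause a crash.
    Items = [status, current_function, message_queue_len, messages,
             links, backtrace],
    case erlang:process_info(Pid, Items) of
        undefined ->
            {error, dead};
        Info ->
            %% The backtrace item is a binary holding the textual
            %% stack dump; print it separately so it stays readable.
            Backtrace = proplists:get_value(backtrace, Info),
            Summary = proplists:delete(backtrace, Info),
            io:format("~p~n~s~n", [Summary, Backtrace]),
            {ok, Summary}
    end.
```

Calling stuck_probe:probe(disk_log_server), then repeating it on each pid
that shows up in the messages or links of the previous result, walks the
same chain by hand.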
The status of disk_log_server is 'waiting', but it has a message in its
queue and it appears to be sitting in gen_server:loop/6 - why wouldn't
it be making progress?
Any ideas would be appreciated here.
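
For comparison, the ordinary way for a process to report status 'waiting'
with a non-empty queue is a selective receive whose patterns match none
of the queued messages. The sketch below shows that baseline case only.
It is not offered as an explanation for disk_log_server, since the
process above is reported as sitting in gen_server:loop/6. The module
name waiting_demo is hypothetical:

```erlang
-module(waiting_demo).
-export([run/0]).

%% Spawn a process that only accepts the atom never_sent, send it a
%% message it will not match, and look at what process_info reports.
run() ->
    Pid = spawn(fun() -> receive never_sent -> ok end end),
    Pid ! hello,
    %% Give the process time to scan its mailbox and block again.
    timer:sleep(100),
    Info = erlang:process_info(Pid, [status, message_queue_len]),
    exit(Pid, kill),
    Info.
```

On a normal node waiting_demo:run() should return
[{status,waiting},{message_queue_len,1}], the same two values seen in the
process_info output above.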
The server has been scheduled for a restart and we're hoping mnesia can
limp on until then without any log dumps (or probably any disk_log
activity on the node at all).
I investigated doing some kind of evil manual restart of the
disk_log_server process + manual repopulation of its ets tables and a
fake reply to the decision_tab disk_log process, but disk_log_server
lives under kernel_safe_sup, indicating that fooling with it is likely to
have dire consequences. For future reference, is it worth trying
something like that to save a node restart?
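
To make the "fake reply" part concrete: the queued message has the shape
{'$gen_call', From, Request}, and gen_server:reply/2 can be called from
any process with that From. The sketch below demonstrates the mechanism
against stand-in processes only. The module name fake_reply_demo is
hypothetical, and replying with ok is an assumption about what the real
server would have answered. It says nothing about whether doing this to
the real disk_log_server is safe, since the server's own state and ets
tables are left untouched:

```erlang
-module(fake_reply_demo).
-export([run/0]).

run() ->
    %% Stand-in for a wedged server: it never reads its mailbox.
    Server = spawn(fun() -> receive after infinity -> ok end end),
    Parent = self(),
    %% Stand-in for the blocked caller; it reports the call's result.
    Caller = spawn(fun() ->
        Res = gen_server:call(Server, {close, self()}, infinity),
        Parent ! {result, Res}
    end),
    %% Let the request land in the server's mailbox.
    timer:sleep(100),
    %% Read the pending request without removing it from the queue.
    {messages, [{'$gen_call', From, {close, Caller}}]} =
        erlang:process_info(Server, messages),
    %% Answer on the server's behalf (assumed reply value: ok).
    gen_server:reply(From, ok),
    Result = receive {result, R} -> R after 1000 -> timeout end,
    exit(Server, kill),
    Result.
```

Note that the request stays in the stand-in server's queue after the fake
reply, so a server that later recovers would still process it.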
Cheers,
--
Geoff Cant