[erlang-questions] How to debug "Kernel pid terminated"
David Mercer
dmercer@REDACTED
Tue May 15 22:48:24 CEST 2012
I have a distributed application that I run on a couple of nodes. I have
had various problems where one node spontaneously decides another node is
not available and starts up its own instance of the application, but this
one is a first for me: One of my failover nodes exited after printing the
following messages:
=ERROR REPORT==== 14-May-2012::19:43:24 ===
** Generic server dist_ac terminating
** Last message in was {internal_restart_appl,cron}
** When Server state == {state,
[{appl,cron,
{failover,cron_main@REDACTED},
5000,
[cron_main@REDACTED,
{cron_failover@REDACTED,cron_failover@REDACTED}],
[{cron_failover@REDACTED,true}]}],
[],[],
[cron_failover@REDACTED],
[cron],
[],[],[],[],[]}
** Reason for termination ==
** {{case_clause,
{'EXIT',
{timeout,
{gen_server,call,
[application_controller,which_applications]}}}},
[{dist_ac,restart_appl,2,[{file,"dist_ac.erl"},{line,952}]},
{dist_ac,handle_info,2,[{file,"dist_ac.erl"},{line,697}]},
{gen_server,handle_msg,5,[{file,"gen_server.erl"},{line,597}]},
{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,227}]}]}
=ERROR REPORT==== 14-May-2012::19:43:24 ===
server: clickon_backup_server
error: enoent
path: <<"\\\\ftp-corp2\\SFTP-MW\\70350\\Upload\\837">>
{error_logger,{{2012,5,14},{19,43,25}},std_info,[{application,kernel},{exited,shutdown},{type,permanent}]}
{"Kernel pid terminated",application_controller,"{application_terminated,kernel,shutdown}"}
Kernel pid terminated (application_controller) ({application_terminated,kernel,shutdown})
Abnormal termination
I am guessing this node (cron_failover@REDACTED) somehow lost contact with the
main node (cron_main@REDACTED) on the same host. I am not sure, however, why
this would cause the whole Erlang node to crash. How would I go about
debugging this? (1) What circumstances caused this node to lose contact
with the other node on the same host? (2) What can I do to gracefully
handle this situation?
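
One way to start on question (1) might be to log every nodeup/nodedown event together with the reason the runtime reports. The following is only a minimal sketch: the module name node_watch and the helper describe/1 are made up for illustration, and it assumes net_kernel:monitor_nodes/2 with the nodedown_reason option is available in the release in use.

```erlang
-module(node_watch).
-export([start/0, describe/1]).

%% Hypothetical helper module, not part of OTP.
%% Spawns a process that subscribes to node status messages and logs them.
start() ->
    spawn(fun() ->
        %% Ask for {nodeup, Node, Info} / {nodedown, Node, Info} messages,
        %% with the nodedown reason included in Info.
        Result = net_kernel:monitor_nodes(true, [nodedown_reason]),
        error_logger:info_msg("monitor_nodes returned ~p~n", [Result]),
        loop()
    end).

loop() ->
    receive
        {nodeup, _Node, _Info} = Event ->
            error_logger:info_msg("node event: ~p~n", [describe(Event)]),
            loop();
        {nodedown, _Node, _Info} = Event ->
            error_logger:warning_msg("node event: ~p~n", [describe(Event)]),
            loop();
        stop ->
            ok
    end.

%% Reduce a monitor_nodes message to {up | down, Node, Reason}.
describe({nodeup, Node, _Info}) ->
    {up, Node, none};
describe({nodedown, Node, Info}) ->
    {down, Node, proplists:get_value(nodedown_reason, Info, unknown)}.
```

Running node_watch:start() on both nodes should leave a timestamped nodedown entry in the log if the connection really was lost, which could then be compared against the 19:43:24 error report.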
Here's my thought process so far, which doesn't really answer any of my
questions:
1. The error message seems to point me to the case statement on line
952 of dist_ac.erl (restart_appl/2). This is a call to start_appl/3, which
expects either {ok, _, _} or {error, _}, but not {'EXIT', ...}, which is what
it received.
2. Looking at start_appl/3, I doubt it is the keysearch which is
throwing the EXIT, so I'm going to assume that it is the call to
start_distributed/6.
3. I can continue down this rabbit hole, but I'm not sure how it will
answer either of my questions.
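
To illustrate the failure shape from item 1, here is a small self-contained sketch. It is not the dist_ac code: the module case_clause_demo and all of its functions are hypothetical. It only shows how a caught gen_server:call timeout turns into an {'EXIT', ...} tuple, and how a case expression that handles only {ok, _, _} and {error, _} then fails with case_clause, as in the report above.

```erlang
-module(case_clause_demo).
-export([run/0]).

%% Stand-in for a call that times out. The target process never replies,
%% so gen_server:call/3 exits with {timeout, {gen_server, call, Args}},
%% and catch turns that exit into {'EXIT', Reason}.
slow_call() ->
    Pid = spawn(fun() -> receive stop -> ok end end),
    Result = (catch gen_server:call(Pid, which_applications, 100)),
    exit(Pid, kill),
    Result.

%% Handles only the two expected shapes, like the case described in item 1.
strict(Result) ->
    case Result of
        {ok, _, _} -> started;
        {error, _} -> failed
    end.

%% A more defensive variant that also accepts a caught exit.
tolerant(Result) ->
    case Result of
        {ok, _, _} -> started;
        {error, _} -> failed;
        {'EXIT', Reason} -> {crashed, Reason}
    end.

%% Returns the raw call result plus what each variant makes of it.
run() ->
    R = slow_call(),
    Strict = (catch strict(R)),
    Tolerant = tolerant(R),
    {R, Strict, Tolerant}.
```

In this sketch strict/1 fails with {case_clause, {'EXIT', {timeout, ...}}}, the same shape as the termination reason in the log, while tolerant/1 returns a value. If that reading is right, the underlying problem is the which_applications call to application_controller timing out, and the case_clause is only how it surfaced.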
Can someone who knows the workings of distributed applications better than
I do please give me a few pointers? Thank you.
David Mercer