[erlang-questions] How to debug "Kernel pid terminated"

David Mercer dmercer@REDACTED
Tue May 15 22:48:24 CEST 2012


I have a distributed application that I run on a couple of nodes.  I have
had various problems where one node spontaneously decides another node is
not available and starts up its own instance of the application, but this
one is a first for me: One of my failover nodes exited after printing the
following messages:

 

=ERROR REPORT==== 14-May-2012::19:43:24 ===
** Generic server dist_ac terminating
** Last message in was {internal_restart_appl,cron}
** When Server state == {state,
                            [{appl,cron,
                                 {failover,cron_main@REDACTED},
                                 5000,
                                 [cron_main@REDACTED,
                                  {cron_failover@REDACTED,cron_failover@REDACTED}],
                                 [{cron_failover@REDACTED,true}]}],
                            [],[],
                            [cron_failover@REDACTED],
                            [cron],
                            [],[],[],[],[]}
** Reason for termination ==
** {{case_clause,
        {'EXIT',
            {timeout,
                {gen_server,call,
                    [application_controller,which_applications]}}}},
    [{dist_ac,restart_appl,2,[{file,"dist_ac.erl"},{line,952}]},
     {dist_ac,handle_info,2,[{file,"dist_ac.erl"},{line,697}]},
     {gen_server,handle_msg,5,[{file,"gen_server.erl"},{line,597}]},
     {proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,227}]}]}

 

=ERROR REPORT==== 14-May-2012::19:43:24 ===
    server: clickon_backup_server
    error: enoent
    path: <<"\\\\ftp-corp2\\SFTP-MW\\70350\\Upload\\837">>
{error_logger,{{2012,5,14},{19,43,25}},std_info,[{application,kernel},{exited,shutdown},{type,permanent}]}
{"Kernel pid terminated",application_controller,"{application_terminated,kernel,shutdown}"}

Kernel pid terminated (application_controller) ({application_terminated,kernel,shutdown})

Abnormal termination

 

I am guessing this node (cron_failover@REDACTED) somehow lost contact with the
main node (cron_main@REDACTED) on the same host.  I am not sure, however, why
this would cause the whole Erlang node to crash.  How would I go about
debugging this?  (1) What circumstances caused this node to lose contact
with the other node on the same host?  (2) What can I do to gracefully
handle this situation?
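
For question (1), a minimal sketch of one way the loss of contact might be
observed from inside the node, using net_kernel:monitor_nodes/2 with the
nodedown_reason option.  The module name node_watch and its functions are
hypothetical and not part of my application; this only logs connection
events, it does not explain or prevent them.

```erlang
-module(node_watch).
-export([start/0, loop/0]).

%% Spawn a process that subscribes to node up/down events.
%% The subscription belongs to the spawned process, so
%% monitor_nodes must be called from inside it.
start() ->
    spawn(fun() ->
                  ok = net_kernel:monitor_nodes(true, [nodedown_reason]),
                  loop()
          end).

%% Log every connection event together with the reported reason.
loop() ->
    receive
        {nodeup, Node, _Info} ->
            error_logger:info_msg("nodeup: ~p~n", [Node]),
            loop();
        {nodedown, Node, Info} ->
            Reason = proplists:get_value(nodedown_reason, Info, unknown),
            error_logger:warning_msg("nodedown: ~p reason: ~p~n",
                                     [Node, Reason]),
            loop()
    end.
```

Calling node_watch:start() early on each node would leave a timestamped
record in the log of when, and with what reason, the other node was
considered down.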

 

Here's my thought process so far, which doesn't really answer any of my
questions:

 

1. The error message seems to point me to the case statement on line 952 of
dist_ac.erl (restart_appl/2).  This is a call to start_appl/3, which expects
either {ok, _, _} or {error, _}, but not {'EXIT', ...}, which is what it
received.

2. Looking at start_appl/3, I doubt it is the keysearch that is throwing
the EXIT, so I'm going to assume that it is the call to start_distributed/6.

3. I can continue down this rabbit hole, but I'm not sure how it will answer
either of my questions.
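
The shape of the failure in the steps above can be reproduced in isolation.
The sketch below is not the dist_ac code; it assumes, as the {'EXIT', ...}
tuple suggests, that the timed-out gen_server call was evaluated under a
catch somewhere down the call chain.  The module timeout_case_demo, run/0,
classify/1 and the slow server standing in for application_controller are
all hypothetical.

```erlang
-module(timeout_case_demo).
-behaviour(gen_server).

-export([run/0, classify/1]).
-export([init/1, handle_call/3, handle_cast/2]).

%% A deliberately slow server, standing in for a busy
%% application_controller.
init([]) ->
    {ok, nostate}.

handle_call(which_applications, _From, State) ->
    timer:sleep(500),   % longer than the caller's 100 ms timeout
    {reply, [], State}.

handle_cast(_Msg, State) ->
    {noreply, State}.

%% Returns the value of a caught gen_server call that timed out.
run() ->
    {ok, Pid} = gen_server:start(?MODULE, [], []),
    Result = (catch gen_server:call(Pid, which_applications, 100)),
    exit(Pid, kill),    % clean up the stand-in server
    Result.

%% A case with clauses only for the expected results; any
%% {'EXIT', _} value falls through and raises case_clause.
classify(Result) ->
    case Result of
        {ok, _, _} -> started;
        {error, _} -> failed
    end.
```

Here timeout_case_demo:run() evaluates to an {'EXIT', {timeout,
{gen_server, call, [...]}}} tuple, and passing that to classify/1 fails
with {case_clause, {'EXIT', ...}}, the same pattern as the termination
reason in the report.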

 

Can someone who knows the workings of distributed applications better than
I do please give me a few pointers?  Thank you.

 

David Mercer

 
