[erlang-questions] How to debug "Kernel pid terminated"

Wed May 16 20:39:37 CEST 2012

Hi

I don't really have professional experience with Erlang yet, but for a
university project we looked into many levels of redundancy and fault
tolerance with Erlang. I felt that too little of the intent behind the
fault tolerance tools is explicitly documented, and the impression I
got was that there are many cases where one might be tempted to run two
nodes on one machine even though that isn't really useful.

This is all just my opinion and it's not battle-proven, but I've given
it a bit of thought on a big project :)

The idea I have is that, for redundancy, you can run two nodes on
different machines, and perhaps fail over a distributed application
that way. Two nodes on the same machine only help if the node itself
crashes. That can happen, but I think it's a more serious issue that
should be found and fixed at the root rather than papered over with
failover. A second node on the same machine might be affected anyway:
the only VM crash I've seen was when the VM ran out of memory, and the
other nodes on the machine had no memory left either. We ended up
having to do some explicit garbage collection, although the real
mistake was using a process pool rather than spawning a process for
each job.
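
To illustrate the process-per-job idea: a process's heap is reclaimed
as a whole when the process exits, so a short-lived worker needs no
explicit garbage collection. A minimal sketch, not what we actually
ran; the module name job_runner and the function run/1 are made up for
this example:

```erlang
%% Hypothetical module: run each job in its own short-lived process.
-module(job_runner).
-export([run/1]).

%% Run Fun in a fresh, monitored process and wait for its result.
%% The worker's heap is freed when it exits, so nothing accumulates
%% between jobs the way it can in a long-lived pool worker.
run(Fun) when is_function(Fun, 0) ->
    {Pid, Ref} = spawn_monitor(fun() -> exit({result, Fun()}) end),
    receive
        {'DOWN', Ref, process, Pid, {result, Value}} ->
            {ok, Value};
        {'DOWN', Ref, process, Pid, Reason} ->
            %% The job crashed; the caller survives and can decide what to do.
            {error, Reason}
    end.
```

A call such as job_runner:run(fun() -> 1 + 2 end) returns {ok, 3}; a
job that crashes comes back as {error, Reason} instead of taking the
caller down.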

Instead, the processes in the application should be supervised
properly. Since your supervisors should only have one task -
supervision - it should be very easy to ensure they are "perfect". If
they crash, the supervisor above them can restart the subtree, etc.
The top supervisor should never need to crash since its job should be
the simplest - as long as the top supervisor is there, the
application stays up and functional units will be restarted if they
crash. As long as you don't crash the VM, the VM and the application
should stay up. If the application _does_ crash, you can set the
release to shut down when a critical application goes down, and the
next layer can kick in: the hot standby VM on another machine, or a
restarted VM using heartbeat, or perhaps your hot standby on the same
machine.
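
A minimal sketch of a supervisor that does nothing but supervise. The
names cron_sup and cron_worker are hypothetical (only the application
name cron comes from the original post), and the restart limits are
arbitrary values picked for illustration:

```erlang
%% Hypothetical top-level supervisor: no business logic, only supervision.
-module(cron_sup).
-behaviour(supervisor).

-export([start_link/0, init/1]).

start_link() ->
    supervisor:start_link({local, ?MODULE}, ?MODULE, []).

init([]) ->
    %% Restart only the child that died; give up if there are more than
    %% 5 restarts within 10 seconds, so the failure escalates upwards.
    SupFlags = {one_for_one, 5, 10},
    Worker = {cron_worker,                    % child id
              {cron_worker, start_link, []},  % {Module, Function, Args}
              permanent,                      % always restart when it exits
              5000,                           % shutdown timeout in ms
              worker,
              [cron_worker]},
    {ok, {SupFlags, [Worker]}}.
```

For the "shut down when a critical application goes down" part,
starting the application with application:start(cron, permanent)
makes the whole node terminate if that application terminates, and
starting the VM with the -heart flag (with HEART_COMMAND set in the
environment) lets the heart program restart the node afterwards.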

If you have a hot standby machine+VM+distributed application, that can
then kick in. As to the nodes getting confused about whether the other
is online - I have no idea. I guess an intermittent connection is a
big problem for a distributed application, but I think we ended up not
using distributed applications, so I have no experience with them.
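
For reference, distributed applications are configured through kernel
parameters. The following is only a sketch of what a sys.config for
the failover node might look like, not a tested setup: the host name
myhost is a placeholder (the real one is redacted in the quoted
report), the 5000 ms takeover delay mirrors the value visible in the
quoted dist_ac state, and the sync timeout is an arbitrary choice.

```erlang
%% sys.config sketch; 'myhost' is a placeholder host name.
[{kernel,
  [%% cron normally runs on cron_main. If cron_main goes down,
   %% cron_failover starts the application after waiting 5000 ms.
   {distributed, [{cron, 5000, ['cron_main@myhost',
                                'cron_failover@myhost']}]},
   %% At boot, wait for the other node (up to the timeout below)
   %% before starting applications, but carry on without it.
   {sync_nodes_optional, ['cron_main@myhost']},
   {sync_nodes_timeout, 30000},
   %% Seconds of silence after which a connected node is treated as
   %% down (60 is the default); raising it makes spurious takeovers
   %% less likely at the cost of slower failure detection.
   {net_ticktime, 60}
  ]}].
```

Whether tuning net_ticktime would help with the spurious takeovers
described below is a guess; it is simply the knob that controls how
quickly a silent node is declared down.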

Good luck!

JD

On 16 May 2012 19:00, David Mercer <dmercer@REDACTED> wrote:
> As a follow-up question, since I had a problem again overnight where the
> failover took over for the main, even though the main was still running: Are
> Erlang distributed applications not intended to be run on multiple nodes on
> the same host?
>
>
>
> Anyone have any success doing this in production?  I can get it to work, it
> just doesn’t seem to work long-term.
>
>
>
> I guess I don’t often see any posts on this list about the built-in
> distributed application functionality of Erlang/OTP.  Does anyone actually
> use it, or am I behind the times and I should be using some sort of custom
> system developed by the RabbitMQ folks or something?  Just wondering,
> because it makes a really good demo when I show people; it just doesn’t seem
> to be working for me long-term.
>
>
>
> Cheers,
>
>
> DBM
>
>
>
> From: David Mercer [mailto:dmercer@REDACTED]
> Sent: Tuesday, May 15, 2012 3:48 PM
> To: erlang-questions@REDACTED
> Subject: How to debug "Kernel pid terminated"
>
>
>
> I have a distributed application that I run on a couple of nodes.  I have
> had various problems where one node spontaneously decides another node is
> not available and starts up its own instance of the application, but this
> one is a first for me: One of my failover nodes exited after printing the
> following messages:
>
>
>
> =ERROR REPORT==== 14-May-2012::19:43:24 ===
> ** Generic server dist_ac terminating
> ** Last message in was {internal_restart_appl,cron}
> ** When Server state == {state,
>                             [{appl,cron,
>                                  {failover,cron_main@REDACTED},
>                                  5000,
>                                  [cron_main@REDACTED,
>                                   {cron_failover@REDACTED,cron_failover@REDACTED}],
>                                  [{cron_failover@REDACTED,true}]}],
>                             [],[],
>                             [cron_failover@REDACTED],
>                             [cron],
>                             [],[],[],[],[]}
> ** Reason for termination ==
> ** {{case_clause,
>         {'EXIT',
>             {timeout,
>                 {gen_server,call,
>                     [application_controller,which_applications]}}}},
>     [{dist_ac,restart_appl,2,[{file,"dist_ac.erl"},{line,952}]},
>      {dist_ac,handle_info,2,[{file,"dist_ac.erl"},{line,697}]},
>      {gen_server,handle_msg,5,[{file,"gen_server.erl"},{line,597}]},
>      {proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,227}]}]}
>
> =ERROR REPORT==== 14-May-2012::19:43:24 ===
>     server: clickon_backup_server
>     error: enoent
>     path: <<"\\\\ftp-corp2\\SFTP-MW\\70350\\Upload\\837">>
>
> {error_logger,{{2012,5,14},{19,43,25}},std_info,[{application,kernel},{exited,shutdown},{type,permanent}]}
> {"Kernel pid terminated",application_controller,"{application_terminated,kernel,shutdown}"}
>
> Kernel pid terminated (application_controller) ({application_terminated,kernel,shutdown})
>
>
>
>
>
> Abnormal termination
>
>
>
> I am guessing this node (cron_failover@REDACTED) somehow lost contact with the
> main node (cron_main@REDACTED) on the same host.  I am not sure, however, why
> this would cause the whole Erlang node to crash.  How would I go about
> debugging this?  (1) What circumstances caused this node to lose contact
> with the other node on the same host?  (2) What can I do to gracefully
> handle this situation?
>
>
>
> Here’s my thought process so far, which doesn’t really answer any of my
> questions:
>
>
>
> 1. The error message seems to point me to the case statement on line
>    952 of dist_ac.erl (restart_appl/2).  This is a call to start_appl/3, which
>    expects either {ok, _, _} or {error, _}, but not {'EXIT', …}, which is what
>    it received.
>
> 2. Looking at start_appl/3, I doubt it is the keysearch which is
>    throwing the EXIT, so I’m going to assume that it is the call to
>    start_distributed/6.
>
> 3. I can continue down this rabbit hole, but I’m not sure how it will
>    answer either of my questions.
>
>
>
> Can someone who perhaps knows the workings of distributed applications
> better than I please give me a few pointers?  Please advise.  Thank-you.
>
>
>
> David Mercer