Erlang node dies silently

Matthias Lang matthias@REDACTED
Tue Oct 8 08:53:23 CEST 2002


 > we are running a (live!) distributed Erlang application.
 > We have recently run into a problem where one of the Erlang nodes
 > dies very silently without any indication of what the problem might be.
 > It is not yet clear whether it's killed by something internal to this
 > node or some external process - Solaris kernel ?  C process ? or 
 > something just falls apart in OTP

 > We have tried using truss and application specific traces, but for 
 > weeks now we are in the dark.  

Surely you aren't completely in the dark. If you had 'truss' on the VM
when it died, you should at least know whether it was a normal or
abnormal exit and what its exit value was.

More generally, you should be launching the VM in such a way that you
know why it exited. One way to do that is to write a C wrapper which
forks and then catches (and logs!) child deaths.

Assuming you're seeing an abnormal exit, the next port of call is the
Erlang crash dump. Test whether or not your system generates crash
dumps by manually halting it with a slogan:

       Eshell V5.1.2  (abort with ^G)
       1> halt("because I'm tired of life").

you should then get a crash dump starting with:

       <Erlang crash dump>
       Tue Oct  8 08:31:01 2002

       Slogan: because I'm tired of life

If you didn't get a crash dump, see 

     http://www.erlang.org/doc/r7b/erts-5.0.1/doc/html/crash_dump.html

and adjust your environment as necessary. By far the most common
reason I see the emulator crash is because it runs out of RAM. It
typically runs out of RAM because something went insane and started
building infinitely large data structures, but there are other ways to
do it, for instance:

    http://www.erlang.org/ml-archive/erlang-questions/200209/msg00170.html

the slogan in that case is something like "Failed to allocate 33 bytes".

Matthias


