[erlang-questions] Sudden death of Erlang Node

Thu Jan 18 16:26:29 CET 2007

Eranga Udesh wrote:
> Hi,
> 
> I have a very busy Erlang node running in a Quad Proc server with plenty of
> Ram. The server utilization is quite normal. 

You indicate that you have a "very busy" node, yet its utilization is
"quite normal".  I find these descriptions contradictory.  Could you
quantify the peak utilization as a CPU percentage?  If it is, say,
over 90%, that can't be considered normal.

> However time to time, the
> Erlang node goes to sudden death without any warnings. The erlang.log.x log
> files only show that the "heart" couldn't kill the server and the node
> restarting info. Also I cannot find any erl_crash.dump file. Later I
> introduced ERL_CRASH_DUMP and ERL_CRASH_DUMP_SECONDS environment variable
> with different settings, but no luck. I use Erlang version 11B-2.

We've experienced a similar issue intermittently with R11B-0 (without 
SMP - which is what we are running in production).  The details can be 
found in this thread:

http://www.erlang.org/pipermail/erlang-questions/2006-December/024365.html

Are you seeing the following message in the log?

    "heart: Wed Dec 13 18:59:54 2006: Erlang has closed."

I managed to reproduce a similar issue by creating sustained CPU load at
100%.  strace showed that at some point the node's mmap() call failed to
allocate memory.  After that the node closed all its file descriptors,
which was immediately detected by the "heart" process, which in turn
killed and restarted the node.  The only artifact left behind was the
error message above in the erlang.log.x file.
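
Since that heart line is the only trace the failure leaves, it helps to
pull its timestamp out of the log so it can be lined up against other
data.  Below is a minimal sketch in Python, not anything that ships with
Erlang.  It assumes every heart line has the same shape as the one
quoted above (a ctime-style timestamp between "heart: " and the
message), and that day and month names are in English.  The function
name heart_events is made up for illustration.

```python
import re
from datetime import datetime

# Matches lines shaped like the one quoted from erlang.log.x:
#   heart: Wed Dec 13 18:59:54 2006: Erlang has closed.
# Assumption: all heart lines use this ctime-style timestamp.
HEART_LINE = re.compile(
    r"heart: (?P<stamp>\w{3} \w{3} [ \d]\d \d{2}:\d{2}:\d{2} \d{4}): (?P<msg>.*)"
)


def heart_events(lines):
    """Return a list of (datetime, message) for every heart line found."""
    events = []
    for line in lines:
        match = HEART_LINE.search(line)
        if match is None:
            continue  # not a heart line, skip it
        when = datetime.strptime(match.group("stamp"), "%a %b %d %H:%M:%S %Y")
        events.append((when, match.group("msg").strip()))
    return events
```

Feeding it the lines of an erlang.log.x file gives the restart times,
which can then be compared with whatever else was logged on the machine
around those moments.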

I don't know whether this was the same cause as the one we had in
production (the production process, at least, didn't seem to have
exhausted its memory), but the heart message in the log was identical.
What else can cause an Erlang node to close the pipe connecting it to
the heart process?

I suggest you set up a monitoring process on that machine to log some
statistics about the node's OS process (such as a timestamp plus the
contents of /proc/PID/status), so that you can correlate process memory
with the time of the failure.
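
One way to do this is sketched below in Python.  It is only a rough
illustration under a few assumptions: a Linux /proc filesystem, a known
PID for the node, and a sampling interval of 10 seconds.  The choice of
fields (VmPeak, VmSize, VmRSS, Threads) and the names parse_status,
format_sample and monitor are mine, not part of any existing tool.

```python
import time
from pathlib import Path

# Fields of /proc/PID/status to record (Linux only; chosen for illustration).
FIELDS = ("VmPeak", "VmSize", "VmRSS", "Threads")


def parse_status(text):
    """Extract the selected fields from the text of /proc/PID/status."""
    result = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key in FIELDS:
            result[key] = " ".join(value.split())  # normalise whitespace
    return result


def format_sample(timestamp, fields):
    """Build one log line: local timestamp followed by key=value pairs."""
    stamp = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(timestamp))
    if not fields:
        return f"{stamp} status-unavailable"
    pairs = " ".join(
        f"{key}={fields[key].replace(' ', '')}" for key in FIELDS if key in fields
    )
    return f"{stamp} {pairs}"


def monitor(pid, log_path, interval=10.0):
    """Append one sample per interval until the status file can't be read."""
    status_file = Path(f"/proc/{pid}/status")
    with open(log_path, "a") as log:
        while True:
            try:
                fields = parse_status(status_file.read_text())
            except OSError:
                fields = {}  # process gone or status unreadable
            log.write(format_sample(time.time(), fields) + "\n")
            log.flush()  # keep the log current in case the monitor is killed
            if not fields:
                return
            time.sleep(interval)
```

Calling monitor(pid, "beam-status.log") with the PID of the node's OS
process leaves a timestamped memory trail.  The last few lines before a
restart show whether memory was climbing when the node went down.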

Not sure how helpful this is in your case, but a similar issue pops up
once every couple of months in our production system, followed by an
automatic restart, and it remains unresolved.

Serge