[erlang-questions] Sudden death of Erlang Node

Fri Jan 19 07:15:45 CET 2007

Thanks for the info.

By "very busy" I meant that the system handles roughly 1000-1500 message
sends, 750-1000 process spawns, 250-500 Mnesia DB accesses, 500-750 Erl
Port calls, etc., per second. However, CPU utilization doesn't go beyond
25% on any of the 4 CPUs in the system, and over 40% of memory is free.

Based on that, even though my symptoms seem similar to yours, the cause
may not be. I am running R11B-2 compiled with SMP support, but the node is
not started in SMP mode.

On this particular node, I am running a couple of Erl Port Drivers
developed in C. I guess that if I could get an erl_crash.dump, I should be
able to find the cause of the problem. Why isn't it being generated?

What methods do I have to identify the issue in a situation like this
(activating debug output, forcing a crash dump, etc.)?

Thanks,
- Eranga

-----Original Message-----
From: Serge Aleynikov [mailto:serge@REDACTED] 
Sent: Thursday, January 18, 2007 8:56 PM
To: Eranga Udesh
Cc: erlang-questions@REDACTED
Subject: Re: [erlang-questions] Sudden death of Erlang Node

Eranga Udesh wrote:
> Hi,
> 
> I have a very busy Erlang node running in a Quad Proc server with plenty of
> Ram. The server utilization is quite normal. 

You indicate that you have a "very busy" node, yet its utilization is 
"quite normal".  I find these descriptions contradictory.  Could you 
state the peak utilization as a CPU percentage?  If it is, say, over 
90%, that can't be considered normal.

> However time to time, the
> Erlang node goes to sudden death without any warnings. The erlang.log.x log
> files only show that the "heart" couldn't kill the server and the node
> restarting info. Also I cannot find any erl_crash.dump file. Later I
> introduced ERL_CRASH_DUMP and ERL_CRASH_DUMP_SECONDS environment variable
> with different settings, but no luck. I use Erlang version 11B-2.

We've experienced a similar issue intermittently with R11B-0 (without 
SMP - which is what we are running in production).  The details can be 
found in this thread:

http://www.erlang.org/pipermail/erlang-questions/2006-December/024365.html

Are you seeing the following message in the log?

    "heart: Wed Dec 13 18:59:54 2006: Erlang has closed."

I managed to reproduce a similar issue by creating a sustained CPU load 
of 100%. strace showed that at some point the node failed to allocate 
memory via mmap().  After that the node closed all file descriptors, 
which was immediately detected by the "heart" process, which in turn 
killed and restarted the node.  The only artifact seen was the error 
message above in the erlang.log.x file.
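
Since that heart line is the only trace left behind, it can help to pull 
its timestamps out of the log so they can be compared with other data.  
A minimal sketch in Python, assuming the heart lines look exactly like 
the one quoted above; the function name and the regular expression are 
illustrative and are not part of Erlang/OTP:

```python
import re
from datetime import datetime

# Matches lines such as:
#   heart: Wed Dec 13 18:59:54 2006: Erlang has closed.
# Assumption: the timestamp is in ctime() layout with English day and
# month names, as in the line quoted in this thread.
HEART_LINE = re.compile(
    r"heart: (?P<stamp>\w{3} \w{3} [ \d]\d \d{2}:\d{2}:\d{2} \d{4}): (?P<msg>.*)"
)


def heart_events(lines):
    """Yield (datetime, message) for every heart line found in `lines`."""
    for line in lines:
        match = HEART_LINE.search(line)
        if match:
            # %a and %b are locale-dependent; this assumes a C/English locale.
            stamp = datetime.strptime(match.group("stamp"),
                                      "%a %b %d %H:%M:%S %Y")
            # Drop surrounding whitespace and a trailing quote, if any.
            yield stamp, match.group("msg").strip().rstrip('"')
```

It could be fed an open log file, for example 
`for when, msg in heart_events(open(path)): print(when, msg)`, where 
`path` points at one of the erlang.log.x files.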

I don't know whether this was the same cause as the one we had in 
production (at least the production process didn't seem to have 
exhausted its memory), but the heart message in the log was identical.  
What else can cause an Erlang node to close the pipe connecting it to 
the heart process?

I suggest you set up a monitoring process on that machine to log some 
statistics about the node's OS process (such as timestamp + 
/proc/PID/status), so that you can correlate its memory usage with the 
time of the failure.
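
A minimal sketch of such a monitor in Python, assuming a Linux /proc 
filesystem.  The selected fields, the 10-second interval and the output 
format are assumptions made for illustration, and the PID of the node's 
OS process has to be supplied by the caller:

```python
import sys
import time

# Fields of /proc/PID/status that seem worth tracking for a
# memory-related failure (an illustrative selection, not a fixed list).
FIELDS = ("VmPeak", "VmSize", "VmRSS", "VmData", "Threads")


def parse_status(text, fields=FIELDS):
    """Return {field: value} for the requested fields of a status dump."""
    result = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key in fields:
            result[key] = value.strip()
    return result


def format_record(timestamp, values, fields=FIELDS):
    """Build one log line: local timestamp followed by key=value pairs."""
    stamp = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(timestamp))
    pairs = " ".join(f"{k}={values[k]}" for k in fields if k in values)
    return f"{stamp} {pairs}"


def monitor(pid, interval=10.0, out=sys.stdout):
    """Log one record every `interval` seconds until the process is gone."""
    path = f"/proc/{pid}/status"
    while True:
        try:
            with open(path) as handle:
                values = parse_status(handle.read())
        except (FileNotFoundError, ProcessLookupError):
            # The process has exited; record when we noticed and stop.
            print(f"{time.strftime('%Y-%m-%d %H:%M:%S')} pid {pid} is gone",
                  file=out, flush=True)
            return
        print(format_record(time.time(), values), file=out, flush=True)
        time.sleep(interval)
```

Calling `monitor(pid)` with the PID of the beam process and redirecting 
the output to a file gives a timeline whose last record before "is gone" 
shows the memory figures closest to the failure.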

Not sure how helpful this is in your case, but a similar issue, still 
unresolved, pops up once every couple of months in our production system 
and is followed by an automatic restart.

Serge