[erlang-questions] Sudden death of Erlang Node
Fri Jan 19 07:15:45 CET 2007
Thanks for the info.
By "very busy" I mean that the system handles roughly 1000-1500 message
sends, 750-1000 process spawns, 250-500 Mnesia DB accesses, 500-750 Erl Port
calls, etc. per second. However, CPU utilization doesn't go beyond 25% on any
of the 4 CPUs in the system, and over 40% of memory is free.
Based on that, even though my end problem seems similar, its cause may not
be. I am running R11B-2 compiled with SMP support, but the node is not
started in SMP mode.
In this particular node, I am running a couple of Erl Port Drivers developed
I guess if I can generate the erl_crash.dump, I should be able to find the
cause of the problem. Why is it not being generated?
What methods do I have to identify the issue in a situation like this
(activating debug, crash dump, etc.)?
From: Serge Aleynikov
Sent: Thursday, January 18, 2007 8:56 PM
To: Eranga Udesh
Subject: Re: [erlang-questions] Sudden death of Erlang Node
Eranga Udesh wrote:
> I have a very busy Erlang node running on a quad-processor server with
> plenty of RAM. The server utilization is quite normal.
You indicate that you have a "very busy" node, yet its utilization is
"quite normal". I find these descriptions contradictory. Could you
define the peak utilization as a CPU percentage? If it is, say, over
90%, that can't be considered normal.
> However, from time to time the
> Erlang node dies suddenly without any warning. The erlang.log.x
> files only show that "heart" couldn't kill the server, plus the node
> restart info. Also, I cannot find any erl_crash.dump file. Later I
> introduced the ERL_CRASH_DUMP and ERL_CRASH_DUMP_SECONDS environment
> variables with different settings, but no luck. I use Erlang version R11B-2.
We've experienced a similar issue intermittently with R11B-0 (without
SMP - which is what we are running in production). The details can be
found in this thread:
Are you seeing the following message in the log?
"heart: Wed Dec 13 18:59:54 2006: Erlang has closed."
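To check for that message systematically, something like the following Python sketch could pull the timestamps of every such line out of the erlang.log.x files. The timestamp layout is assumed from the single sample line above, the function name is hypothetical, and parsing the day and month names assumes an English/C locale.

```python
import re
from datetime import datetime

# Pattern inferred from the one sample line quoted above; other heart
# versions may format the line differently.
HEART_CLOSED = re.compile(
    r"heart: (?P<stamp>\w{3} \w{3} [ \d]\d \d{2}:\d{2}:\d{2} \d{4}): "
    r"Erlang has closed\."
)


def heart_close_times(log_text):
    """Return a datetime for each 'Erlang has closed.' line in log_text."""
    times = []
    for match in HEART_CLOSED.finditer(log_text):
        stamp = match.group("stamp")
        times.append(datetime.strptime(stamp, "%a %b %d %H:%M:%S %Y"))
    return times
```

Having the failure times as datetime values makes it easier to line them up against any other time-stamped measurements taken on the machine.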
I managed to reproduce a similar issue by creating sustained CPU load at
100%. strace showed that at some point the node failed to allocate memory
in a call to mmap(). After that the node closed all file descriptors,
which was immediately detected by the "heart" process, which in turn
killed and restarted the node. The only artifact seen was the error
message above in the erlang.log.x file.
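If an strace capture of the node is available, a small filter along these lines could find the failed mmap() calls without reading the whole trace by hand. This is only a sketch: it assumes strace's usual "call(args) = -1 ERRNO (text)" output layout, and the function name is hypothetical.

```python
import re

# Matches mmap/mmap2 calls whose return value is -1, capturing the errno name.
FAILED_MMAP = re.compile(r"\bmmap2?\(.*\)\s*=\s*-1\s+(?P<errno>E[A-Z]+)")


def failed_mmaps(strace_lines):
    """Return (errno_name, line) for every failed mmap call in the trace."""
    hits = []
    for line in strace_lines:
        match = FAILED_MMAP.search(line)
        if match:
            hits.append((match.group("errno"), line.rstrip("\n")))
    return hits
```

Called on the lines of a saved trace file, it returns the ENOMEM (or other) failures in order, so the first hit shows where allocation started to fail.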
I don't know exactly if this was the same cause as we had in production
(at least the production process didn't seem to have exhausted the
memory) but the heart message in the log was identical. What else can
cause an Erlang node to close the pipe connecting it to the heart process?
I suggest you set up a monitoring process on that machine to log some
statistics about the process (such as timestamp + /proc/PID/status), so
that you can correlate process memory with a time of the failure.
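A minimal Python sketch of such a monitor is below. It runs outside the Erlang node, so it keeps working when the node dies. The chosen fields, function names, log path and sampling interval are all assumptions for illustration; the field names themselves are standard entries in Linux's /proc/PID/status.

```python
import time

# Fields of /proc/PID/status worth logging; adjust as needed.
FIELDS = ("VmPeak", "VmSize", "VmRSS", "Threads")


def parse_status(text, fields=FIELDS):
    """Extract the selected 'Key: value' entries from /proc/PID/status text."""
    wanted = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key in fields:
            wanted[key] = value.strip()
    return wanted


def format_record(timestamp, values, fields=FIELDS):
    """Format one log line: local timestamp followed by name=value pairs."""
    stamp = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(timestamp))
    cols = [f"{name}={values.get(name, '?')}" for name in fields]
    return stamp + " " + " ".join(cols)


def sample_forever(pid, log_path, interval=5.0):
    """Append one record every `interval` seconds until the process is gone."""
    while True:
        try:
            with open(f"/proc/{pid}/status") as fh:
                values = parse_status(fh.read())
        except OSError:
            break  # /proc entry vanished: the process has exited
        # Reopen in append mode each time so every record reaches the file.
        with open(log_path, "a") as log:
            log.write(format_record(time.time(), values) + "\n")
        time.sleep(interval)
```

The last records written before the loop exits show the memory figures just before the failure, which can then be compared with the time of the heart message in erlang.log.x.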
Not sure how helpful this is in your case, but a similar issue pops up
once every couple of months in our production system, followed by an
automatic restart, and it remains unresolved.