[erlang-questions] Sudden death of Erlang Node
Thu Jan 18 16:26:29 CET 2007
Eranga Udesh wrote:
> I have a very busy Erlang node running in a Quad Proc server with plenty of
> Ram. The server utilization is quite normal.
You describe the node as "very busy", yet say its utilization is
"quite normal". I find these descriptions contradictory. Could you
give the peak utilization as a CPU percentage? If it is, say, over
90%, that can't be considered normal.
> However time to time, the
> Erlang node goes to sudden death without any warnings. The erlang.log.x log
> files only show that the "heart" couldn't kill the server and the node
> restarting info. Also I cannot find any erl_crash.dump file. Later I
> introduced ERL_CRASH_DUMP and ERL_CRASH_DUMP_SECONDS environment variable
> with different settings, but no luck. I use Erlang version 11B-2.
We've experienced a similar issue intermittently with R11B-0 (without
SMP - which is what we are running in production). The details can be
found in this thread:
Are you seeing the following message in the log?
"heart: Wed Dec 13 18:59:54 2006: Erlang has closed."
I managed to reproduce a similar issue by creating sustained CPU load at
100%. strace showed that at some point an mmap() call made by the node
failed to allocate memory. After that the node closed all its file
descriptors, which was immediately detected by the "heart" process, which
in turn killed and restarted the node. The only artifact left behind was
the error message above in the erlang.log.x file.
I don't know whether this was the same cause as the one we had in
production (at least the production process didn't seem to have exhausted
its memory), but the heart message in the log was identical. What else can
cause an Erlang node to close the pipe connecting it to the heart process?
I suggest you set up a monitoring process on that machine to log some
statistics about the node's OS process (such as a timestamp plus the
contents of /proc/PID/status), so that you can correlate process memory
with the time of the failure.
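
For what it's worth, here is a minimal sketch of such a monitor in Python.
It is only an illustration under a few assumptions: a Linux /proc file
system, a hypothetical log path, and a hand-picked set of status fields
(VmPeak, VmSize, VmRSS, Threads). The function names and the 10-second
interval are mine, not anything shipped with Erlang/OTP.

```python
import time

# Fields to record; these are keys Linux exposes in /proc/PID/status.
# The selection is an assumption, adjust it to what you want to correlate.
FIELDS = ("VmPeak", "VmSize", "VmRSS", "Threads")


def parse_status(text, fields=FIELDS):
    """Extract the selected 'Key:<tab>value' pairs from /proc/PID/status text."""
    result = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key in fields:
            # Collapse the padding the kernel puts between number and unit.
            result[key] = " ".join(value.split())
    return result


def format_record(timestamp, values, fields=FIELDS):
    """Render one log line: local timestamp followed by key=value pairs."""
    stamp = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(timestamp))
    pairs = " ".join("%s=%s" % (k, values.get(k, "?")) for k in fields)
    return "%s %s" % (stamp, pairs)


def monitor(pid, log_path, interval=10.0):
    """Append one record per interval until the process disappears."""
    status_path = "/proc/%d/status" % pid
    with open(log_path, "a") as log:
        while True:
            try:
                with open(status_path) as f:
                    values = parse_status(f.read())
            except (FileNotFoundError, ProcessLookupError):
                # The node's OS process is gone; note when we noticed.
                log.write(format_record(time.time(), {}) + " process gone\n")
                break
            log.write(format_record(time.time(), values) + "\n")
            log.flush()  # keep the last sample on disk if the box goes down
            time.sleep(interval)
```

You would call something like monitor(12345, "/var/tmp/beam-status.log"),
where 12345 stands for the PID of the Erlang VM and the path is made up.
Since heart restarts the node under a new PID, the monitor has to be
restarted (or the PID looked up again) after each failure.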
I'm not sure how helpful this is in your case, but a similar issue pops
up once every couple of months in our production system. Each time it is
followed by an automatic restart, and the root cause remains unresolved.