[erlang-questions] Sudden death of Erlang Node
Fri Jan 19 14:55:39 CET 2007
The erl_crash.dump is most likely not written because the emulator gets
killed (with a SIGKILL) by the heart process before it has a chance to
record anything useful in the logs. As indicated in my last email, the
(only?) reason for such a harsh action on heart's behalf is that the
emulator closed the file descriptor of the pipe connecting it to the
heart process, and heart initiated recovery.
What caused the emulator to close that file descriptor (aside from
memory exhaustion) is something that has kept bothering me for a while.
Because it happens very rarely, it is nearly impossible to reproduce,
and strace is not practical, as the trace data would fill up all disk
space before the issue occurs.
Perhaps the recovery strategy of heart needs to be changed from SIGKILL
to SIGTERM, followed by a short sleep to give Erlang a chance to write
log details before exiting, and SIGKILL after that if the emulator's PID
is still running.
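
To make the proposed sequence concrete, here is a minimal sketch in Python.
It is not how heart (which is written in C) is implemented; the function
names, the grace period, and the polling interval are all assumptions made
for illustration.

```python
import os
import signal
import time


def is_running(pid):
    """Best-effort check whether a process with this PID still exists."""
    try:
        # If the process happens to be our child, reap it so that a
        # zombie is not mistaken for a live process.
        reaped, _ = os.waitpid(pid, os.WNOHANG)
        if reaped == pid:
            return False
    except ChildProcessError:
        pass  # not our child; fall through to the signal-0 probe
    try:
        os.kill(pid, 0)  # signal 0 only checks for existence
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # exists, but owned by another user
    return True


def terminate_gracefully(pid, grace_seconds=5.0, poll_interval=0.1):
    """SIGTERM first, then SIGKILL if the PID is still running afterwards.

    Returns the name of the last signal that was needed
    ("none", "SIGTERM" or "SIGKILL").
    """
    try:
        os.kill(pid, signal.SIGTERM)
    except ProcessLookupError:
        return "none"  # already gone

    # Grace period: give the process a chance to write its logs and exit.
    deadline = time.monotonic() + grace_seconds
    while time.monotonic() < deadline:
        if not is_running(pid):
            return "SIGTERM"
        time.sleep(poll_interval)

    try:
        os.kill(pid, signal.SIGKILL)
    except ProcessLookupError:
        return "SIGTERM"  # exited just after the grace period
    return "SIGKILL"
```

Whether the emulator would actually manage to write anything useful after
a SIGTERM in this state is an open question; the sketch only shows the
ordering of the signals.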
Has anyone else seen similar behavior, or does anyone have other ideas?
Eranga Udesh wrote:
> Thanks for the info.
> I said "very busy" to indicate that the system I am talking about handles
> 1000-1500 message passes, 750-1000 process spawns, 250-500 mnesia DB
> accesses, 500-750 Erl Port calls, etc. per second.
> However, the CPU utilization doesn't go beyond 25% on any of the 4 CPUs in
> the system, and over 40% of the memory is free.
> Based on that, even though my end problem seems similar, the cause for it
> may not be. I am running 11B-2 compiled with SMP support, but the node is
> not started in SMP mode.
> In this particular node, I am running a couple of Erl Port Drivers developed
> in C.
> I guess if I can generate the erl_crash.dump, I should be able to find the
> cause of the problem. Why isn't it being generated?
> What methods do I have to identify the issue in a situation like this
> (activate debug, crash dump, etc)?
> - Eranga
> -----Original Message-----
> From: Serge Aleynikov [mailto:]
> Sent: Thursday, January 18, 2007 8:56 PM
> To: Eranga Udesh
> Subject: Re: [erlang-questions] Sudden death of Erlang Node
> Eranga Udesh wrote:
>> I have a very busy Erlang node running on a quad-processor server with
>> plenty of RAM. The server utilization is quite normal.
> You indicate that you have a "very busy" node, yet its utilization is
> "quite normal". I find these descriptions contradictory. Could you
> define the peak utilization in CPU percentage consumption? If it is,
> say, over 90%, that can't be considered normal.
>> However, from time to time, the
>> Erlang node goes to sudden death without any warnings. The erlang.log.x
>> files only show that the "heart" couldn't kill the server and the node
>> restarting info. Also I cannot find any erl_crash.dump file. Later I
>> introduced the ERL_CRASH_DUMP and ERL_CRASH_DUMP_SECONDS environment
>> variables with different settings, but no luck. I use Erlang version 11B-2.
> We've experienced a similar issue intermittently with R11B-0 (without
> SMP - which is what we are running in production). The details can be
> found in this thread:
> Are you seeing the following message in the log?
> "heart: Wed Dec 13 18:59:54 2006: Erlang has closed."
> I managed to reproduce a similar issue by creating sustained CPU load at
> 100%. strace showed that at some point a node failed to allocate memory
> by calling mmap(). After that the node closed all file descriptors,
> which was immediately detected by the "heart" process that in turn
> killed and restarted the node. The only artifact seen was the error
> message above in the erlang.log.x file.
> I don't know exactly if this was the same cause as we had in production
> (at least the production process didn't seem to have exhausted the
> memory) but the heart message in the log was identical. What else can
> cause an Erlang node to close the pipe connecting it to the heart process?
> I suggest you set up a monitoring process on that machine to log some
> statistics about the process (such as timestamp + /proc/PID/status), so
> that you can correlate process memory with a time of the failure.
> Not sure how helpful this is in your case, but a similar issue, which
> remains unresolved, pops up once every couple of months in our production
> system and is followed by an automatic restart.
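
The monitoring suggested in the quoted message (timestamp +
/proc/PID/status) could look roughly like the following Python sketch. It
assumes a Linux /proc filesystem; the choice of fields, the log format,
the sampling interval, and the function names are assumptions made for
illustration.

```python
import time


def read_status_fields(pid, fields=("VmSize", "VmRSS", "FDSize", "Threads")):
    """Return the selected fields of /proc/<pid>/status as a dict."""
    wanted = set(fields)
    found = {}
    with open(f"/proc/{pid}/status") as status_file:
        for line in status_file:
            key, _, value = line.partition(":")
            if key in wanted:
                found[key] = value.strip()
    return found


def format_sample(pid):
    """Build one log line: timestamp, PID, and the selected status fields."""
    stamp = time.strftime("%Y-%m-%d %H:%M:%S")
    fields = read_status_fields(pid)
    details = " ".join(
        f"{key}={value.replace(' ', '')}" for key, value in sorted(fields.items())
    )
    return f"{stamp} pid={pid} {details}"


def monitor(pid, log_path, interval_seconds=10.0):
    """Append a sample to log_path every interval until the process is gone."""
    with open(log_path, "a") as log:
        while True:
            try:
                line = format_sample(pid)
            except (FileNotFoundError, ProcessLookupError):
                stamp = time.strftime("%Y-%m-%d %H:%M:%S")
                log.write(f"{stamp} pid={pid} gone\n")
                break
            log.write(line + "\n")
            log.flush()  # keep the log current in case the machine goes down
            time.sleep(interval_seconds)
```

Run against the emulator's PID, the last few samples before a restart
should show whether memory was actually growing at the time of the
failure.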