[erlang-questions] Sudden death of Erlang Node
Fri Jan 19 15:28:34 CET 2007
BTW, if the issue is indeed related to the heart process killing the
emulator, you can start the node without the "-heart" option (perhaps
using alternative recovery such as a loop in the shell script starting
run_erl) and see whether you get a crash dump.
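
As a rough illustration of the "loop in a script" idea, here is a minimal sketch in Python rather than shell. The function name `supervise` and its parameters are my own invention for this example, and the command to run is whatever you normally use to start the node in the foreground. It only restarts the child and records how it exited; it does not try to replicate heart's heartbeat checking.

```python
import subprocess
import sys
import time


def supervise(cmd, max_restarts=5, delay=1.0):
    """Run cmd, restart it each time it exits, and return the exit codes seen.

    Hypothetical helper: a stand-in for heart's restart behaviour only.
    """
    exit_codes = []
    for attempt in range(1, max_restarts + 1):
        proc = subprocess.Popen(cmd)
        code = proc.wait()  # blocks until the child exits
        exit_codes.append(code)
        stamp = time.strftime("%Y-%m-%d %H:%M:%S")
        if code < 0:
            # On POSIX, a negative return code means "killed by signal -code".
            print(f"{stamp} run {attempt}: killed by signal {-code}", file=sys.stderr)
        else:
            print(f"{stamp} run {attempt}: exited with status {code}", file=sys.stderr)
        time.sleep(delay)  # brief pause so a crash loop does not spin
    return exit_codes
```

Because nothing sends a SIGKILL here, the emulator gets the chance to write erl_crash.dump on its own before the loop starts it again.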
Serge Aleynikov wrote:
> The reason it doesn't record the erl_crash.dump is most likely because
> the emulator gets killed (with a SIGKILL) by the heart process before it
> has a chance to record anything useful in logs. As indicated in my last
> email the (only?) reason for such a harsh action on heart's behalf is
> that the emulator closed the file descriptor of the pipe connecting it
> to the heart process, and heart initiated recovery.
> What caused the emulator to close that file descriptor (aside from
> memory exhaustion) is something that has been bothering me for a while,
> but the fact that it happens very rarely makes it nearly impossible to
> reproduce, and strace is not practical because it would fill up all
> disk space with trace data before the issue occurs.
> Perhaps the recovery strategy of heart needs to be changed from SIGKILL
> to SIGTERM, followed by a short sleep to give Erlang a chance to write
> log details before exiting, and SIGKILL after that if the emulator's PID
> is still running.
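
The escalation described above (SIGTERM, a grace period, then SIGKILL) can be sketched as follows. This is only an illustration in Python, not how heart is implemented: `stop_with_grace` is a hypothetical name, the 5 second default is arbitrary, and the sketch assumes the supervisor is the parent of the process it is stopping.

```python
import subprocess


def stop_with_grace(proc, grace_seconds=5.0):
    """Ask a child process to exit, and force it only if it does not comply.

    proc is a subprocess.Popen object. Returns the name of the last signal
    that had to be sent.
    """
    proc.terminate()  # sends SIGTERM on POSIX
    try:
        # Grace period: time for the process to flush logs and exit.
        proc.wait(timeout=grace_seconds)
        return "SIGTERM"
    except subprocess.TimeoutExpired:
        proc.kill()  # sends SIGKILL on POSIX; cannot be caught or ignored
        proc.wait()  # reap the child so no zombie is left behind
        return "SIGKILL"
```

The return value could be written to the log so that a forced kill is distinguishable from a clean shutdown after the fact.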
> Has anyone else seen similar behavior, or does anyone have other ideas?
> Eranga Udesh wrote:
>> Thanks for the info.
>> I said "very busy" to indicate that the system I am talking about handles
>> 1000-1500 message passes, 750-1000 process spawns, 250-500 mnesia DB
>> accesses, 500-750 Erl Port calls, etc. per second.
>> However, CPU utilization doesn't go beyond 25% on any of the 4 CPUs in the
>> system, and over 40% of memory is free.
>> Based on that, even though my end problem seems similar, the cause for it
>> may not be. I am running 11B-2 compiled with SMP support, but the node is
>> not started in SMP mode.
>> In this particular node, I am running a couple of Erl Port Drivers developed
>> in C.
>> I guess if I can generate the erl_crash.dump, I should be able to find the
>> cause of the problem. Why isn't it being generated?
>> What methods do I have to identify the issue in a situation like this
>> (activate debug, crash dump, etc)?
>> - Eranga
>> -----Original Message-----
>> From: Serge Aleynikov [mailto:]
>> Sent: Thursday, January 18, 2007 8:56 PM
>> To: Eranga Udesh
>> Subject: Re: [erlang-questions] Sudden death of Erlang Node
>> Eranga Udesh wrote:
>>> I have a very busy Erlang node running on a quad-processor server with plenty
>>> of RAM. The server utilization is quite normal.
>> You indicate that you have a "very busy" node, yet its utilization is
>> "quite normal". I find these descriptions contradictory. Could you
>> define the peak utilization in CPU percentage consumption? If it is,
>> say, over 90%, that can't be considered normal.
>>> However, from time to time the
>>> Erlang node dies suddenly without any warning. The erlang.log.x
>>> files only show that "heart" couldn't kill the server, plus the node
>>> restart info. Also, I cannot find any erl_crash.dump file. Later I
>>> set the ERL_CRASH_DUMP and ERL_CRASH_DUMP_SECONDS environment variables
>>> with different settings, but no luck. I use Erlang version 11B-2.
>> We've experienced a similar issue intermittently with R11B-0 (without
>> SMP - which is what we are running in production). The details can be
>> found in this thread:
>> Are you seeing the following message in the log?
>> "heart: Wed Dec 13 18:59:54 2006: Erlang has closed."
>> I managed to reproduce a similar issue by creating sustained CPU load at
>> 100%. strace showed that at some point a node failed to allocate memory
>> by calling mmap(). After that the node closed all file descriptors,
>> which was immediately detected by the "heart" process that in turn
>> killed and restarted the node. The only artifact seen was the error
>> message above in the erlang.log.x file.
>> I don't know exactly if this was the same cause as we had in production
>> (at least the production process didn't seem to have exhausted the
>> memory) but the heart message in the log was identical. What else can
>> cause an Erlang node to close the pipe connecting it to the heart process?
>> I suggest you set up a monitoring process on that machine to log some
>> statistics about the process (such as timestamp + /proc/PID/status), so
>> that you can correlate process memory with the time of the failure.
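
A monitoring process of that kind might look roughly like the sketch below (Python, Linux only, since it reads /proc/PID/status). The choice of fields, the function names, and the 10 second interval are assumptions made for the example, not anything taken from our production setup.

```python
import time

# Memory-related fields of /proc/PID/status that seem worth tracking.
FIELDS = ("VmPeak", "VmSize", "VmRSS", "Threads")


def parse_status(text, fields=FIELDS):
    """Extract the selected 'Key: value' entries from /proc/PID/status text."""
    wanted = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key in fields:
            wanted[key] = " ".join(value.split())  # collapse tabs and padding
    return wanted


def log_line(pid, now=None):
    """Build one timestamped log line for the given PID."""
    with open(f"/proc/{pid}/status") as f:
        stats = parse_status(f.read())
    stamp = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(now))
    return stamp + " " + ", ".join(f"{k}={v}" for k, v in stats.items())


def monitor(pid, logfile, interval=10.0):
    """Append one sample per interval until the process disappears."""
    while True:
        try:
            line = log_line(pid)
        except (FileNotFoundError, ProcessLookupError):
            break  # the process is gone; the last logged line is the clue
        with open(logfile, "a") as out:
            out.write(line + "\n")
        time.sleep(interval)
```

The last few lines of the log before a restart can then be compared against the timestamp in the heart message.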
>> I'm not sure how helpful this is in your case, but a similar issue
>> pops up once every couple of months in our production system, followed by
>> an automatic restart, and it remains unresolved.