[erlang-questions] Sudden death of Erlang Node

Fri Jan 19 15:28:34 CET 2007

BTW, if the issue is indeed related to the heart process killing the 
emulator, you can start the node without the "-heart" option (perhaps 
using an alternative recovery mechanism, such as a loop in the shell 
script starting run_erl), and see if you get a crash dump.
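
A minimal sketch of such a loop, assuming a POSIX shell. The function 
name restart_loop, the run cap and the delay are illustrative and not 
taken from any Erlang/OTP script. The point is that the node runs in the 
foreground without heart, so nothing kills it while it may be writing 
erl_crash.dump; the loop only restarts it after it has exited by itself.

```shell
#!/bin/sh
# restart_loop MAX DELAY CMD [ARGS...]
# Runs CMD in the foreground. If it exits with a non-zero status, waits
# DELAY seconds and runs it again, up to MAX runs in total.
# Returns 0 if CMD eventually exits cleanly, 1 if all MAX runs failed.
restart_loop() {
    max=$1; delay=$2; shift 2
    attempts=0
    while :; do
        "$@"
        status=$?
        if [ "$status" -eq 0 ]; then
            return 0
        fi
        attempts=$((attempts + 1))
        echo "$(date): command exited with status $status (run $attempts)" >&2
        if [ "$attempts" -ge "$max" ]; then
            return 1
        fi
        sleep "$delay"
    done
}
```

Something like "restart_loop 10 5 erl -noshell -sname mynode" would be 
one way to use it (the node name is made up). Whether run_erl passes the 
emulator's exit status through to the calling script is something to 
check before wrapping run_erl itself this way.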

Serge

Serge Aleynikov wrote:
> The reason it doesn't record the erl_crash.dump is most likely that 
> the emulator gets killed (with a SIGKILL) by the heart process before it 
> has a chance to record anything useful in the logs.  As indicated in my 
> last email, the (only?) reason for such a harsh action on heart's part is 
> that the emulator closed the file descriptor of the pipe connecting it 
> to the heart process, and heart initiated recovery.
> 
> What caused the emulator to close that file descriptor (aside from 
> memory exhaustion) is something that has been bothering me for a while, 
> but the fact that it happens very rarely makes it nearly 
> impossible to reproduce, and strace is not practical as it would fill up 
> all disk space with trace data before the issue occurs.
> 
> Perhaps the recovery strategy of heart needs to be changed from SIGKILL 
> to SIGTERM, followed by a short sleep to give Erlang a chance to write 
> log details before exiting, and then SIGKILL if the emulator's PID 
> is still running.
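
The escalation proposed above could look roughly like this in shell. 
This is only a sketch of the idea, not how heart works (per the text 
above, heart sends SIGKILL directly). The function name term_then_kill 
and the grace period are illustrative, and the sketch polls once a second 
instead of doing a single sleep so that it can return early.

```shell
#!/bin/sh
# term_then_kill PID GRACE
# Sends SIGTERM to PID, waits up to GRACE seconds for it to exit, and
# sends SIGKILL only if the process is still there after that.
# Prints "not running", "terminated" or "killed" to report what happened.
term_then_kill() {
    pid=$1; grace=$2
    kill -TERM "$pid" 2>/dev/null || { echo "not running"; return 0; }
    waited=0
    while [ "$waited" -lt "$grace" ]; do
        # kill -0 delivers no signal; it only checks that the PID exists.
        kill -0 "$pid" 2>/dev/null || { echo "terminated"; return 0; }
        sleep 1
        waited=$((waited + 1))
    done
    if kill -0 "$pid" 2>/dev/null; then
        kill -KILL "$pid"
        echo "killed"
    else
        echo "terminated"
    fi
}
```

Whether the emulator would actually use the grace period to log anything 
useful after a SIGTERM is a separate question that this sketch does not 
answer.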
> 
> Has anyone else seen similar behavior, or does anyone have other ideas?
> 
> Serge
> 
> Eranga Udesh wrote:
>> Thanks for the info.
>>
>> I said "very busy" to indicate that the system I am talking about handles
>> 1000-1500 message passes, 750-1000 process spawns, 250-500 mnesia DB
>> accesses, 500-750 Erl Port calls, etc., per second.
>> However, CPU utilization doesn't go beyond 25% on any of the 4 CPUs in the
>> system, and over 40% of memory is free.
>>
>> Based on that, even though my end problem seems similar, the cause for it
>> may not be. I am running 11B-2 compiled with SMP support, but the node is
>> not started in SMP mode.
>>
>> In this particular node, I am running a couple of Erl Port Drivers developed
>> in C.
>> I guess if I can generate the erl_crash.dump, I should be able to find the
>> cause of the problem. Why isn't it being generated?
>>
>> What methods do I have to identify the issue in a situation like this
>> (activate debug, crash dump, etc)?
>>
>> Thanks,
>> - Eranga
>>
>>
>>
>> -----Original Message-----
>> From: Serge Aleynikov [mailto:serge@REDACTED] 
>> Sent: Thursday, January 18, 2007 8:56 PM
>> To: Eranga Udesh
>> Cc: erlang-questions@REDACTED
>> Subject: Re: [erlang-questions] Sudden death of Erlang Node
>>
>> Eranga Udesh wrote:
>>> Hi,
>>>
>>> I have a very busy Erlang node running on a quad-processor server with
>>> plenty of RAM. The server utilization is quite normal. 
>> You indicate that you have a "very busy" node, yet its utilization is 
>> "quite normal".  I find these descriptions contradictory.  Could you 
>> define the peak utilization as a CPU percentage?  If it is, 
>> say, over 90%, that can't be considered normal.
>>
>>> However, from time to time, the
>>> Erlang node dies suddenly without any warning. The erlang.log.x log
>>> files only show that the "heart" couldn't kill the server and the node
>>> restarting info. Also, I cannot find any erl_crash.dump file. Later I
>>> introduced the ERL_CRASH_DUMP and ERL_CRASH_DUMP_SECONDS environment
>>> variables with different settings, but no luck. I use Erlang version 11B-2.
>> We've experienced a similar issue intermittently with R11B-0 (without 
>> SMP - which is what we are running in production).  The details can be 
>> found in this thread:
>>
>> http://www.erlang.org/pipermail/erlang-questions/2006-December/024365.html
>>
>> Are you seeing the following message in the log?
>>
>>     "heart: Wed Dec 13 18:59:54 2006: Erlang has closed."
>>
>> I managed to reproduce a similar issue by creating sustained CPU load at 
>> 100%. strace showed that at some point a node failed to allocate memory 
>> by calling mmap().  After that the node closed all file descriptors, 
>> which was immediately detected by the "heart" process that in turn 
>> killed and restarted the node.  The only artifact seen was the error 
>> message above in the erlang.log.x file.
>>
>> I don't know exactly if this was the same cause as we had in production 
>> (at least the production process didn't seem to have exhausted the 
>> memory) but the heart message in the log was identical.  What else can 
>> cause an Erlang node to close the pipe connecting it to the heart process?
>>
>> I suggest you set up a monitoring process on that machine to log some 
>> statistics about the process (such as timestamp + /proc/PID/status), so 
>> that you can correlate process memory with the time of the failure.
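
A minimal sketch of such a monitor, assuming Linux (it relies on /proc) 
and a POSIX shell. The function names, the selection of fields and the 
log format are illustrative choices, not something prescribed in this 
thread.

```shell
#!/bin/sh
# snapshot_status PID
# Prints one line: a UTC timestamp followed by selected fields from
# /proc/PID/status, or "gone" if that file cannot be read.
snapshot_status() {
    pid=$1
    ts=$(date -u +%Y-%m-%dT%H:%M:%SZ)
    if [ -r "/proc/$pid/status" ]; then
        # Keep the memory-related fields and flatten them onto one line.
        fields=$(grep -E '^(VmPeak|VmSize|VmRSS|VmData|Threads):' "/proc/$pid/status" \
                 | tr -s '\t' ' ' | tr '\n' ';')
        echo "$ts pid=$pid $fields"
    else
        echo "$ts pid=$pid gone"
    fi
}

# monitor PID INTERVAL LOGFILE
# Appends a snapshot every INTERVAL seconds while /proc/PID exists, then
# writes one final snapshot so the log records when the process vanished.
monitor() {
    pid=$1; interval=$2; logfile=$3
    while [ -d "/proc/$pid" ]; do
        snapshot_status "$pid" >> "$logfile"
        sleep "$interval"
    done
    snapshot_status "$pid" >> "$logfile"
}
```

Run against the emulator's PID with an interval of a few seconds, the 
last lines of the log give the memory figures just before the failure and 
a timestamp to compare with the heart message in erlang.log.x.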
>>
>> Not sure how helpful this is in your case, but a similar issue, which 
>> remains unresolved, pops up once every couple of months in our 
>> production system and is followed by an automatic restart.
>>
>> Serge