[erlang-questions] Sudden death of Erlang Node

Fri Jan 19 08:31:24 CET 2007

I don't know why you don't get an erl_crash dump.
But if/when you do get one, then I can recommend the
crashdump_viewer. It is excellent actually!!

1> crashdump_viewer:start().
WebTool is available at http://localhost:8888/
ok

Cheers, Tobbe

Eranga Udesh wrote:
> Thanks for the info.
> 
> I said "very busy" to indicate that the system I am talking about handles,
> per second, 1000-1500 messages passed, 750-1000 processes spawned, 250-500
> mnesia DB accesses, 500-750 Erl Port calls, etc.
> However, the CPU utilization doesn't go beyond 25% on any of the 4 CPUs in
> the system, and over 40% of the memory is free.
> 
> Based on that, even though my end problem seems similar, the cause for it
> may not be. I am running 11B-2 compiled with SMP support, but the node is
> not started in SMP mode.
> 
> In this particular node, I am running a couple of Erl Port Drivers developed
> in C.
> I guess if I can generate the erl_crash.dump, I should be able to find the
> cause of the problem. Why isn't it being generated?
> 
> What methods do I have to identify the issue in a situation like this
> (activate debug, crash dump, etc)?
> 
> Thanks,
> - Eranga
> 
> 
> 
> -----Original Message-----
> From: Serge Aleynikov [mailto:serge@REDACTED] 
> Sent: Thursday, January 18, 2007 8:56 PM
> To: Eranga Udesh
> Cc: erlang-questions@REDACTED
> Subject: Re: [erlang-questions] Sudden death of Erlang Node
> 
> Eranga Udesh wrote:
>> Hi,
>>
>> I have a very busy Erlang node running on a Quad Proc server with plenty of
>> RAM. The server utilization is quite normal.
> 
> You indicate that you have a "very busy" node, yet its utilization is 
> "quite normal".  I find these definitions contradictory.  Could you 
> define the peak utilization in CPU percentage consumption?  If it is, 
> say, over 90%, that can't be considered normal.
> 
>> However, from time to time the Erlang node dies suddenly without any
>> warning. The erlang.log.x log files only show that "heart" couldn't kill
>> the server, plus the node restart info. Also, I cannot find any
>> erl_crash.dump file. Later I introduced the ERL_CRASH_DUMP and
>> ERL_CRASH_DUMP_SECONDS environment variables with different settings,
>> but no luck. I use Erlang version 11B-2.
> 
> We've experienced a similar issue intermittently with R11B-0 (without 
> SMP - which is what we are running in production).  The details can be 
> found in this thread:
> 
> http://www.erlang.org/pipermail/erlang-questions/2006-December/024365.html
> 
> Are you seeing the following message in the log?
> 
>     "heart: Wed Dec 13 18:59:54 2006: Erlang has closed."
> 
> I managed to reproduce a similar issue by creating sustained CPU load at 
> 100%. strace showed that at some point a node failed to allocate memory 
> by calling mmap().  After that the node closed all file descriptors, 
> which was immediately detected by the "heart" process that in turn 
> killed and restarted the node.  The only artifact seen was the error 
> message above in the erlang.log.x file.
> 
> I don't know exactly if this was the same cause as we had in production 
> (at least the production process didn't seem to have exhausted the 
> memory) but the heart message in the log was identical.  What else can 
> cause an Erlang node to close the pipe connecting it to the heart process?
> 
> I suggest you set up a monitoring process on that machine to log some 
> statistics about the process (such as timestamp + /proc/PID/status), so 
> that you can correlate process memory with the time of the failure.
> 
> Not sure how helpful this is in your case, but this similar issue 
> pops up once every couple of months in our production system, followed by 
> an automatic restart, and it remains unresolved.
> 
> Serge
>
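
Serge's suggestion of logging a timestamp plus /proc/PID/status could be
sketched roughly as below. This is a minimal sketch, not anything posted in
the thread: it assumes a Linux host, and the function names, the choice of
fields, the log format, and the 5-second interval are all hypothetical.

```python
import time
from pathlib import Path

# Fields of /proc/<pid>/status to record. This selection is an assumption;
# add or remove keys as needed.
FIELDS = ("VmPeak", "VmSize", "VmRSS", "Threads")


def parse_status(text, fields=FIELDS):
    """Return the selected 'Key: value' pairs from /proc/<pid>/status text."""
    wanted = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key in fields:
            wanted[key] = value.strip()
    return wanted


def format_sample(timestamp, stats):
    """Render one log line: local timestamp followed by key=value pairs."""
    stamp = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(timestamp))
    pairs = " ".join(f"{k}={v.replace(' ', '')}" for k, v in stats.items())
    return f"{stamp} {pairs}"


def monitor(pid, log_path, interval=5.0):
    """Append a sample every `interval` seconds until the process disappears."""
    status_file = Path("/proc") / str(pid) / "status"
    with open(log_path, "a") as log:
        while True:
            try:
                text = status_file.read_text()
            except (FileNotFoundError, ProcessLookupError):
                # The process is gone; record when we noticed and stop.
                log.write(format_sample(time.time(), {"state": "gone"}) + "\n")
                break
            log.write(format_sample(time.time(), parse_status(text)) + "\n")
            log.flush()  # keep the last sample on disk if the machine dies
            time.sleep(interval)
```

Calling monitor(PID, "beam_status.log") with the OS pid of the emulator (the
log file name is again just a placeholder) leaves a trail of memory samples
whose last entries can be compared against the time of the "Erlang has
closed" message in erlang.log.x.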