[erlang-bugs] heart prevents beam from creating crash dumps

Sat Aug 25 22:03:02 CEST 2012

On 08/25/2012 09:48 PM, Steve Vinoski wrote:
> On Sat, Aug 25, 2012 at 3:39 PM, Richard Carlsson
> <carlsson.richard@REDACTED> wrote:
>> We have had a long-standing problem with not getting any Erlang crash dumps
>> at all on our live servers. I finally figured out why it happens. I have
>> already reported this to the OTP folks, but I thought I should send a
>> summary to the mailing lists for documentation and to give people a
>> heads-up.
>>
>> The problem occurs when you start Erlang with the -heart flag
>> (http://www.erlang.org/doc/man/heart.html). This spawns a small external C
>> program connected through a port. From Erlang's point of view it's like any
>> other port program. The Erlang side sends heartbeats to the heart program
>> every now and then, and if heart receives none within HEART_BEAT_TIMEOUT
>> seconds, or if the connection to Erlang breaks, it assumes the Beam process
>> has gone bad and kills it off with a SIGKILL, and then restarts Erlang using
>> whatever HEART_COMMAND is set to. So far so good.
>>
>> Normally, when Beam detects a critical situation (e.g., out of memory) and
>> decides to shut down, it will create an erl_crash.dump file (or whatever
>> ERL_CRASH_DUMP is set to). This information can greatly help in figuring out
>> what went wrong. But if the system that crashed was large, the crash dump
>> file can take quite a long time to create. In order to make it possible to
>> restart the node (reusing the node name) while the old defunct system is
>> still writing the crash dump, Beam wants to drop its connection to the EPMD
>> service before it starts writing the dump, making it look like the old node
>> has disappeared.
>>
>> The code that does this is the function prepare_crash_dump() in
>> erts/emulator/sys/unix/sys.c. The problem from the perspective of the C code
>> is that the connection to EPMD is on some unknown file descriptor (just like
>> heart, this has been started as a port from Erlang code). The solution they
>> chose, and which has been part of the OTP system for years, is to close
>> _all_ file descriptors except 0-2. This certainly has the desired effect
>> that EPMD releases the node name for reuse. But it also, when the loop gets
>> to file descriptor 10 or thereabouts (probably depending on your system),
>> has the effect of breaking the connection to the heart program.
>>
>> In these multicore days, the effect is almost instantaneous. The heart
>> program immediately wakes up due to the broken pipe and sends SIGKILL to
>> Beam for good measure, to make sure it's really gone, and then it starts a
>> new Erlang node. Meanwhile, the old node is still busy closing file
>> descriptors. Sometimes it makes it as far as 12 before SIGKILL arrives. The
>> poor thing never has a chance to even open the crash dump file for writing.
>> And your operations people only see a weird restart without any further
>> clues.
>>
>> I don't have a good solution right now, except "don't use -heart". And it
>> might be that one wants to separate the automatic restarting of a crashed
>> node from the automatic killing of an unresponsive node anyway. Suggestions
>> are welcome.
>
> Hi Richard, I hit this problem a few years ago. Here's the thread
> starting from where I posted a temporary solution:
>
> http://erlang.org/pipermail/erlang-questions/2010-August/052970.html

Yes, I had seen that. (It was pretty much the only thing that Google
came up with for this particular topic.) But the key point missing from
that discussion was that, ironically enough, it is the act of preparing
to write a crash dump that ends up killing the system before it can
write the dump.
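
To make the mechanism concrete, here is a minimal sketch in C of the
sequence described above. It is not the OTP code: a forked child stands
in for Beam, the parent stands in for heart, and a plain pipe stands in
for the port connection. The function names and the cap of 1024
descriptors are assumptions made for the sketch.

```c
#include <assert.h>   /* only needed for the usage check shown below */
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical stand-in for the descriptor-closing loop: close everything
 * above stderr. The upper bound is capped so the sketch stays fast. */
static void close_all_but_stdio(void)
{
    long max = sysconf(_SC_OPEN_MAX);
    if (max < 0 || max > 1024)
        max = 1024;
    for (int fd = 3; fd < max; fd++)
        close(fd);              /* EBADF on unused slots is harmless here */
}

/* Returns 1 if the watchdog saw EOF while the "node" was still alive and
 * then killed it with SIGKILL, 0 if not, -1 on setup failure. */
int watchdog_kills_node_on_close(void)
{
    int fds[2];
    if (pipe(fds) != 0)
        return -1;

    pid_t node = fork();
    if (node < 0) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }
    if (node == 0) {
        /* "Beam": prepare for the crash dump, then pretend to be busy
         * writing it. This also closes the write end of the pipe. */
        close_all_but_stdio();
        for (;;)
            pause();
    }

    /* "heart": keeps only the read end of the pipe. */
    close(fds[1]);
    char buf[1];
    ssize_t n = read(fds[0], buf, sizeof buf);  /* 0 means EOF */
    close(fds[0]);

    /* The connection is gone, so kill the node even though it is alive. */
    kill(node, SIGKILL);
    int status = 0;
    if (waitpid(node, &status, 0) != node)
        return -1;
    return n == 0 && WIFSIGNALED(status) && WTERMSIG(status) == SIGKILL;
}
```

Calling assert(watchdog_kills_node_on_close() == 1) should pass: the
read returns EOF as soon as the child has closed its copy of the write
end, long before the child would have finished any real work.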

It would be great if there were a simple way of figuring out, from the
C code, the file descriptor numbers used for EPMD and/or heart. Then it
would be easy to fix this. One possibility is to add a new BIF that
stores the file descriptor of the current EPMD connection in a C
variable. Then the loop that closes all file descriptors could be
replaced with a single close.
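
A rough sketch of that idea in C, under the assumption that something
on the Erlang side hands over the descriptor number once the EPMD
connection is up. All names here (epmd_fd, remember_epmd_fd,
drop_epmd_connection, fd_is_open) are made up for illustration and do
not exist in OTP.

```c
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical: descriptor of the EPMD connection, -1 if not recorded. */
static int epmd_fd = -1;

/* Hypothetical hook that the proposed BIF would call. */
void remember_epmd_fd(int fd)
{
    assert(fd >= 0);
    epmd_fd = fd;
}

/* Close only the recorded descriptor and leave everything else, including
 * the pipe to heart, untouched. Returns 0 on success, -1 if nothing was
 * recorded or close() failed. */
int drop_epmd_connection(void)
{
    if (epmd_fd < 0)
        return -1;
    int rc = close(epmd_fd);
    epmd_fd = -1;
    return rc;
}

/* Helper for checking the effect: 1 if fd refers to an open descriptor. */
int fd_is_open(int fd)
{
    return fcntl(fd, F_GETFD) != -1;
}
```

With this in place, the crash dump preparation would call
drop_epmd_connection() instead of looping over every descriptor, so
heart would not see a broken pipe until the node actually exits.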

     /Richard