[erlang-questions] [erlang-bugs] heart prevents beam from creating crash dumps

Sat Aug 25 21:48:42 CEST 2012

On Sat, Aug 25, 2012 at 3:39 PM, Richard Carlsson
<carlsson.richard@REDACTED> wrote:
> We have had a long-standing problem with not getting any Erlang crash dumps
> at all on our live servers. I finally figured out why it happens. I have
> already reported this to the OTP folks, but I thought I should send a
> summary to the mailing lists for documentation and to give people a
> heads-up.
>
> The problem occurs when you start Erlang with the -heart flag
> (http://www.erlang.org/doc/man/heart.html). This spawns a small external C
> program connected through a port. From Erlang's point of view it's like any
> other port program. The heart program pings the Erlang side every now and
> then, and if it gets no reply within HEART_BEAT_TIMEOUT seconds, or if the
> connection to Erlang breaks, it assumes the Beam process has gone bad and
> kills it off with a SIGKILL, and then restarts Erlang using whatever
> HEART_COMMAND is set to. So far so good.
>
> Normally, when Beam detects a critical situation (e.g., out of memory) and
> decides to shut down, it will create an erl_crash.dump file (or whatever
> ERL_CRASH_DUMP is set to). This information can greatly help figuring out
> what went wrong. But if the system that crashed was large, the crash dump
> file can take quite a long time to create. In order to make it possible to
> restart the node (reusing the node name) while the old defunct system is
> still writing the crash dump, Beam wants to drop its connection to the EPMD
> service before it starts writing the dump, making it look like the old node
> has disappeared.
>
> The code that does this is the function prepare_crash_dump() in
> erts/emulator/sys/unix/sys.c. The problem from the perspective of the C code
> is that the connection to EPMD is on some unknown file descriptor (just like
> heart, this has been started as a port from Erlang code). The solution they
> chose, and which has been part of the OTP system for years, is to close
> _all_ file descriptors except 0-2. This certainly has the desired effect
> that EPMD releases the node name for reuse. But it also, when the loop gets
> to file descriptor 10 or thereabouts (probably depending on your system),
> has the effect of breaking the connection to the heart program.
>
> In these multicore days, the effect is almost instantaneous. The heart
> program immediately wakes up due to the broken pipe and sends SIGKILL to
> Beam for good measure, to make sure it's really gone, and then it starts a
> new Erlang node. Meanwhile, the old node is still busy closing file
> descriptors. Sometimes it makes it as far as 12 before SIGKILL arrives. The
> poor thing never has a chance to even open the crash dump file for writing.
> And your operations people only see a weird restart without any further
> clues.
>
> I don't have a good solution right now, except "don't use -heart". And it
> might be that one wants to separate the automatic restarting of a crashed
> node from the automatic killing of an unresponsive node anyway. Suggestions
> are welcome.

Hi Richard, I hit this problem a few years ago. Here's the thread
starting from where I posted a temporary solution:

http://erlang.org/pipermail/erlang-questions/2010-August/052970.html

Unfortunately no patches came out of that conversation, but Ulf had an
idea that might be worth exploring in a followup to the post linked
above.
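
For anyone who wants to see the race in isolation, here is a minimal
sketch in C. It is not the code from erts/emulator/sys/unix/sys.c. The
names close_fds_from() and watchdog_kills_dumper() and the cap of 1024
descriptors are made up for illustration. A child process stands in for
Beam: it closes every descriptor from 3 upwards and then sleeps to
simulate a slow crash dump. The parent stands in for heart: it blocks on
a pipe, sees end-of-file as soon as the child closes its end, and sends
SIGKILL.

```c
#include <errno.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: close every descriptor from first_fd upwards.
 * The upper bound is capped at 1024 (an arbitrary choice for this
 * sketch) so the loop stays short on systems with a very large limit. */
static void close_fds_from(int first_fd)
{
    long max_fd = sysconf(_SC_OPEN_MAX);
    if (max_fd < 0 || max_fd > 1024)
        max_fd = 1024;
    for (int fd = first_fd; fd < max_fd; fd++)
        close(fd);
}

/* Hypothetical demo: returns 1 if the "dumping" child was killed by
 * SIGKILL, 0 if it exited on its own, and -1 on error. */
int watchdog_kills_dumper(unsigned dump_seconds)
{
    int fds[2];
    if (pipe(fds) < 0)
        return -1;

    pid_t pid = fork();
    if (pid < 0)
        return -1;

    if (pid == 0) {
        /* Child, standing in for Beam: keep only stdin/stdout/stderr.
         * This also closes the write end of the pipe to the watchdog. */
        close_fds_from(3);
        sleep(dump_seconds);    /* pretend to write a large crash dump */
        _exit(0);
    }

    /* Parent, standing in for heart: wait for data or EOF on the pipe. */
    close(fds[1]);
    char byte;
    ssize_t n;
    do {
        n = read(fds[0], &byte, 1);
    } while (n > 0 || (n < 0 && errno == EINTR));
    close(fds[0]);

    /* The pipe is gone, so assume the peer is dead and make sure of it. */
    kill(pid, SIGKILL);

    int status;
    while (waitpid(pid, &status, 0) < 0) {
        if (errno != EINTR)
            return -1;
    }
    return WIFSIGNALED(status) && WTERMSIG(status) == SIGKILL;
}
```

On a typical Unix system, calling watchdog_kills_dumper(5) should return
1 well before the five seconds are up: the parent wakes on the closed
pipe and kills the child while it is still "writing the dump".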

--steve