[erlang-bugs] heart prevents beam from creating crash dumps

Sat Aug 25 21:39:05 CEST 2012

We have had a long-standing problem with not getting any Erlang crash 
dumps at all on our live servers. I finally figured out why it happens. 
I have already reported this to the OTP folks, but I thought I should 
send a summary to the mailing lists for documentation and to give people 
a heads-up.

The problem occurs when you start Erlang with the -heart flag 
(http://www.erlang.org/doc/man/heart.html). This spawns a small external 
C program connected through a port. From Erlang's point of view it's 
like any other port program. The heart program pings the Erlang side 
every now and then, and if it gets no reply within HEART_BEAT_TIMEOUT 
seconds, or if the connection to Erlang breaks, it assumes the Beam 
process has gone bad and kills it off with a SIGKILL, and then restarts 
Erlang using whatever HEART_COMMAND is set to. So far so good.
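
To make the watchdog behaviour concrete, here is a minimal sketch in C of 
the "wait for a reply or give up" step. It is not the actual heart source; 
the names wait_for_reply and peer_state are made up for illustration, and 
the timeout is passed in milliseconds rather than read from 
HEART_BEAT_TIMEOUT.

```c
#include <assert.h>
#include <poll.h>
#include <unistd.h>

/* Hypothetical result type; not taken from the heart sources. */
enum peer_state { PEER_ALIVE, PEER_TIMED_OUT, PEER_GONE };

/* Wait up to timeout_ms for data on fd (the link to the watched process).
 * Returns PEER_ALIVE if data arrived, PEER_TIMED_OUT if nothing arrived
 * in time, and PEER_GONE if the other end has closed the connection. */
enum peer_state wait_for_reply(int fd, int timeout_ms)
{
    struct pollfd pfd;
    char buf[64];
    int ready;
    ssize_t n;

    assert(fd >= 0);
    pfd.fd = fd;
    pfd.events = POLLIN;
    pfd.revents = 0;

    ready = poll(&pfd, 1, timeout_ms);
    if (ready == 0)
        return PEER_TIMED_OUT;      /* no reply within the timeout */
    if (ready < 0)
        return PEER_GONE;           /* crude: treat a poll error as a broken link */

    /* A read of 0 bytes means end-of-file: the peer closed its end. */
    n = read(fd, buf, sizeof buf);
    return n > 0 ? PEER_ALIVE : PEER_GONE;
}
```

A watchdog built on this would presumably treat both PEER_TIMED_OUT and 
PEER_GONE as "kill and restart", which is the part that matters for the 
rest of this story.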

Normally, when Beam detects a critical situation (e.g., out of memory) 
and decides to shut down, it will create an erl_crash.dump file (or 
whatever ERL_CRASH_DUMP is set to). This information can greatly help in 
figuring out what went wrong. But if the system that crashed was large, 
the crash dump file can take quite a long time to create. In order to 
make it possible to restart the node (reusing the node name) while the 
old defunct system is still writing the crash dump, Beam wants to drop 
its connection to the EPMD service before it starts writing the dump, 
making it look like the old node has disappeared.

The code that does this is the function prepare_crash_dump() in 
erts/emulator/sys/unix/sys.c. The problem from the perspective of the C 
code is that the connection to EPMD is on some unknown file descriptor 
(just like heart, this has been started as a port from Erlang code). The 
solution they chose, and which has been part of the OTP system for 
years, is to close _all_ file descriptors except 0-2. This certainly has 
the desired effect that EPMD releases the node name for reuse. But it 
also, when the loop gets to file descriptor 10 or thereabouts (probably 
depending on your system), has the effect of breaking the connection to 
the heart program.
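
The "close everything except 0-2" approach can be sketched roughly as 
follows. This is a simplified illustration, not the code from sys.c; the 
function names and the limit parameter are assumptions of mine.

```c
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/* Returns nonzero if fd currently refers to an open descriptor. */
int fd_is_open(int fd)
{
    return fcntl(fd, F_GETFD) != -1;
}

/* Close every descriptor from 3 up to (but not including) limit,
 * leaving stdin, stdout and stderr alone. Returns how many
 * descriptors were actually open and got closed. */
int close_all_above_stdio(int limit)
{
    int fd;
    int closed = 0;

    assert(limit >= 3);
    for (fd = 3; fd < limit; fd++) {
        /* close() fails with EBADF on unused slots; that is ignored. */
        if (close(fd) == 0)
            closed++;
    }
    return closed;
}
```

The loop has no way of telling the EPMD socket from the pipe to a port 
program, so anything that happens to sit above descriptor 2 goes with it.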

In these multicore days, the effect is almost instantaneous. The heart 
program immediately wakes up due to the broken pipe and sends SIGKILL to 
Beam for good measure, to make sure it's really gone, and then it starts 
a new Erlang node. Meanwhile, the old node is still busy closing file 
descriptors. Sometimes it makes it as far as 12 before SIGKILL arrives. 
The poor thing never has a chance to even open the crash dump file for 
writing. And your operations people only see a weird restart without any 
further clues.
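
The race can be reproduced in miniature with a pipe and fork(). The sketch 
below is a toy model under several assumptions: the roles are reversed 
compared to the real setup (here the watchdog is the parent and the 
"node" is the child), the limit of 1024 descriptors is arbitrary, and 
sleep(5) merely stands in for a slow crash dump. The function name is 
made up.

```c
#include <assert.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Returns 1 if the child was killed by SIGKILL before it could finish
 * its "crash dump", 0 if it finished normally, -1 on setup failure. */
int watchdog_kills_dumping_child(void)
{
    int link[2];
    pid_t pid;
    char byte;
    int status;
    int fd;

    if (pipe(link) != 0)
        return -1;

    pid = fork();
    if (pid < 0)
        return -1;

    if (pid == 0) {
        /* Child, playing the crashing node: close everything above
         * stderr, which also closes its end of the watchdog link. */
        for (fd = 3; fd < 1024; fd++)
            close(fd);
        sleep(5);               /* stand-in for writing a large dump */
        _exit(0);
    }

    /* Parent, playing the watchdog: drop our copy of the write end so
     * that end-of-file is reported once the child closes its copy. */
    close(link[1]);
    if (read(link[0], &byte, 1) == 0)
        kill(pid, SIGKILL);     /* broken link: assume the node is bad */
    close(link[0]);

    assert(pid > 0);
    if (waitpid(pid, &status, 0) != pid)
        return -1;
    return WIFSIGNALED(status) && WTERMSIG(status) == SIGKILL;
}
```

In this model the function returns well before the five seconds are up, 
because the child is killed as soon as the watchdog sees the closed pipe. 
The child never gets to do its slow work, just like the node that never 
gets to write its dump.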

I don't have a good solution right now, except "don't use -heart". And 
it might be that one wants to separate the automatic restarting of a 
crashed node from the automatic killing of an unresponsive node anyway. 
Suggestions are welcome.

     /Richard