[erlang-bugs] heart prevents beam from creating crash dumps

Mon Aug 27 12:22:59 CEST 2012

> On 27-08-2012 10:16, Paul Guyot wrote:
>>> It would be great if there was a way of simply figuring out the file
>>> descriptor numbers used for EPMD and/or heart from the C code. Then it
>>> would be easy to fix this. One possibility is to add a new BIF that
>>> stores the current EPMD port in a C variable. Then the loop that closes
>>> all ports could be replaced with a single close.
>> I would like to strongly argue against a selected close of the epmd port.
>> 
>> The main argument is that it singles out the node name as the only exclusive resource a node can have.
> I was thinking the same thing, but about listening sockets.
> It'd probably make better sense to make the heart FD be the special 
> case, like stdin/out/err are already.
> 
> Alternatively, could the node communicate to heart that it is dumping, 
> so that heart can leave it be (at least for a while)? That would of 
> course also make the FD special, so probably no gain.
> 
> Is kill-on-close the right behaviour for heart? The "if the connection 
> to Erlang breaks, it assumes the Beam process has gone bad" reaction, is 
> it justified? That could be changed to "if the connection breaks, then 
> assume - at least for a while - that the Beam process is going down." 
> If, after a timeout after the connection disappeared, the Beam process 
> is still around (or, possibly, that heart observes no growth in the dump 
> file... don't know if that's too much complexity), *then* go assume 
> badness and go for the kill. For great justice.

Altering the protocol makes sense, as heart will kill the beam process if it does not get a beat for a given period (60 seconds by default). When the VM is crashing, no Erlang code is running and therefore no beat signal is ever sent to heart again. I would not be surprised if preparing a dump and writing it to disk typically took more than 60 seconds.
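
To make the timing argument concrete, here is a minimal C sketch of the kill decision once the protocol carries a "dumping" notice. The struct, the function name and the grace period are assumptions for illustration and are not taken from heart.c; only the 60-second default comes from the behaviour described above.

```c
#include <stdbool.h>
#include <time.h>

/* Hypothetical watchdog state; not the layout used by heart.c. */
struct watchdog {
    time_t last_beat;     /* time the last heartbeat was received */
    time_t dump_started;  /* time a "dumping" notice arrived, 0 if none */
    int beat_timeout;     /* seconds without a beat before killing */
    int dump_grace;       /* assumed extra seconds allowed for a dump */
};

/* Returns true when the watchdog should kill the VM at time `now`. */
static bool watchdog_should_kill(const struct watchdog *w, time_t now)
{
    if (w->dump_started != 0) {
        /* The VM announced a crash dump: apply the grace period instead. */
        return difftime(now, w->dump_started) > w->dump_grace;
    }
    return difftime(now, w->last_beat) > w->beat_timeout;
}
```

With a 60-second timeout and no notice, the VM is killed 61 seconds after its last beat; once a notice has arrived, only the grace period counts.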

Communication between heart and the VM goes through several channels. As an alternative to altering the protocol on the heart socket, we could change the signals heart sends to the VM (currently a SIGKILL every second for 5 seconds), although this would not be compatible with the Windows version of heart or with custom heart commands.

Alternatively, the VM could fork on exit, with both processes closing all files (after the fork, to make sure heart does not kill the parent beforehand) and the child writing the dump. I am not sure how this would translate to Windows, though.

Paul
-- 
Semiocast            http://semiocast.com/
+33.183627948 - 20 rue Lacaze, 75014 Paris