[erlang-bugs] Supervisor terminate_child race

Tim Watson watson.timothy@REDACTED
Tue Apr 30 18:09:43 CEST 2013


Bryan,

On 30 Apr 2013, at 16:22, Bryan Fink wrote:

>> Thanks for reporting this. As far as I can understand, it must be the zero
>> timeout that is the problem. I assume that the EXIT signal arrives too late.
>> Can you confirm that?
> 
> That does seem to be exactly it, yes.
> 

But twiddling the timing there is just as racy, as you've noticed, right?

>> You are very welcome to contribute with a patch for this :)
> 
> I've spent some time fiddling with a couple of hacks, but they have
> not yet been clever enough. ;) It seems that there is no guarantee
> about when the EXIT signal will arrive. It might even come some amount
> of time after the DOWN message.
> 

Isn't the point that the EXIT signal might /never/ come if the child unlinks, or might come *after* the 'DOWN' if the race you've located occurs? Surely you've got to be able to handle either case?

We ran into something similar with our supervisor2 fork a while back, whilst terminating (multiple) simple children: http://hg.rabbitmq.com/rabbitmq-server/rev/812d71d0716c . That code is somewhat different though, not only because it terminates multiple children (during shutdown), but also because it explicitly unlinks from the child *after* creating the monitor, and /still/ allows for an EXIT signal to have made its way into the mailbox unexpectedly.
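
A minimal sketch of that monitor-then-unlink approach follows. It is an illustration only, not the OTP supervisor or supervisor2 code: the module name term_sketch and the functions stop_child/2, flush_exit/1 and demo/0 are hypothetical, and it assumes the caller is linked to the child and trapping exits. It relies only on the 'DOWN' message to learn that the child has gone, and treats any 'EXIT' message as optional.

```erlang
-module(term_sketch).
-export([stop_child/2, demo/0]).

%% Hypothetical helper, not the OTP supervisor or supervisor2 code.
%% Assumes the caller is linked to Pid and is trapping exits.
stop_child(Pid, Timeout) ->
    MRef = erlang:monitor(process, Pid),
    %% Unlink only after the monitor exists, so the child's exit is
    %% always observed through the monitor.
    unlink(Pid),
    exit(Pid, shutdown),
    Reason =
        receive
            {'DOWN', MRef, process, Pid, R1} -> R1
        after Timeout ->
            %% The child ignored the shutdown request: force it.
            exit(Pid, kill),
            receive
                {'DOWN', MRef, process, Pid, R2} -> R2
            end
        end,
    %% An 'EXIT' may have reached the mailbox before unlink/1 returned,
    %% or may never arrive at all. Drop it if present, never wait for it.
    flush_exit(Pid),
    Reason.

flush_exit(Pid) ->
    receive
        {'EXIT', Pid, _} -> ok
    after 0 ->
        ok
    end.

%% Usage example: a linked child that just waits for a message.
demo() ->
    process_flag(trap_exit, true),
    Child = spawn_link(fun() -> receive stop -> ok end end),
    stop_child(Child, 1000).
```

One known limitation of this sketch: if the child was already dead before the monitor was created, the 'DOWN' reason is noproc, and the real exit reason (if any) is only in the 'EXIT' message that gets flushed.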

> Does this sound right to you, or is there something I might be overlooking?
> 

I'm very interested to see how this works out, as I've spent a while merging the upstream changes in R16B with Rabbit's supervisor2 module, and will need to integrate this fix into our codebase at some point too.

Cheers
Tim

