[erlang-bugs] Supervisor terminate_child race
Tue Apr 30 18:09:43 CEST 2013
On 30 Apr 2013, at 16:22, Bryan Fink wrote:
>> Thanks for reporting this. As far as I can understand, it must be the zero
>> timeout that is the problem. I assume that the EXIT signal arrives too late.
>> Can you confirm that?
> That does seem to be exactly it, yes.
But twiddling the timing there is just as racy, as you've noticed, right?
>> You are very welcome to contribute with a patch for this :)
> I've spent some time fiddling with a couple of hacks, but they have
> not yet been clever enough. ;) It seems that there is no guarantee
> about when the EXIT signal will arrive. It might even come some amount
> of time after the DOWN message.
Isn't the point that the EXIT signal might /never/ come if the child unlinks, or might come *after* the 'DOWN' if the race you've located occurs? Surely you've got to be able to handle either case?

We ran into something similar with our supervisor2 fork a while back, whilst terminating (multiple) simple children: http://hg.rabbitmq.com/rabbitmq-server/rev/812d71d0716c . That code is somewhat different though, not only because it was terminating multiple children (during shutdown), but also because it explicitly unlinks from the child *after* creating the monitor, and /still/ allowed for an EXIT signal to have made its way into the mailbox unexpectedly.
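To make the monitor-then-unlink ordering concrete, here is a minimal sketch. It is not the actual supervisor or supervisor2 code: the module name terminate_sketch and the stop_child/2 and demo/0 functions are made up for illustration, and it only handles a single child.

```erlang
-module(terminate_sketch).
-export([stop_child/2, demo/0]).

%% Hypothetical helper (an assumption for illustration, not OTP code).
%% Stops Pid and returns the reason reported by the monitor, without
%% relying on an 'EXIT' message ever being delivered.
stop_child(Pid, Timeout) ->
    %% Monitor first: 'DOWN' arrives even if the child has unlinked.
    MRef = erlang:monitor(process, Pid),
    %% Unlink afterwards, so the link cannot produce an 'EXIT' later on.
    unlink(Pid),
    %% An 'EXIT' may already have reached the mailbox before the unlink,
    %% so drain it. If the child was already dead at this point, the
    %% 'DOWN' reason below will be noproc rather than the real reason.
    receive
        {'EXIT', Pid, _} -> ok
    after 0 -> ok
    end,
    exit(Pid, shutdown),
    receive
        {'DOWN', MRef, process, Pid, Reason1} ->
            Reason1
    after Timeout ->
        %% The child ignored the shutdown request in time: kill it.
        exit(Pid, kill),
        receive
            {'DOWN', MRef, process, Pid, Reason2} -> Reason2
        end
    end.

%% Small usage example: stop a linked child that does not trap exits.
demo() ->
    process_flag(trap_exit, true),
    Pid = spawn_link(fun() -> receive stop -> ok end end),
    shutdown = stop_child(Pid, 1000),
    %% No stray 'EXIT' from Pid should be left behind.
    receive
        {'EXIT', Pid, _} -> stray_exit
    after 0 -> ok
    end.
```

The termination decision rests entirely on the 'DOWN' message here; the 'EXIT' is treated as something that may or may not be sitting in the mailbox, which is the "handle either case" point above.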
> Does this sound right to you, or is there something I might be overlooking?
I'm very interested to see how this works out, as I've spent a while merging the upstream changes in R16B with Rabbit's supervisor2 module, and will need to integrate this fix into our codebase at some point too.