[erlang-bugs] Supervisor terminate_child race

Bryan Fink bryan@REDACTED
Tue Apr 30 19:34:10 CEST 2013


On Tue, Apr 30, 2013 at 12:09 PM, Tim Watson <watson.timothy@REDACTED> wrote:
>
> On 30 Apr 2013, at 16:22, Bryan Fink wrote:
>
>>> Thanks for reporting this. As far as I can understand, it must be the zero
>>> timeout that is the problem. I assume that the EXIT signal arrives too late.
>>> Can you confirm that?
>>
>> That does seem to be exactly it, yes.
>>
>
> But twiddling the timing there is just as racy, as you've noticed, right?

Correct. The length of the timeout is irrelevant. The EXIT signal is
not guaranteed to arrive within any specific amount of time.

>
>>> You are very welcome to contribute with a patch for this :)
>>
>> I've spent some time fiddling with a couple of hacks, but they have
>> not yet been clever enough. ;) It seems that there is no guarantee
>> about when the EXIT signal will arrive. It might even come some amount
>> of time after the DOWN message.
>>
>
> Isn't the point that the EXIT signal might /never/ come, if the child un-links, or might come *after* the 'DOWN' if the race you've located occurs? Surely you've got to be able to handle either case?

Yes, the point of the monitor is to handle the case where the EXIT
never comes (because the child unlinks). In the race I'm seeing,
however, the order is not fixed: the EXIT does not always arrive after
the DOWN, and both might be delayed.

Handling either order is important, but the problem with this race is
that, when it occurs, only the EXIT message contains the actual exit
reason. The 'noproc' in the DOWN only says that there was no process
left to monitor.
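
To illustrate the difference between the two reasons, here is a minimal
sketch. It is not code from supervisor.erl; the module name noproc_demo
and the 100 ms sleep are assumptions made only so that the child is
certain to be dead before the monitor is created. With the child already
gone, the DOWN should carry 'noproc' while the EXIT carries the real
reason, so run/0 would be expected to return {noproc, boom}.

```erlang
-module(noproc_demo).
-export([run/0]).

%% Hypothetical demo module, not part of OTP.
run() ->
    %% Turn exit signals from linked processes into {'EXIT', Pid, Reason}
    %% messages, as a supervisor does.
    process_flag(trap_exit, true),
    Pid = spawn_link(fun() -> exit(boom) end),
    %% Assumption: 100 ms is plenty for the child to die, so the
    %% monitor below is created against a process that no longer exists.
    timer:sleep(100),
    MRef = erlang:monitor(process, Pid),
    DownReason = receive
                     {'DOWN', MRef, process, Pid, R1} -> R1
                 end,
    ExitReason = receive
                     {'EXIT', Pid, R2} -> R2
                 end,
    {DownReason, ExitReason}.
```

Here the EXIT is waited for without a timeout, which is only safe because
the link is never removed. The supervisor cannot block like that, since
the child may have unlinked.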

> We ran into something similar with our supervisor2 fork a while back, whilst terminating (multiple) simple children: http://hg.rabbitmq.com/rabbitmq-server/rev/812d71d0716c . That code is somewhat different though, not only because it was terminating multiple children (during shutdown) but also because it explicitly unlinks from the child *after* creating the monitor, and /still/ allowed for an EXIT signal to have made its way into the mailbox unexpectedly.

The monitor_child/1 function also unlinks from the child after
creating the monitor. That patch looks a little bit like the fixes I
was trying. Basically it's checking for an EXIT message after
receiving the DOWN, just in case one is in the mailbox, yes? The
problem is that it might still miss an EXIT, because it might still
not have arrived yet, even though it will later.
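
A rough sketch of that "check for an EXIT after the DOWN" approach, to
show where the gap is. This is neither the rabbitmq patch nor the
supervisor code; the module name, function name and return values are
hypothetical, and the caller is assumed to trap exits and to be linked
to Pid.

```erlang
-module(exit_after_down).
-export([wait_for_child/1]).

%% Hypothetical helper. Assumes the caller traps exits and is linked
%% to Pid. Waits for Pid to terminate and tries to find its reason.
wait_for_child(Pid) ->
    MRef = erlang:monitor(process, Pid),
    receive
        {'DOWN', MRef, process, Pid, noproc} ->
            %% The child was gone before the monitor was set up, so the
            %% DOWN has no real reason. Look for an EXIT that is
            %% already in the mailbox.
            receive
                {'EXIT', Pid, Reason} -> {ok, Reason}
            after 0 ->
                %% Nothing queued right now. The EXIT may still arrive
                %% later, and this is where the reason gets lost.
                {unknown, noproc}
            end;
        {'DOWN', MRef, process, Pid, Reason} ->
            %% The monitor was in place in time; the DOWN is reliable.
            {ok, Reason}
    end.
```

Replacing "after 0" with a longer timeout only narrows the window. It
does not close it, since nothing bounds how late the EXIT can be.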

-Bryan


