[erlang-bugs] Supervisor terminate_child race
Tue Apr 30 23:13:00 CEST 2013
On 30 Apr 2013, at 18:34, Bryan Fink wrote:
>> But twiddling the timing there is just as racy, as you've noticed, right?
> Correct. The length of the timeout is irrelevant. The EXIT signal is
> not guaranteed to arrive within any specific amount of time.
Indeed. Almost a halting problem, this, isn't it? :)
>> Isn't the point that the EXIT signal might /never/ come, if the child un-links, or might come *after* the 'DOWN' if the race you've located occurs? Surely you've got to be able to handle either case?
> Yes, the point of the monitor is to handle the case where the EXIT
> never comes (because the child unlinks). It is not the case, however,
> that the EXIT always arrives after the DOWN in the race I'm seeing.
> They might both be delayed.
Waiting without a timeout for the 'DOWN' is acceptable, because you've got a guarantee (via the runtime) that it *will* arrive, no matter what state the target process was in when you created the monitor. Waiting some arbitrary time for the 'EXIT' is a real problem though, because you could wait forever.
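To make the 'DOWN' guarantee concrete, here is a minimal sketch. The module and function names (down_wait, wait_for_down/1) are hypothetical and this is not the code from supervisor.erl; it only shows why the receive needs no 'after' clause.

```erlang
-module(down_wait).
-export([wait_for_down/1]).

%% Hypothetical helper: monitor Pid and block until its 'DOWN' arrives.
%% No 'after' clause is used, because the runtime sends a 'DOWN' for
%% every monitor. If Pid is already dead when the monitor is created,
%% the 'DOWN' still arrives, with reason 'noproc'.
wait_for_down(Pid) ->
    MRef = erlang:monitor(process, Pid),
    receive
        {'DOWN', MRef, process, Pid, Reason} ->
            Reason
    end.
```

Note that the returned reason is 'noproc' whenever the process died before the monitor was set up, which is exactly the case where the real exit reason is only available from the 'EXIT'.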
> Handling either order is important, but the problem with this race is
> that only the EXIT message contains the actual exit reason when this
> happens. The 'noproc' in the DOWN is just saying that there was no
> process to monitor.
Indeed. But it could equally be true that the 'EXIT' signal was never dispatched, because the child process unlinked before it died. You can't wait forever for the 'EXIT' after you've seen a 'DOWN' with 'noproc' as the reason, so now you've got to choose how long to wait, and whatever timing works for one particular case isn't going to solve the general problem.
>> We ran into something similar with our supervisor2 fork a while back, whilst terminating (multiple) simple children: http://hg.rabbitmq.com/rabbitmq-server/rev/812d71d0716c . That code is somewhat different though, not only because it was terminating multiple children (during shutdown) but also because it explicitly unlinks from the child *after* creating the monitor, and /still/ allowed for an EXIT signal to have made its way into the mailbox unexpectedly.
> The monitor_child/1 function also unlinks from the child after
> creating the monitor. That patch looks a little bit like the fixes I
> was trying. Basically it's checking for an EXIT message after
> receiving the DOWN, just in case one is in the mailbox, yes?
> The problem is that it might still miss an EXIT, because it might still
> not have arrived yet, even though it will later.
Yes, that's definitely true, and we were aware of that problem. However, since we know we cannot wait for the 'EXIT' forever, and whatever arbitrary timeout we choose is just someone else's race condition, we decided that if the EXIT signal wasn't delivered expediently to the process's mailbox, then losing the real exit reason was something we could live with in the worst case.
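That trade-off can be sketched roughly as follows. This is an illustration only, not the code from the supervisor2 patch; the module and function names (exit_check, final_reason/2) are hypothetical, and it assumes the caller traps exits and has already received the 'DOWN' for Pid.

```erlang
-module(exit_check).
-export([final_reason/2]).

%% Hypothetical helper, called after the 'DOWN' for Pid has been
%% received with reason DownReason. It looks once for an 'EXIT' that
%% is already in the mailbox and prefers its reason. 'after 0' means
%% it never blocks: an 'EXIT' that arrives later is missed, and the
%% reason from the 'DOWN' (possibly 'noproc') is returned instead.
final_reason(Pid, DownReason) ->
    receive
        {'EXIT', Pid, ExitReason} ->
            ExitReason
    after 0 ->
        DownReason
    end.
```

If the 'EXIT' is still in flight when this runs, it stays undelivered until later, so the caller would still need to tolerate a stray {'EXIT', Pid, _} message afterwards.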
Since we've started merging the R15/R16 changes in though, that code has disappeared so we're in the same boat as you guys. :)