[erlang-bugs] Supervisor terminate_child race

Mon May 13 17:41:43 CEST 2013

Bryan and Tim, your analysis is very good, and the problem is complicated.
I don't see a "water tight" solution right now, and I cannot spend too
much time pondering without having a real priority for this case. I have
written a ticket for it, and it will be prioritized along with all other
backlog items. Any further thoughts and contributions will be very much
appreciated :)
Thanks again
/siri

2013/4/30 Tim Watson <watson.timothy@REDACTED>

> Hi Bryan,
>
> On 30 Apr 2013, at 18:34, Bryan Fink wrote:
>
>
> But twiddling the timing there is just as racy, as you've noticed, right?
>
>
> Correct. The length of the timeout is irrelevant. The EXIT signal is
> not guaranteed to arrive within any specific amount of time.
>
>
> Indeed. Almost a halting problem this, isn't it? :)
>
>
> Isn't the point that the EXIT signal might /never/ come, if the child
> un-links, or might come *after* the 'DOWN' if the race you've located
> occurs? Surely you've got to be able to handle either case?
>
>
> Yes, the point of the monitor is to handle the case where the EXIT
> never comes (because the child unlinks). It is not the case, however,
> that the EXIT always arrives after the DOWN in the race I'm seeing.
> They might both be delayed.
>
>
> Waiting without a timeout for the 'DOWN' is acceptable, because you've got
> a guarantee (via the runtime) that it *will* arrive, no matter what state
> the target process was in when you created the monitor. Waiting some
> arbitrary time for the 'EXIT' is a real problem though, because you could
> wait forever.
>
> Handling either order is important, but the problem with this race is
> that only the EXIT message contains the actual exit reason when this
> happens. The 'noproc' in the DOWN is just saying that there was no
> process to monitor.
>
>
> Indeed. But it could equally be true that the 'EXIT' signal was never
> dispatched, because the child process unlinked before it died. You can't
> wait forever for the 'EXIT' after you've seen a 'DOWN' with 'noproc' as the
> reason, so now you've got to choose how long to wait, but whatever timing
> works for one particular case isn't going to solve the general problem.
>
>
> We ran into something similar with our supervisor2 fork a while back,
> whilst terminating (multiple) simple children:
> http://hg.rabbitmq.com/rabbitmq-server/rev/812d71d0716c . That code is
> somewhat different though, not only because it was terminating multiple
> children (during shutdown) but also because it explicitly unlinks from the
> child *after* creating the monitor, and /still/ allowed for an EXIT signal
> to have made its way into the mailbox unexpectedly.
>
>
> The monitor_child/1 function also unlinks from the child after
> creating the monitor. That patch looks a little bit like the fixes I
> was trying. Basically it's checking for an EXIT message after
> receiving the DOWN, just in case one is in the mailbox, yes?
>
>
> That's correct.
>
> The problem is that it might still miss an EXIT, because it might still
> not have arrived yet, even though it will later.
>
>
> Yes that's definitely true and we were aware of that problem, however
> since we know we cannot wait for the 'EXIT' forever and whatever arbitrary
> timeout we choose is just someone else's race condition, we decided that if
> the EXIT signal wasn't delivered promptly to the process's mailbox, then
> losing the real exit reason was something we could live with in the worst
> case.
>
> Since we've started merging the R15/R16 changes in though, that code has
> disappeared so we're in the same boat as you guys. :)
>
> Cheers,
> Tim
>
>
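
For readers following the thread, here is a minimal sketch of the approach
discussed above: wait for the 'DOWN' without a timeout, then check whether an
'EXIT' is already in the mailbox, without waiting for one. This is not the OTP
supervisor code or the supervisor2 patch. The module name child_exit_sketch and
the functions await_exit/1 and demo/0 are hypothetical, and the sketch assumes
the caller traps exits and is linked to the child.

```erlang
-module(child_exit_sketch).
-export([await_exit/1, demo/0]).

%% Hypothetical helper, not OTP code.
%% Assumes the caller traps exits and is linked to Pid.
await_exit(Pid) ->
    MRef = erlang:monitor(process, Pid),
    unlink(Pid),
    %% No timeout here: the runtime delivers a 'DOWN' for every monitor,
    %% with reason 'noproc' if Pid was already gone.
    receive
        {'DOWN', MRef, process, Pid, DownReason} ->
            %% Zero-timeout check: take an 'EXIT' only if it is already
            %% queued. If it is not, settle for the 'DOWN' reason.
            receive
                {'EXIT', Pid, ExitReason} ->
                    {exited, ExitReason}
            after 0 ->
                {down, DownReason}
            end
    end.

%% Spawns a child that exits at once, so the outcome depends on timing.
demo() ->
    Old = process_flag(trap_exit, true),
    Pid = spawn_link(fun() -> exit(boom) end),
    Result = await_exit(Pid),
    process_flag(trap_exit, Old),
    Result.
```

Depending on timing, demo/0 may return the real reason through either message,
or {down, noproc} when the 'EXIT' was not in the mailbox at the time of the
check. That last outcome is the trade-off Tim describes: the real exit reason
is lost rather than waited for indefinitely.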