[erlang-bugs] Supervisor terminate_child race

Robert Virding <>
Wed May 15 12:17:28 CEST 2013


Do you mean only using monitors in the supervisor, and no links? If so, that would not work, as you would then not get an exit signal automatically sent to the child when the supervisor dies, which you do want. Or have I misunderstood you?
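
A minimal sketch of that difference, under stated assumptions: the module name link_vs_monitor and the helper child_survives/1 are hypothetical and are not part of OTP or of the supervisor code. A throwaway "parent" process starts a child either linked or only monitored, then exits abnormally, and the caller checks whether the child outlives it. The 500 ms wait is only a demo convenience; as the discussion quoted below points out, a timeout proves nothing in general.

```erlang
-module(link_vs_monitor).
-export([demo/0]).

%% Returns {LinkedChildSurvives, MonitoredChildSurvives}.
demo() ->
    Idle = fun() -> receive after infinity -> ok end end,
    Linked = fun() -> spawn_link(Idle) end,
    Monitored = fun() -> {Pid, _Ref} = spawn_monitor(Idle), Pid end,
    {child_survives(Linked), child_survives(Monitored)}.

%% Hypothetical helper: StartChild is run inside a short-lived parent
%% process, which then crashes. We report whether the child is still
%% running shortly afterwards.
child_survives(StartChild) ->
    Caller = self(),
    Parent = spawn(fun() ->
                       Caller ! {child, self(), StartChild()},
                       receive stop -> exit(crashed) end
                   end),
    Child = receive {child, Parent, Pid} -> Pid end,
    %% Watch the child from the caller before letting the parent die.
    ChildRef = erlang:monitor(process, Child),
    Parent ! stop,
    receive
        {'DOWN', ChildRef, process, Child, _Reason} ->
            %% The exit signal travelled over the link and took the child down.
            false
    after 500 ->
        %% No link, so no exit signal: the child is orphaned. Clean it up.
        erlang:demonitor(ChildRef, [flush]),
        exit(Child, kill),
        true
    end.
```

With a link the child goes down together with the parent; with only a monitor held by the parent, nothing is ever sent to the child and it keeps running.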

Robert 

----- Original Message -----

> From: "Tim Watson" <>
> To: "Siri Hansen" <>
> Cc: 
> Sent: Wednesday, 15 May, 2013 10:03:31 AM
> Subject: Re: [erlang-bugs] Supervisor terminate_child race

> Switching to monitors is, IMHO, a better approach, since using both is
> prone to races and links are open to interference.

> Are there any disadvantages I've not thought of though? Or are you
> suggesting to do both from birth?

> On 14 May 2013, at 15:43, Siri Hansen <  > wrote:

> > Just a thought: would it be an option (and would it help) to monitor
> > each child from birth?

> > /siri

> > 2013/5/13 Siri Hansen <  >

> > > Bryan and Tim, your analysis is very good, and the problem is
> > > complicated. I don't see a "watertight" solution right now, and I
> > > cannot spend too much time pondering without having a real priority
> > > for this case. I have written a ticket for it, and it will be
> > > prioritized along with all other backlog items. Any further thoughts
> > > and contributions will be very much appreciated :)

> > > Thanks again

> > > /siri

> > > 2013/4/30 Tim Watson <  >

> > > > Hi Bryan,
> > > 
> > 
> 

> > > > On 30 Apr 2013, at 18:34, Bryan Fink wrote:

> > > > > > But twiddling the timing there is just as racy, as you've noticed,
> > > > > > right?

> > > > > Correct. The length of the timeout is irrelevant. The EXIT signal is
> > > > > not guaranteed to arrive within any specific amount of time.

> > > > Indeed. Almost a halting problem, this, isn't it? :)

> > > > > > Isn't the point that the EXIT signal might /never/ come, if the
> > > > > > child un-links, or might come *after* the 'DOWN' if the race you've
> > > > > > located occurs? Surely you've got to be able to handle either case?

> > > > > Yes, the point of the monitor is to handle the case where the EXIT
> > > > > never comes (because the child unlinks). It is not the case, however,
> > > > > that the EXIT always arrives after the DOWN in the race I'm seeing.
> > > > > They might both be delayed.

> > > > Waiting without a timeout for the 'DOWN' is acceptable, because
> > > > you've got a guarantee (via the runtime) that it *will* arrive, no
> > > > matter what state the target process was in when you created the
> > > > monitor. Waiting some arbitrary time for the 'EXIT' is a real
> > > > problem though, because you could wait forever.

> > > > > Handling either order is important, but the problem with this race is
> > > > > that only the EXIT message contains the actual exit reason when this
> > > > > happens. The 'noproc' in the DOWN is just saying that there was no
> > > > > process to monitor.

> > > > Indeed. But it could equally be true that the 'EXIT' signal was never
> > > > dispatched, because the child process unlinked before it died. You
> > > > can't wait forever for the 'EXIT' after you've seen a 'DOWN' with
> > > > 'noproc' as the reason, so now you've got to choose how long to
> > > > wait, but whatever timing works for one particular case isn't going
> > > > to solve the general problem.

> > > > > > We ran into something similar with our supervisor2 fork a while
> > > > > > back, whilst terminating (multiple) simple children:
> > > > > > http://hg.rabbitmq.com/rabbitmq-server/rev/812d71d0716c . That code
> > > > > > is somewhat different though, not only because it was terminating
> > > > > > multiple children (during shutdown) but also because it explicitly
> > > > > > unlinks from the child *after* creating the monitor, and /still/
> > > > > > allowed for an EXIT signal to have made its way into the mailbox
> > > > > > unexpectedly.

> > > > > The monitor_child/1 function also unlinks from the child after
> > > > > creating the monitor. That patch looks a little bit like the fixes I
> > > > > was trying. Basically it's checking for an EXIT message after
> > > > > receiving the DOWN, just in case one is in the mailbox, yes?

> > > > That's correct.

> > > > > The problem is that it might still miss an EXIT, because it might still
> > > > > not have arrived yet, even though it will later.

> > > > Yes, that's definitely true and we were aware of that problem. However,
> > > > since we know we cannot wait for the 'EXIT' forever and whatever
> > > > arbitrary timeout we choose is just someone else's race condition,
> > > > we decided that if the EXIT signal wasn't delivered expediently to
> > > > the process' mailbox, losing the real exit reason was
> > > > something we could live with in the worst case.

> > > > Since we've started merging the R15/R16 changes in though, that code
> > > > has disappeared so we're in the same boat as you guys. :)

> > > > Cheers,
> > > > Tim
