[erlang-bugs] Supervisor terminate_child race

Wed May 15 17:11:17 CEST 2013

On 15 May 2013, at 14:54, Siri Hansen wrote:

> Then again... it is up to the child's start function to create the link, and from the supervisor's point of view, the only place to add the monitor would be when the start function returns - which would be just another place to get a race :(
> 

Well quite. *sigh*

Perhaps what we do in cloud haskell might be instructive after all, though the approach runs counter to the APIs which its OTP forebears use. Our supervisor performs the actual `spawn' itself, so the child spec provides as its startup term (which is roughly equivalent to an MFArgs tuple), not a function which spawns the new process, but rather the code for its process' main loop. The disadvantage here is that the API is more constrained (in terms of what the main loop looks like), but the advantage is the supervisor can insert arbitrary code into the start phase of the child's server loop. Thus our supervisor performs the link (from child to parent) and forces the child process to wait until the monitor is set up correctly, before actually entering its loop. This ensures we don't end up with a race with regards startup, linking and monitor establishment. There relevant bit of the code looks roughly like this:

wrapClosure proc spec' =
  let chId = childKey spec' in do
    supervisor <- getSelfPid
    pid <- spawnLocal $ do
      link supervisor    -- die if our parent dies
      () <- expect        -- wait for a start signal
      proc >>= checkExitType chId  -- evaluate the child's loop
    void $ monitor pid  -- synchronous call to establish a monitor
    send pid ()    -- tell the child to go into its main loop
    return $ Right $ ChildRunning pid

Of course, because of this design, our gen_server API looks completely different! The start function, for example, doesn't spawn a process, but rather evaluates the `init' callback and enters the gen server's main loop (or crashes) immediately with the return value, leaving the `spawn' part to its clients. The supervisor is, of course, one of these clients: In fact our supervisor, like it's OTP inspiration, is itself a gen_server (we call them managed processes), and thus its start function never returns either:

-- | Starts a supervisor. 
...
start :: RestartStrategy -> [ChildSpec] -> ManagedProcessLoop SupervisorState
start strategy' specs' = ManagedProcess.start (strategy', specs') supInit serverDefinition

Now obviously, given that Erlang has been used in the real world for decades, we can't go changing gen server's start_link or the supervisor child spec APIs. But is there a way to achieve something similar without carving things up too much? I'm struggling to think of one, but it would good if we could avoid the race altogether.

Cheers,
Tim

> 
> 2013/5/15 Tim Watson <watson.timothy@REDACTED>
> On 15 May 2013, at 11:17, Robert Virding wrote:
> 
>> Do you mean only using monitors in the supervisor, and no links? If so that would not work as you would then not get an exit signal automatically sent to the child when the supervisor dies.
>> Which you do want. Or have I misunderstood you?
>> 
> 
> Oh gosh, how embarrasing. I was thinking in terms of Uni-directional Links (viz A Unified Semantics for Future Erlang, Svensson et al), and linking child to parent (so as to propagate supervisor exits) but not the other way around. Of course we can't do that - just ignore this suggestion. [note: I've been implementing the supervisor API for cloud haskell in my spare time and got confused between those semantics (viz http://haskell-distributed.github.io/static/semantics.pdf) and what I do for a day job in the *real world*].
> 
> But switching all the supervisor's signal handling to rely on monitor notifications rather than trapped exits (which might be ignored) sounds good to me. The use of linking would be there to guarantee supervisor death is propagated correctly, but we could switch away from handling child 'EXIT' signals to handling 'DOWN' notifications instead. This would IMO be a bit cleaner.
> 
> Cheers,
> Tim
> 

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://erlang.org/pipermail/erlang-bugs/attachments/20130515/954b1ace/attachment.htm>