Supervisor got noproc (looks like a bug)

Fri Sep 10 14:23:30 CEST 2021

> > One more thing, taking into account the async nature of the monitors
> >  and links, I think the following statement should make sense: if the
> >  process with Pid dies after the monitor signal is sent, the DOWN
> >  message should have the real reason, not a 'noproc'.

Within the constraint you outline, "process dies after the monitor signal was sent", yes, it would make sense. But...

> No, in the distributed case we would need to keep exit reasons for a long time (hard to determine how long) for all terminated processes in order to satisfy such a behavior.

... what Rickard said. On top of that, to achieve this the solution would have to rely on an order of events in global time, i.e. knowing that what happened in process A happened before what happened in process B, which is practically impossible to do.

Anyway, I have been rolling this over in my mind a bit, in the context of the supervisor terminate-a-child behavior, and... is it _necessary_ to unlink the child after setting a monitor on it? Isn't it in fact a bit dangerous even? What if the supervisor crashes or gets killed right after unlinking the child? The child will be left running, unaware of the fact that it has become an orphan. Correct?

So instead, what if we just keep the link? (The following assumes that there will never be messages from a process after its 'EXIT' message, something I think I read somewhere once, but can't find right now.)

If the child is well-behaved (has not unlinked itself), at the supervisor side after it told the child to exit, what we can expect to receive is either:
a) {'EXIT', ..., Reason} followed by {'DOWN', ..., noproc} if the child died before the monitor signal reached it, or
b) {'DOWN', ..., Reason} followed by {'EXIT', ..., Reason} if the monitor signal has reached the child before it died

If the child is naughty (unlinked itself), things are a bit trickier:
c) {'DOWN', ..., noproc} if the child died before the monitor signal reached it, or
d) {'DOWN', ..., Reason} _not_ followed by an 'EXIT' message
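
To make the 'noproc' outcome in cases a) and c) concrete, here is a minimal sketch of a monitor that is set up too late. The module and function names (noproc_demo, late_monitor/0) are made up for illustration; the only thing it relies on is that monitoring an already dead process delivers a 'DOWN' message with reason noproc.

```erlang
-module(noproc_demo).
-export([late_monitor/0]).

%% Monitor a process that is already dead and return the reason
%% reported in the resulting 'DOWN' message.
late_monitor() ->
    %% The first monitor is only there so we can wait until the child
    %% is definitely gone; it sees the real exit reason.
    {Pid, Ref0} = spawn_monitor(fun() -> exit(real_reason) end),
    receive
        {'DOWN', Ref0, process, Pid, real_reason} -> ok
    end,
    %% The second monitor arrives after the child died, so the real
    %% exit reason is no longer available.
    Ref1 = erlang:monitor(process, Pid),
    receive
        {'DOWN', Ref1, process, Pid, Reason} -> Reason
    end.
```

Calling noproc_demo:late_monitor() returns noproc, although the child exited with real_reason.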

The following approach should take care of all of the above cases, I think:

* set up a monitor on the child, and leave the link in place
* send shutdown (+kill on shutdown timeout) or kill, whatever the shutdown strategy of the child requires
* use a selective receive with two clauses, one for 'EXIT', one for 'DOWN', and...
  * if {'EXIT', ...} is received first, we have case a), and can just flush out the associated 'DOWN' message via erlang:demonitor(MonitorRef, [flush])
  * if a {'DOWN', ..., noproc} message is received first, we have case c). The child is gone, and we will never get at the exit reason
  * if a {'DOWN', ..., Reason} message is received, we have either case b) or d). We know the child exited and for what reason. We try to flush out the possibly existing 'EXIT' message, which may or may not be there
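
A rough sketch of the steps above, to make the idea easier to poke at. This is not the code in supervisor.erl; the module name, the function names and the {ok, Reason} / {error, noproc} return values are all made up for illustration. It assumes the caller traps exits and is linked to the child, as a supervisor is, and it only covers the "shutdown, then kill on timeout" strategy.

```erlang
-module(keep_link_sketch).
-export([shutdown_child/2]).

%% Hypothetical helper. Timeout is in milliseconds.
shutdown_child(Pid, Timeout) ->
    %% Set up the monitor and leave the link in place.
    MRef = erlang:monitor(process, Pid),
    exit(Pid, shutdown),
    wait_for_child(Pid, MRef, Timeout).

wait_for_child(Pid, MRef, Timeout) ->
    receive
        {'EXIT', Pid, Reason} ->
            %% Case a): drop the monitor and discard its 'DOWN'
            %% message if it has already been delivered.
            erlang:demonitor(MRef, [flush]),
            {ok, Reason};
        {'DOWN', MRef, process, Pid, noproc} ->
            %% Case c): the child is gone, the reason is lost.
            {error, noproc};
        {'DOWN', MRef, process, Pid, Reason} ->
            %% Case b) or d): remove a possibly pending 'EXIT'.
            flush_exit(Pid),
            {ok, Reason}
    after Timeout ->
        %% Shutdown timeout expired: kill and wait for the outcome.
        exit(Pid, kill),
        wait_for_child(Pid, MRef, infinity)
    end.

%% Unlink first so no new 'EXIT' can arrive, then remove one that
%% may already be sitting in the mailbox.
flush_exit(Pid) ->
    unlink(Pid),
    receive
        {'EXIT', Pid, _} -> ok
    after 0 ->
        ok
    end.
```

With process_flag(trap_exit, true) set in the caller and a child started via spawn_link that just sits in a receive, keep_link_sketch:shutdown_child(Pid, 1000) should give {ok, shutdown} regardless of whether 'EXIT' or 'DOWN' is picked up first.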

What do you think? Make any sense?
Regards,
Maria