Supervisor got noproc (looks like a bug)

Fri Sep 10 21:40:34 CEST 2021

On Fri, Sep 10, 2021 at 2:23 PM Maria Scott <maria-12648430@REDACTED>
wrote:

> > > One more thing, taking into account the async nature of the monitors
> > >  and links, I think the following statement should make sense: if the
> > >  process with Pid dies after the monitor signal is sent, the DOWN
> > >  message should have the real reason, not a 'noproc'.
>
> Within the constraint you outline, "process dies after the monitor signal
> was sent", yes, it would make sense. But...
>
> > No, in the distributed case we would need to keep exit reasons for a
> > long time (hard to determine how long) for all terminated processes in
> > order to satisfy such a behavior.
>
> ... what Rickard said, plus to achieve this the solution would have to use
> an order of events in global time, i.e. what happened in process A happened
> before what happened in process B, which is practically impossible to do.
>
>
>
> Anyway, I have been rolling this over in my mind a bit, in the context of
> the supervisor terminate-a-child behavior, and... is it _necessary_ to
> unlink the child after setting a monitor on it? Isn't it in fact a bit
> dangerous even? What if the supervisor crashes or gets killed right after
> unlinking the child? It will be left running, unaware of the fact that it
> has become an orphan. Correct?
>
>
Yes

> So instead, what if we just keep the link?

Yes, one wants to keep the link as long as possible, but we eventually want
to perform the unlink unless we see an 'EXIT' message.

> (The following assumes that there will never be messages from a process
> after the 'EXIT' message, something which I think I did read somewhere
> once, but can't find right now).
>
>
Yes. The signal order guarantee <
https://erlang.org/doc/reference_manual/processes.html#signal-delivery>
promises that two signals sent from one process to another are received in
the same order as sent (if both are received). 'EXIT' and 'DOWN' signals
are sent after the process has entered an exiting state, so signals sent by
the process itself when it was alive, such as normal messages, have been
sent before 'EXIT' and 'DOWN' signals. There might however be other signals
sent on behalf of the terminated process after 'EXIT' and 'DOWN' signals
have been sent. For example, a process-info-reply informing the sender of a
process-info-request that the process is not alive. Note that the order
between 'EXIT' and 'DOWN' signals from a process is undefined. That is, if
you have both a link and a monitor to a process, you don't know which one
will be received first of the 'DOWN' and the 'EXIT' signals.
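
A minimal sketch of what this means for a receiver, assuming a process
that traps exits. The module and function names are made up for
illustration. The child sends an ordinary message while alive and then
exits, so that message is taken out of the mailbox first, while the
'EXIT' and 'DOWN' messages may follow in either order:

```erlang
-module(signal_order_sketch).
-export([run/0]).

%% Hypothetical demo. Spawns a linked and monitored child that sends one
%% message and then exits. Returns the order in which the three resulting
%% messages were taken out of the caller's mailbox.
%% Note: this turns on trap_exit in the calling process.
run() ->
    process_flag(trap_exit, true),
    Parent = self(),
    %% spawn_opt with [link, monitor] sets up both atomically and
    %% returns {Pid, MonitorRef}.
    {Pid, MRef} = spawn_opt(fun() ->
                                    Parent ! {hello, self()},
                                    exit(done)
                            end, [link, monitor]),
    collect(Pid, MRef, 3, []).

collect(_Pid, _MRef, 0, Acc) ->
    lists:reverse(Acc);
collect(Pid, MRef, N, Acc) ->
    receive
        {hello, Pid} ->
            collect(Pid, MRef, N - 1, [hello | Acc]);
        {'EXIT', Pid, done} ->
            collect(Pid, MRef, N - 1, ['EXIT' | Acc]);
        {'DOWN', MRef, process, Pid, done} ->
            collect(Pid, MRef, N - 1, ['DOWN' | Acc])
    after 5000 ->
        exit(timeout)
    end.
```

The first element of the result is always hello. The remaining two are
'EXIT' and 'DOWN' in no particular order, so code should not depend on
which of them comes first.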

> If the child is well-behaved (has not unlinked itself), at the supervisor
> side after it told the child to exit, what we can expect to receive is
> either:
> a) {'EXIT', ..., Reason} followed by {'DOWN', ..., noproc} if the child
> died before the monitor signal reached it, or
> b) {'DOWN', ..., Reason} followed by {'EXIT', ..., Reason} if the monitor
> signal has reached the child before it died
>
> If the child is naughty (unlinked itself), things are a bit trickier:
> c) {'DOWN', ..., noproc} if the child died before the monitor signal
> reached it, or
> d) {'DOWN', ..., Reason} _not_ followed by an 'EXIT' message
>
> The following approach should take care of all of the above cases, I think:
>
> * set up a monitor on the child, and leave the link in place
> * send shutdown (+kill on shutdown timeout) or kill, whatever the shutdown
> strategy of the child requires
> * use a selective receive with two clauses, one for 'EXIT', one for
> 'DOWN', and...
>   * if {'EXIT', ...} is received first, we have case (a), and can just
> flush out the associated 'DOWN' message via demonitor with flush
>   * if a {'DOWN', ..., noproc} message is received first, we have case c).
> The child is gone, and we will never get at the exit reason
>   * if a {'DOWN', ..., Reason} message is received, we have either case b)
> or d). We know the child exited and for what reason. We try to flush out
> the possibly existing 'EXIT' message, which may or may not be there
>
> What do you think? Make any sense?
>

Yes, it makes sense, but if we see no 'EXIT' message after a 'DOWN'
message, we want to perform an unlink and then flush the message queue for
an 'EXIT' message. If we got a 'DOWN' message with 'noproc' and an 'EXIT'
message appears (in the flush, or before), we use the exit reason of the
'EXIT' message. If the 'DOWN' message arrived with a non-noproc reason, we
can just ignore the exit reason in the 'EXIT' message, should one have
arrived.
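
Put together, the procedure could look roughly like the sketch below.
This is not the actual supervisor code. The module name, the function
names and the single Timeout argument are assumptions made for
illustration, and the caller is assumed to trap exits and to be linked
to the child:

```erlang
-module(shutdown_sketch).
-export([shutdown/2]).

%% Hypothetical helper. The caller must trap exits and be linked to Pid.
%% Returns the best exit reason that could be determined.
shutdown(Pid, Timeout) ->
    MRef = erlang:monitor(process, Pid),   % the link is left in place
    exit(Pid, shutdown),
    wait(Pid, MRef, Timeout).

wait(Pid, MRef, Timeout) ->
    receive
        {'EXIT', Pid, Reason} ->
            %% The link has triggered; only the 'DOWN' message may still
            %% be pending, so drop the monitor and flush it.
            erlang:demonitor(MRef, [flush]),
            Reason;
        {'DOWN', MRef, process, Pid, DownReason} ->
            %% The child is gone. Remove the link and pick up an 'EXIT'
            %% message if one has already been delivered.
            case unlink_flush(Pid) of
                {exit, ExitReason} when DownReason =:= noproc ->
                    ExitReason;            % 'EXIT' carries the real reason
                _ ->
                    DownReason             % may be noproc if no 'EXIT' came
            end
    after Timeout ->
        exit(Pid, kill),
        wait(Pid, MRef, infinity)
    end.

%% Once unlink/1 has returned, the link can no longer place an 'EXIT'
%% message in our queue, so a receive with 'after 0' is enough to flush.
unlink_flush(Pid) ->
    unlink(Pid),
    receive
        {'EXIT', Pid, Reason} -> {exit, Reason}
    after 0 ->
        none
    end.
```

For example, calling shutdown_sketch:shutdown(Pid, 5000) from a process
that traps exits and is linked to a child that does not trap exits
returns shutdown and leaves neither an 'EXIT' nor a 'DOWN' message from
that child in the caller's message queue.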

Without the unlink and 'EXIT' message flush, we might end up with a stray
'EXIT' message in the message queue of the supervisor. Whether that is a
problem for the supervisor, I do not know.

> Regards,
> Maria
>

Regards,
Rickard
-- 
Rickard Green, Erlang/OTP, Ericsson AB