Supervisor got noproc (looks like a bug)

Rickard Green rickard@REDACTED
Thu Sep 9 22:37:22 CEST 2021


You are right in that there is a race causing a 'noproc' exit reason when
it should be possible to get the real exit reason. I'll write an internal
ticket about this, but you are welcome to create a bug issue at
<https://github.com/erlang/otp/issues> as well (as pointed out by Maria).

On Thu, Sep 9, 2021 at 12:49 PM Alexander Petrovsky <askjuise@REDACTED>
wrote:

> One more thing, taking into account the async nature of the monitors
> and links, I think the following statement should make sense: if the
> process with Pid dies after the monitor signal is sent, the DOWN
> message should have the real reason, not a 'noproc'.
>

No, in the distributed case we would need to keep exit reasons for a long
time (hard to determine how long) for all terminated processes in order to
satisfy such a behavior.

The behaviour is and should be: If the process with Pid dies after the
monitor signal has been *received*, the DOWN message should have the real
reason, not a 'noproc'. If the process is not alive at the time of the
reception of the monitor signal, you will get a 'noproc' reason.
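
A minimal sketch of the two cases above, for the local (non-distributed)
case only. The module and function names are hypothetical and chosen for
illustration; only standard BIFs (spawn_monitor/1, erlang:monitor/2) are
used.

```erlang
-module(monitor_reason_demo).
-export([monitor_dead/0, monitor_live/0]).

%% Case 1: the process is already dead when the monitor signal arrives.
%% We first wait for a DOWN from an initial monitor to be sure the
%% process has terminated, then monitor it again.
monitor_dead() ->
    {Pid, Ref0} = spawn_monitor(fun() -> exit(real_reason) end),
    receive {'DOWN', Ref0, process, Pid, real_reason} -> ok end,
    Ref = erlang:monitor(process, Pid),
    receive {'DOWN', Ref, process, Pid, Reason} -> Reason end.

%% Case 2: the process is alive when the monitor signal is received.
%% The monitor signal is sent before the 'stop' message, and signals
%% between one pair of processes are delivered in order, so the
%% monitor is in place before the process exits.
monitor_live() ->
    Pid = spawn(fun() -> receive stop -> exit(real_reason) end end),
    Ref = erlang:monitor(process, Pid),
    Pid ! stop,
    receive {'DOWN', Ref, process, Pid, Reason} -> Reason end.
```

Here monitor_reason_demo:monitor_dead() returns noproc, while
monitor_reason_demo:monitor_live() returns real_reason.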

Regards,
Rickard, Erlang/OTP


> Thu, 9 Sep 2021 at 13:00, Alexander Petrovsky <askjuise@REDACTED>:
> >
> > Hi!
> >
> > I've carefully re-read the docs:
> > - https://erlang.org/doc/man/erlang.html#unlink-1
> > - https://erlang.org/doc/reference_manual/processes.html#links
> > - https://erlang.org/doc/apps/erts/erl_dist_protocol.html#link_protocol
> >
> > And it seems you are absolutely right about the current situation and
> > it's a tricky race, not a bug:
> > (a) the monitor request is emitted and is still in flight (async nature);
> > (b) the child is unlinked (async nature):
> >  (b.1) UNLINK_ID is sent and the link is deactivated (after this point
> > all EXIT messages from the linked process will be dropped);
> >  (b.2) the linked process receives UNLINK_ID;
> >  (b.3) UNLINK_ID_ACK is received and the link state is removed entirely;
> > (a.1) the monitoring process receives a DOWN message with reason 'noproc'.
> >
> > I found that the unlink protocol changed in OTP 23, so there is an old
> > and a new protocol; the new one states:
> > "The receiver of an UNLINK_ID signal responds with an UNLINK_ID_ACK
> > signal. Upon reception of an UNLINK_ID signal, the corresponding
> > UNLINK_ID_ACK signal must be sent before any other signals are sent to
> > the sender of the UNLINK_ID signal."
> >
> > So, the linked process termination could happen:
> > - between (a) and (b): in this case, the EXIT message will be emitted
> > and placed into the mailbox;
> > - between (b.1) and (b.2): in this case, the message will be emitted,
> > but rejected because the link is already deactivated;
> > - between (b.2) and (b.3): in this case, no messages could be emitted
> > by the linked process, according to the protocol.
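
A minimal sketch of the "link already deactivated" case, assuming a local
node and a process that traps exits. The module name, function name and
the 200 ms timeout are hypothetical choices for illustration, not taken
from the thread.

```erlang
-module(unlink_demo).
-export([exit_after_unlink/0]).

%% The child exits only after the caller has unlinked it, so the exit
%% signal travelling over the (now deactivated) link is dropped and no
%% 'EXIT' message reaches the caller's mailbox.
exit_after_unlink() ->
    OldFlag = process_flag(trap_exit, true),
    Pid = spawn_link(fun() -> receive stop -> exit(boom) end end),
    true = unlink(Pid),
    Pid ! stop,
    Result = receive
                 {'EXIT', Pid, Reason} -> {got_exit, Reason}
             after 200 ->
                 no_exit
             end,
    %% Restore the caller's original trap_exit setting.
    process_flag(trap_exit, OldFlag),
    Result.
```

unlink_demo:exit_after_unlink() returns no_exit: once unlink/1 has
returned, the link has no further effect on the caller.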
> >
> > It seems that the behaviour of the monitor should be changed somehow,
> > and the code at
> > https://github.com/erlang/otp/blob/0bad25713b0bc4a875e9ef7d9b1abcb6a2f75061/lib/stdlib/src/supervisor.erl#L957-L982
> > seems a little outdated given the async nature of monitors and this
> > tricky race; also, it's 12 years old... :)
> >
> > I would like to hear what others and the OTP maintainers think about
> > this behaviour.
> >
> > Wed, 8 Sep 2021 at 13:52, Maria Scott <maria-12648430@REDACTED>:
> > >
> > > Hi :)
> > >
> > > First, this is partly guesswork, so take it with a grain of salt.
> > >
> > > You have a situation where the child may be terminated by the
> > > supervisor (via terminate_child) and may at the same time be
> > > terminating by itself (via {stop, ...}), is that right?
> > >
> > > While your child is running, it is linked to the supervisor, but not
> > > monitored. When the supervisor is told to shut down (terminate) a
> > > child, what it does is this (simplified, see
> > > https://github.com/erlang/otp/blob/0bad25713b0bc4a875e9ef7d9b1abcb6a2f75061/lib/stdlib/src/supervisor.erl#L923-L982
> > > for all the details):
> > > (a) monitor the child
> > > (b) unlink the child
> > > (c) check for an EXIT message (in case the child already terminated
> > > before the monitoring)
> > > (d) if there is an EXIT message, flush out the DOWN message and return
> > > the EXIT reason (and that's it in this case)
> > > (e) otherwise, if no EXIT message is there, call exit(Child, shutdown)
> > > (f) wait for a DOWN message; reasons shutdown and normal are normal
> > > exits, everything else produces a shutdown_error
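
Steps (a) to (f) can be sketched roughly as follows. This is an
illustration under stated assumptions, not the supervisor.erl code linked
above: the module and function names are hypothetical, the caller is
assumed to trap exits and be linked to the child, the {error, Reason}
return shape is an assumption, and shutdown timeouts and brutal_kill are
left out.

```erlang
-module(shutdown_sketch).
-export([shutdown_child/1, demo/0]).

shutdown_child(Pid) ->
    Ref = erlang:monitor(process, Pid),                 % (a)
    unlink(Pid),                                        % (b)
    receive
        {'EXIT', Pid, Reason} ->                        % (c)
            %% (d) the child died before the unlink: flush the DOWN
            %% message and report the reason carried by the EXIT.
            receive {'DOWN', Ref, process, Pid, _} -> ok end,
            {error, Reason}
    after 0 ->
        exit(Pid, shutdown),                            % (e)
        receive                                         % (f)
            {'DOWN', Ref, process, Pid, shutdown} -> ok;
            {'DOWN', Ref, process, Pid, normal} -> ok;
            %% Anything else, including noproc, is treated as an error.
            {'DOWN', Ref, process, Pid, Other} -> {error, Other}
        end
    end.

%% Shut down a child that is alive and idle.
demo() ->
    OldFlag = process_flag(trap_exit, true),
    Pid = spawn_link(fun() -> receive after infinity -> ok end end),
    Result = shutdown_child(Pid),
    process_flag(trap_exit, OldFlag),
    Result.
```

shutdown_sketch:demo() returns ok, because the child is alive when both
the monitor signal and the shutdown exit signal arrive. The last clause
of the final receive is where a noproc reason would end up being reported
as an error.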
> > >
> > > Intuitively, this flow should hold no matter if and when the child
> > > terminates by itself.
> > > The key to understanding how the shutdown_error you describe arises is
> > > this passage from the docs for monitor/2: "The monitor request is an
> > > asynchronous signal. That is, it takes time before the signal reaches
> > > its destination." unlink/1, while it is also an asynchronous request
> > > that takes time to reach the other process, does something more: it
> > > marks the link as inactive on the process calling unlink, and "The exit
> > > signal is silently dropped if ... the corresponding link has been
> > > deactivated".
> > >
> > > So what I think is happening when the error you describe occurs is
> > > this:
> > > - the supervisor calls monitor(process, Child) (see (a)), but the
> > > monitor signal does not reach the child immediately
> > > - the supervisor unlinks the child (see (b)), deactivating the link
> > > - the child dies (exits by itself as a result of {stop, ...}); but as
> > > it is now unlinked, there is no EXIT message (see (c) and (d))
> > > - the monitor signal reaches (or rather, fails to reach) the child,
> > > resulting in a DOWN message with reason noproc
> > > - the supervisor receives the DOWN message (see (f)), and as the
> > > reason is not shutdown or normal, it gets propagated, ultimately
> > > resulting in the shutdown_error with reason noproc
> > >
> > > As I said, this is pieced together from some (educated) guesswork ;)
> > > Don't rely on it until somebody else confirms it.
> > >
> > > Kind regards,
> > > Maria
> >
> >
> >
> > --
> > Alexander Petrovsky
>
>
>
> --
> Alexander Petrovsky
>


-- 
Rickard Green, Erlang/OTP, Ericsson AB

