Supervisor got noproc (looks like a bug)

Alexander Petrovsky askjuise@REDACTED
Thu Sep 9 12:49:34 CEST 2021


One more thing, taking into account the async nature of the monitors
and links, I think the following statement should make sense: if the
process with Pid dies after the monitor signal is sent, the DOWN
message should have the real reason, not a 'noproc'.
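
To make the sequence discussed in this thread concrete, here is a minimal sketch. It is not the actual supervisor.erl code; the module name noproc_race and the functions monitor_dead/0, shutdown_sketch/1 and demo/0 are hypothetical names made up for illustration. shutdown_sketch/1 follows the simplified steps (a) to (f) quoted below and assumes the caller traps exits and is linked to the child. demo/0 does not reproduce the race itself (that depends on timing); it only reproduces the end state of the race, deterministically: no EXIT message is available and the monitor is set up after the child is already gone.

```erlang
-module(noproc_race).
-export([monitor_dead/0, shutdown_sketch/1, demo/0]).

%% Monitoring a process that is already dead yields a 'DOWN' message
%% with reason noproc, not the reason the process actually exited with.
monitor_dead() ->
    {Pid, SpawnRef} = spawn_monitor(fun() -> exit(real_reason) end),
    %% Wait until the process is definitely gone.
    receive {'DOWN', SpawnRef, process, Pid, real_reason} -> ok end,
    Ref = erlang:monitor(process, Pid),
    receive {'DOWN', Ref, process, Pid, Reason} -> Reason end.

%% Hypothetical, simplified version of the shutdown flow, steps (a)-(f).
%% Assumption: the caller traps exits and is linked to Pid.
shutdown_sketch(Pid) ->
    Ref = erlang:monitor(process, Pid),          % (a) monitor the child
    unlink(Pid),                                 % (b) unlink the child
    receive
        {'EXIT', Pid, Reason} ->                 % (c) child died before unlink
            receive                              % (d) flush the 'DOWN' message
                {'DOWN', Ref, process, Pid, _} -> {error, Reason}
            end
    after 0 ->
        exit(Pid, shutdown),                     % (e) ask the child to stop
        receive                                  % (f) wait for the 'DOWN'
            {'DOWN', Ref, process, Pid, shutdown} -> ok;
            {'DOWN', Ref, process, Pid, normal}   -> ok;
            {'DOWN', Ref, process, Pid, Other}    -> {error, Other}
        end
    end.

%% End state of the race: the child is dead, no 'EXIT' message exists
%% (here because the child was never linked), and the monitor comes too late.
demo() ->
    {Pid, SpawnRef} = spawn_monitor(fun() -> ok end),
    receive {'DOWN', SpawnRef, process, Pid, normal} -> ok end,
    shutdown_sketch(Pid).
```

In this sketch noproc_race:monitor_dead() returns noproc even though the process exited with real_reason, and noproc_race:demo() returns {error, noproc}, which is the shape of the shutdown_error discussed below.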

Thu, 9 Sep 2021 at 13:00, Alexander Petrovsky <askjuise@REDACTED>:
>
> Hi!
>
> I've carefully re-read the docs:
> - https://erlang.org/doc/man/erlang.html#unlink-1
> - https://erlang.org/doc/reference_manual/processes.html#links
> - https://erlang.org/doc/apps/erts/erl_dist_protocol.html#link_protocol
>
> And it seems you are absolutely right about the current situation:
> it's a tricky race, not a bug:
> (a) the monitor request is emitted and is still in flight (async nature).
> (b) the child is unlinked (async nature):
>  (b.1) the supervisor sends UNLINK_ID and deactivates the link (after this
> point all EXIT messages from the linked process will be dropped);
>  (b.2) the linked process receives UNLINK_ID;
>  (b.3) the supervisor receives UNLINK_ID_ACK and removes the link state entirely;
> (a.1) the monitor results in a DOWN message with reason 'noproc'.
>
> I found that the unlink protocol was changed in OTP 23, so there are old
> and new protocols. The description of the new one states:
> "The receiver of an UNLINK_ID signal responds with an UNLINK_ID_ACK
> signal. Upon reception of an UNLINK_ID signal, the corresponding
> UNLINK_ID_ACK signal must be sent before any other signals are sent to
> the sender of the UNLINK_ID signal."
>
> So, the linked process termination could happen:
> - between (a) and (b): in this case, the EXIT message will be emitted
> and placed into the mailbox;
> - between (b.1) and (b.2): in this case, the message will be emitted,
> but dropped because the link is already deactivated;
> - between (b.2) and (b.3): in this case, no messages can be emitted
> by the linked process, according to the protocol.
>
> It seems that the behaviour of the monitor should be changed somehow,
> and the code https://github.com/erlang/otp/blob/0bad25713b0bc4a875e9ef7d9b1abcb6a2f75061/lib/stdlib/src/supervisor.erl#L957-L982
> seems a little outdated given the async nature of monitors
> and this tricky race. Also, it's 12 years old... :)
>
> I would like to hear what others and the OTP maintainers think about this
> behaviour.
>
> Wed, 8 Sep 2021 at 13:52, Maria Scott <maria-12648430@REDACTED>:
> >
> > Hi :)
> >
> > First, this is partly guesswork, so take it with a grain of salt.
> >
> > You have a situation where the child may be terminated by the supervisor (via terminate_child) and may at the same time be terminating by itself (via {stop, ...}), is that right?
> >
> > While your child is running, it is linked to the supervisor, but not monitored. When the supervisor is told to shut down (terminate) a child, what it does is this (simplified, see https://github.com/erlang/otp/blob/0bad25713b0bc4a875e9ef7d9b1abcb6a2f75061/lib/stdlib/src/supervisor.erl#L923-L982 for all the details):
> > (a) monitor the child
> > (b) unlink the child
> > (c) check for an EXIT message (in case the child already terminated before the monitoring)
> > (d) if there is an EXIT message, flush out the DOWN message and return the EXIT reason (and that's it in this case)
> > (e) otherwise, if no EXIT message is there, call exit(Child, shutdown)
> > (f) wait for a DOWN message; reasons shutdown and normal are normal exits, everything else produces a shutdown_error
> >
> > Intuitively, this flow should hold no matter if and when the child terminates by itself.
> > The key to understanding how the shutdown_error you describe arises is this passage from the docs for monitor/2: "The monitor request is an asynchronous signal. That is, it takes time before the signal reaches its destination." unlink/1, while it is also an asynchronous request that takes time to reach the other process, does something more: it marks the link as inactive on the process calling unlink, and "The exit signal is silently dropped if ... the corresponding link has been deactivated".
> >
> > So what I think is happening when the error you describe occurs is this:
> > - the supervisor calls monitor(process, Child) (see (a)), but the message does not reach the child immediately
> > - the supervisor unlinks the child (see (b)), deactivating the link
> > - the child dies (exits by itself as a result of {stop, ...}); but as it is now unlinked, there is no EXIT message (see (c) and (d))
> > - the monitor signal reaches (or rather, doesn't reach) the child, which is already dead, resulting in a DOWN message with reason noproc
> > - the supervisor receives the DOWN message (see (f)), and as the reason is not shutdown or normal, it gets propagated, ultimately resulting in the shutdown_error with reason noproc
> >
> > As I said, this is pieced together from some (educated) guesswork ;) Don't rely on it until somebody else confirms it.
> >
> > Kind regards,
> > Maria
>
>
>
> --
> Alexander Petrovsky



-- 
Alexander Petrovsky

