Supervisor got noproc (looks like a bug)

Alexander Petrovsky askjuise@REDACTED
Thu Sep 9 12:00:30 CEST 2021


Hi!

I've carefully re-read the docs:
- https://erlang.org/doc/man/erlang.html#unlink-1
- https://erlang.org/doc/reference_manual/processes.html#links
- https://erlang.org/doc/apps/erts/erl_dist_protocol.html#link_protocol

And it seems you are absolutely right about the current situation.
It's a tricky race, not a bug:
(a) the monitor request is emitted and is still in flight (it is
asynchronous);
(b) the child is unlinked (also asynchronous):
 (b.1) UNLINK_ID is sent and the link is deactivated (from this point
on, all EXIT messages from the linked process are dropped);
 (b.2) the linked process receives UNLINK_ID;
 (b.3) UNLINK_ID_ACK is received and the link state is removed entirely;
(a.1) the monitoring process receives the 'noproc' message.
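
To make the ordering of (a), (b) and (a.1) concrete, here is a minimal
sketch of the flow as I understand it. It is a simplified, hypothetical
rendering, not the actual supervisor.erl code: the module name
shutdown_sketch and the function stop_child/1 are made up, and the
caller is assumed to trap exits and to be linked to Pid.

```erlang
-module(shutdown_sketch).
-export([stop_child/1]).

%% Hypothetical, simplified version of the monitor/unlink/exit sequence.
%% Assumes the calling process traps exits and is linked to Pid.
stop_child(Pid) ->
    MRef = erlang:monitor(process, Pid),  % (a) asynchronous monitor request
    unlink(Pid),                          % (b) link is deactivated right away
    receive
        {'EXIT', Pid, Reason} ->
            %% The child died before the unlink: consume the matching
            %% DOWN message and report the real exit reason.
            receive
                {'DOWN', MRef, process, Pid, _} -> {error, Reason}
            end
    after 0 ->
        %% No EXIT message in the mailbox: ask the child to shut down.
        exit(Pid, shutdown),
        receive
            {'DOWN', MRef, process, Pid, shutdown} -> ok;
            {'DOWN', MRef, process, Pid, normal} -> ok;
            %% (a.1) If the child was already gone when the monitor
            %% request arrived, Other is noproc and ends up here.
            {'DOWN', MRef, process, Pid, Other} -> {error, Other}
        end
    end.
```

If the child exits on its own after the EXIT signal can no longer be
delivered but before the monitor request reaches it, the last clause is
the one that would return {error, noproc}.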

I found that the unlink protocol was changed in OTP 23, so there is an
old and a new protocol. The docs for the new one state:
"The receiver of an UNLINK_ID signal responds with an UNLINK_ID_ACK
signal. Upon reception of an UNLINK_ID signal, the corresponding
UNLINK_ID_ACK signal must be sent before any other signals are sent to
the sender of the UNLINK_ID signal."

So, the linked process termination could happen:
- between (a) and (b): in this case, the EXIT message will be emitted
and placed into the mailbox;
- between (b.1) and (b.2): in this case, the message will be emitted,
but dropped because the link is already deactivated;
- between (b.2) and (b.3): in this case, no messages can be emitted
by the linked process, according to the protocol.
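
The end result of this race, a DOWN message with reason noproc, can be
reproduced in isolation. The following is only a sketch under stated
assumptions: it uses a local, unlinked child that exits on its own
before the monitor request is issued, and the module name noproc_demo
and the helper wait_until_dead/1 are made up for illustration.

```erlang
-module(noproc_demo).
-export([run/0]).

%% Monitors a process that has already terminated and returns the
%% reason carried by the resulting DOWN message.
run() ->
    %% Hypothetical child that terminates by itself immediately.
    Pid = spawn(fun() -> ok end),
    %% Make sure the child is gone, so the monitor request is too late.
    wait_until_dead(Pid),
    MRef = erlang:monitor(process, Pid),
    receive
        {'DOWN', MRef, process, Pid, Reason} -> Reason
    end.

%% Illustrative helper: poll until the local process is no longer alive.
wait_until_dead(Pid) ->
    case is_process_alive(Pid) of
        true ->
            timer:sleep(1),
            wait_until_dead(Pid);
        false ->
            ok
    end.
```

Calling noproc_demo:run() from a shell should return noproc, since the
monitor is set up only after the child has terminated.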

It seems that the behaviour of the monitor should be changed somehow,
and the code https://github.com/erlang/otp/blob/0bad25713b0bc4a875e9ef7d9b1abcb6a2f75061/lib/stdlib/src/supervisor.erl#L957-L982
seems a little bit outdated given the async nature of monitors
and this tricky race. Also, it's 12 years old... :)

I would like to know what others and the OTP maintainers think about
this behaviour.

Wed, 8 Sep 2021 at 13:52, Maria Scott <maria-12648430@REDACTED>:
>
> Hi :)
>
> first, this is partly guesswork, so take with a grain of salt.
>
> You have a situation where the child may be terminated by the supervisor (via terminate_child) and may at the same time be terminating by itself (via {stop, ...}), is that right?
>
> While your child is running, it is linked to the supervisor, but not monitored. When the supervisor is told to shut down (terminate) a child, what it does is this (simplified, see https://github.com/erlang/otp/blob/0bad25713b0bc4a875e9ef7d9b1abcb6a2f75061/lib/stdlib/src/supervisor.erl#L923-L982 for all the details):
> (a) monitor the child
> (b) unlink the child
> (c) check for an EXIT message (in case the child already terminated before the monitoring)
> (d) if there is an EXIT message, flush out the DOWN message and return the EXIT reason (and that's it in this case)
> (e) otherwise, if no EXIT message is there, call exit(Child, shutdown)
> (f) wait for a DOWN message; reasons shutdown and normal are normal exits, everything else produces a shutdown_error
>
> Intuitively, this flow should hold no matter if and when the child terminates by itself.
> The key to understanding how the shutdown_error you describe arises is this passage from the docs for monitor/2: "The monitor request is an asynchronous signal. That is, it takes time before the signal reaches its destination." unlink/1, while it is also an asynchronous request that takes time to reach the other process, does something more: it marks the link as inactive on the process calling unlink, and "The exit signal is silently dropped if ... the corresponding link has been deactivated".
>
> So what I think is happening when the error you describe occurs is this:
> - the supervisor calls monitor(process, Child) (see (a)), but the message does not reach the child immediately
> - the supervisor unlinks the child (see (b)), deactivating the link
> - the child dies (exits by itself as a result of {stop, ...}); but as it is now unlinked, there is no EXIT message (see (c) and (d))
> - the monitor signal reaches (or rather, does not reach) the child, resulting in a DOWN message with reason noproc
> - the supervisor receives the DOWN message (see (f)), and as the reason is not shutdown or normal, it gets propagated, ultimately resulting in the shutdown_error with reason noproc
>
> As I said, this is pieced together from some (educated) guesswork ;) Don't rely on it until somebody else confirms it.
>
> Kind regards,
> Maria



-- 
Alexander Petrovsky

