<div dir="ltr"><div>You are right in that there is a race causing a 'noproc' exit reason when it should be possible to get the real exit reason. I'll write an internal ticket about this, but you are welcome to create a bug issue at <<a href="https://github.com/erlang/otp/issues">https://github.com/erlang/otp/issues</a>> as well (as pointed out by Maria).<br></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Thu, Sep 9, 2021 at 12:49 PM Alexander Petrovsky <<a href="mailto:askjuise@gmail.com">askjuise@gmail.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">One more thing, taking into account the async nature of the monitors<br>
and links, I think the following statement should make sense: if the<br>
process with Pid dies after the monitor signal is sent, the DOWN<br>
message should have the real reason, not a 'noproc'.<br></blockquote><div><div><br></div><div>No, in the distributed case we would need
to keep exit reasons for a long time (hard to determine how long) for all terminated processes in order to
satisfy such a behavior.</div><div><br></div></div><div>The behaviour is and should be: If the process with Pid dies after the monitor signal has been *received*, the DOWN message should have the real reason, not a 'noproc'. If the process is not alive at the time of the reception of the monitor signal, you will get a 'noproc' reason.<br></div><div> </div><div>Regards,</div><div>Rickard, Erlang/OTP<br></div><div><br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<br>
Thu, 9 Sep 2021 at 13:00, Alexander Petrovsky <<a href="mailto:askjuise@gmail.com" target="_blank">askjuise@gmail.com</a>> wrote:<br>
><br>
> Hi!<br>
><br>
> I've carefully re-read the docs:<br>
> - <a href="https://erlang.org/doc/man/erlang.html#unlink-1" rel="noreferrer" target="_blank">https://erlang.org/doc/man/erlang.html#unlink-1</a><br>
> - <a href="https://erlang.org/doc/reference_manual/processes.html#links" rel="noreferrer" target="_blank">https://erlang.org/doc/reference_manual/processes.html#links</a><br>
> - <a href="https://erlang.org/doc/apps/erts/erl_dist_protocol.html#link_protocol" rel="noreferrer" target="_blank">https://erlang.org/doc/apps/erts/erl_dist_protocol.html#link_protocol</a><br>
><br>
> And it seems you are absolutely right about the current situation and<br>
> it's a tricky race, not a bug:<br>
> (a) the monitor request is emitted and is still in flight (async nature).<br>
> (b) the child is unlinked (async nature):<br>
> (b.1) the unlinking process sends UNLINK_ID and deactivates the link (after this point all EXIT<br>
> messages from the linked process will be dropped);<br>
> (b.2) the linked process receives UNLINK_ID;<br>
> (b.3) the unlinking process receives UNLINK_ID_ACK and removes the link state entirely;<br>
> (a.1) the monitoring process receives a DOWN message with reason 'noproc'.<br>
><br>
> I found that the unlink protocol was changed in OTP 23, so there are old<br>
> and new protocols. The new one states:<br>
> "The receiver of an UNLINK_ID signal responds with an UNLINK_ID_ACK<br>
> signal. Upon reception of an UNLINK_ID signal, the corresponding<br>
> UNLINK_ID_ACK signal must be sent before any other signals are sent to<br>
> the sender of the UNLINK_ID signal."<br>
><br>
> So, the linked process termination could happen:<br>
> - between (a) and (b): in this case, the EXIT message will be emitted<br>
> and placed into the mailbox;<br>
> - between (b.1) and (b.2): in this case, the EXIT message will be emitted,<br>
> but dropped because the link is already deactivated;<br>
> - between (b.2) and (b.3): in this case, no messages can be emitted<br>
> by the linked process, according to the protocol.<br>
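><br>
> A minimal sketch of the second case, assuming a local process and hypothetical<br>
> names (unlink_demo, exit_after_unlink/0) chosen only for illustration. The child<br>
> exits after unlink/1 has returned, so the EXIT signal is dropped while the<br>
> monitor, which was set up while the child was alive, still reports the real reason:<br>
```erlang
-module(unlink_demo).
-export([exit_after_unlink/0]).

%% Hypothetical demo: the child terminates after the caller has unlinked it.
exit_after_unlink() ->
    OldFlag = process_flag(trap_exit, true),
    Pid = spawn_link(fun() -> receive stop -> exit(real_reason) end end),
    %% The child is blocked in receive, so the monitor is set up while it is alive.
    MRef = erlang:monitor(process, Pid),
    %% After unlink/1 returns, the link no longer has any effect on the caller.
    unlink(Pid),
    Pid ! stop,
    Result =
        receive
            {'DOWN', MRef, process, Pid, Reason} ->
                %% Check whether an EXIT message made it into the mailbox.
                receive
                    {'EXIT', Pid, _} -> {exit_delivered, Reason}
                after 0 ->
                    {exit_dropped, Reason}
                end
        end,
    process_flag(trap_exit, OldFlag),
    Result.
```
> Calling unlink_demo:exit_after_unlink() should return {exit_dropped, real_reason}.<br>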
><br>
> It seems that the behaviour of the monitor should be changed somehow,<br>
> and the code <a href="https://github.com/erlang/otp/blob/0bad25713b0bc4a875e9ef7d9b1abcb6a2f75061/lib/stdlib/src/supervisor.erl#L957-L982" rel="noreferrer" target="_blank">https://github.com/erlang/otp/blob/0bad25713b0bc4a875e9ef7d9b1abcb6a2f75061/lib/stdlib/src/supervisor.erl#L957-L982</a><br>
> seems a little outdated given the async nature of monitors<br>
> and this tricky race; also, it's 12 years old... :)<br>
><br>
> I would like to hear what others and the OTP maintainers think about this<br>
> behaviour.<br>
><br>
> Wed, 8 Sep 2021 at 13:52, Maria Scott <<a href="mailto:maria-12648430@hnc-agency.org" target="_blank">maria-12648430@hnc-agency.org</a>> wrote:<br>
> ><br>
> > Hi :)<br>
> ><br>
> > First, this is partly guesswork, so take it with a grain of salt.<br>
> ><br>
> > You have a situation where the child may be terminated by the supervisor (via terminate_child) and may at the same time be terminating by itself (via {stop, ...}), is that right?<br>
> ><br>
> > While your child is running, it is linked to the supervisor, but not monitored. When the supervisor is told to shut down (terminate) a child, what it does is this (simplified, see <a href="https://github.com/erlang/otp/blob/0bad25713b0bc4a875e9ef7d9b1abcb6a2f75061/lib/stdlib/src/supervisor.erl#L923-L982" rel="noreferrer" target="_blank">https://github.com/erlang/otp/blob/0bad25713b0bc4a875e9ef7d9b1abcb6a2f75061/lib/stdlib/src/supervisor.erl#L923-L982</a> for all the details):<br>
> > (a) monitor the child<br>
> > (b) unlink the child<br>
> > (c) check for an EXIT message (in case the child already terminated before the monitoring)<br>
> > (d) if there is an EXIT message, flush out the DOWN message and return the EXIT reason (and that's it in this case)<br>
> > (e) otherwise, if no EXIT message is there, call exit(Child, shutdown)<br>
> > (f) wait for a DOWN message; reasons shutdown and normal are normal exits, everything else produces a shutdown_error<br>
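> ><br>
> > Steps (a) to (f) can be sketched roughly as follows. This is a simplified<br>
> > illustration, not the actual supervisor.erl code: the module and function names<br>
> > (shutdown_sketch, shutdown_child/1) are hypothetical, the shutdown timeout and<br>
> > the kill fallback are left out, and the caller is assumed to be linked to the<br>
> > child and to trap exits:<br>
```erlang
-module(shutdown_sketch).
-export([shutdown_child/1]).

%% Hypothetical helper modelled on steps (a)-(f) above.
%% Assumes the caller traps exits and is linked to Child.
shutdown_child(Child) ->
    MRef = erlang:monitor(process, Child),               % (a)
    unlink(Child),                                       % (b)
    receive
        {'EXIT', Child, Reason} ->                       % (c)
            %% (d) the child was already gone: flush DOWN, return the EXIT reason
            receive {'DOWN', MRef, process, Child, _} -> ok end,
            {error, Reason}
    after 0 ->
        exit(Child, shutdown),                           % (e)
        receive                                          % (f)
            {'DOWN', MRef, process, Child, shutdown} -> ok;
            {'DOWN', MRef, process, Child, normal} -> ok;
            %% Anything else, including noproc, is reported as an error.
            {'DOWN', MRef, process, Child, Other} -> {error, Other}
        end
    end.
```
> > In this sketch, the last clause in (f) is where a DOWN message with reason noproc<br>
> > would end up being reported as an error.<br>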
> ><br>
> > Intuitively, this flow should hold no matter if and when the child terminates by itself.<br>
> > The key to understanding how the shutdown_error you describe arises is this passage from the docs for monitor/2: "The monitor request is an asynchronous signal. That is, it takes time before the signal reaches its destination." unlink/1, while it is also an asynchronous request that takes time to reach the other process, does something more: it marks the link as inactive on the process calling unlink, and "The exit signal is silently dropped if ... the corresponding link has been deactivated".<br>
> ><br>
> > So what I think is happening when the error you describe occurs is this:<br>
> > - the supervisor calls monitor(process, Child) (see (a)), but the message does not reach the child immediately<br>
> > - the supervisor unlinks the child (see (b)), deactivating the link<br>
> > - the child dies (exits by itself as a result of {stop, ...}); but as it is now unlinked, there is no EXIT message (see (c) and (d))<br>
> > - the monitor signal reaches (or rather, doesn't reach) the child, resulting in a DOWN message with reason noproc<br>
> > - the supervisor receives the DOWN message (see (f)), and as the reason is not shutdown or normal, it gets propagated, ultimately resulting in the shutdown_error with reason noproc<br>
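> ><br>
> > The noproc part can be reproduced in isolation. A minimal sketch, with a<br>
> > hypothetical module name (noproc_demo), that monitors a local process only after<br>
> > it is known to have terminated:<br>
```erlang
-module(noproc_demo).
-export([monitor_dead/0]).

%% Hypothetical demo: a monitor set up after the target has terminated.
monitor_dead() ->
    %% spawn_monitor/1 is used only to wait reliably until the process is gone;
    %% this first monitor sees the real exit reason.
    {Pid, MRef0} = spawn_monitor(fun() -> exit(real_reason) end),
    receive {'DOWN', MRef0, process, Pid, real_reason} -> ok end,
    %% This second monitor request finds no process, so the reason is noproc.
    MRef = erlang:monitor(process, Pid),
    receive
        {'DOWN', MRef, process, Pid, Reason} -> Reason
    end.
```
> > Calling noproc_demo:monitor_dead() should return noproc, even though the process<br>
> > exited with real_reason.<br>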
> ><br>
> > As I said, this is pieced together from some (educated) guesswork ;) Don't rely on it until somebody else confirms it.<br>
> ><br>
> > Kind regards,<br>
> > Maria<br>
><br>
><br>
><br>
> --<br>
> Alexander Petrovsky<br>
<br>
<br>
<br>
-- <br>
Alexander Petrovsky<br>
</blockquote></div><br clear="all"><br>-- <br><div dir="ltr" class="gmail_signature">Rickard Green, Erlang/OTP, Ericsson AB</div></div>