<div dir="ltr">Bryan and Tim, your analysis is very good, and the problem is complicated. I don't see a "watertight" solution right now, and I cannot spend too much time pondering it without this case having a real priority. I have written a ticket for it, and it will be prioritized along with all other backlog items. Any further thoughts and contributions will be very much appreciated :)<div style>
Thanks again</div><div style>/siri</div></div><div class="gmail_extra"><br><br><div class="gmail_quote">2013/4/30 Tim Watson <span dir="ltr">&lt;<a href="mailto:watson.timothy@gmail.com" target="_blank">watson.timothy@gmail.com</a>&gt;</span><br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div style="word-wrap:break-word">Hi Bryan,<div><br><div><div class="im"><div>On 30 Apr 2013, at 18:34, Bryan Fink wrote:</div>
<blockquote type="cite"><div><blockquote type="cite"><font color="#000000"><br></font></blockquote><blockquote type="cite">But twiddling the timing there is just as racy, as you've noticed, right?<br></blockquote><br>
Correct. The length of the timeout is irrelevant. The EXIT signal is<br>not guaranteed to arrive within any specific amount of time.<br><br></div></blockquote><div><br></div></div><div>Indeed. Almost a halting problem, this, isn't it? :)</div>
<div class="im"><br><blockquote type="cite"><div><blockquote type="cite"><br></blockquote><blockquote type="cite">Isn't the point that the EXIT signal might /never/ come, if the child un-links, or might come *after* the 'DOWN' if the race you've located occurs? Surely you've got to be able to handle either case?<br>
</blockquote><br>Yes, the point of the monitor is to handle the case where the EXIT<br>never comes (because the child unlinks). It is not the case, however,<br>that the EXIT always arrives after the DOWN in the race I'm seeing.<br>
They might both be delayed.<br><br></div></blockquote><div><br></div></div><div>Waiting without a timeout for the 'DOWN' is acceptable, because you've got a guarantee (via the runtime) that it *will* arrive, no matter what state the target process was in when you created the monitor. Waiting some arbitrary time for the 'EXIT' is a real problem though, because you could wait forever.</div>
<div class="im"><br><blockquote type="cite"><div>Handling either order is important, but the problem with this race is<br>that only the EXIT message contains the actual exit reason when this<br>happens. The 'noproc' in the DOWN is just saying that there was no<br>
process to monitor.<br></div></blockquote><div><br></div></div><div>Indeed. But it could equally be true that the 'EXIT' signal was never dispatched, because the child process unlinked before it died. You can't wait forever for the 'EXIT' after you've seen a 'DOWN' with 'noproc' as the reason, so now you've got to choose how long to wait, but whatever timing works for one particular case isn't going to solve the general problem.</div>
<div class="im"><br><blockquote type="cite"><div><br><blockquote type="cite">We ran into something similar with our supervisor2 fork a while back, whilst terminating (multiple) simple children: <a href="http://hg.rabbitmq.com/rabbitmq-server/rev/812d71d0716c" target="_blank">http://hg.rabbitmq.com/rabbitmq-server/rev/812d71d0716c</a> . That code is somewhat different though, not only because it was terminating multiple children (during shutdown) but also because it explicitly unlinks from the child *after* creating the monitor, and /still/ allowed for an EXIT signal to have made its way into the mailbox unexpectedly.<br>
</blockquote><br>The monitor_child/1 function also unlinks from the child after<br>creating the monitor. That patch looks a little bit like the fixes I<br>was trying. Basically it's checking for an EXIT message after<br>
receiving the DOWN, just in case one is in the mailbox, yes?</div></blockquote><div><br></div></div><div>That's correct. </div><div class="im"><br><blockquote type="cite"><div> The problem is that it might still miss an EXIT, because it might still<br>
not have arrived yet, even though it will later.<br><br></div></blockquote><div><br></div></div><div>Yes, that's definitely true, and we were aware of that problem. However, since we know we cannot wait for the 'EXIT' forever, and whatever arbitrary timeout we choose is just someone else's race condition, we decided that if the EXIT signal wasn't delivered promptly to the process's mailbox, losing the real exit reason was something we could live with in the worst case.</div>
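<div><br></div><div>The trade-off described above can be sketched roughly as follows. This is an illustrative sketch only, not the supervisor2 patch or the OTP monitor_child/1 code: the module name, the function name and the 'unknown' return value are hypothetical, and it assumes the caller traps exits and was linked to the child.</div>

```erlang
-module(exit_after_down).
-export([await_termination/1]).

%% Hypothetical helper for illustration. Assumes the calling process
%% has process_flag(trap_exit, true) set and is (or was) linked to Pid.
await_termination(Pid) ->
    MRef = erlang:monitor(process, Pid),
    unlink(Pid),
    %% No timeout here: the runtime always delivers a 'DOWN' for a
    %% monitor, even if Pid was already dead when it was created.
    receive
        {'DOWN', MRef, process, Pid, noproc} ->
            %% Pid was gone before the monitor existed, so the real
            %% reason, if any, is only in an 'EXIT' message.
            %% Check the mailbox once and never block on it.
            receive
                {'EXIT', Pid, Reason} -> {exited, Reason}
            after 0 ->
                %% Either the child unlinked before dying, or (per the
                %% race discussed above) the 'EXIT' is still in flight.
                %% The real exit reason is lost in this case.
                {exited, unknown}
            end;
        {'DOWN', MRef, process, Pid, Reason} ->
            %% Discard a matching 'EXIT' if one is already queued.
            receive {'EXIT', Pid, _} -> ok after 0 -> ok end,
            {exited, Reason}
    end.
```

<div>The "after 0" clause is the whole compromise: it picks up an 'EXIT' that is already in the mailbox but accepts losing one that shows up later, rather than choosing an arbitrary timeout.</div>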
<div><br></div><div>Since we've started merging the R15/R16 changes in though, that code has disappeared so we're in the same boat as you guys. :)</div><div><br></div><div>Cheers,</div><div>Tim</div><div><br></div>
</div></div></div></blockquote></div><br></div>