<html><head></head><body style="word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space; "><span class="Apple-style-span" style="font-family: monospace; ">On 5 Jul 2013, at 20:10, Jonathan Leivent wrote:</span><span class="Apple-style-span" style="font-family: monospace; "><br></span><span class="Apple-style-span" style="font-family: monospace; "><br></span><blockquote type="cite" style="font-family: monospace; ">On 07/05/2013 01:33 PM, Tim Watson wrote:<br></blockquote><blockquote type="cite" style="font-family: monospace; "><blockquote type="cite">On 5 Jul 2013, at 18:00, Gleb Peregud wrote:<br></blockquote></blockquote><blockquote type="cite" style="font-family: monospace; "><blockquote type="cite"><br></blockquote></blockquote><blockquote type="cite" style="font-family: monospace; "><blockquote type="cite"><blockquote type="cite">If it was an intermittent network issue, TCP can mask the problem and<br></blockquote></blockquote></blockquote><blockquote type="cite" style="font-family: monospace; "><blockquote type="cite"><blockquote type="cite">Erlang would never know about it. And I believe Erlang depends on TCP<br></blockquote></blockquote></blockquote><blockquote type="cite" style="font-family: monospace; "><blockquote type="cite"><blockquote type="cite">and explicit pings to detect dead nodes. 
But if the failed remote node has<br></blockquote></blockquote></blockquote><blockquote type="cite" style="font-family: monospace; "><blockquote type="cite"><blockquote type="cite">been restarted in the meantime, Erlang will detect it as, IIRC, it<br></blockquote></blockquote></blockquote><blockquote type="cite" style="font-family: monospace; "><blockquote type="cite"><blockquote type="cite">maintains some kind of "node version" in its distribution protocol<br></blockquote></blockquote></blockquote><blockquote type="cite" style="font-family: monospace; "><blockquote type="cite"><blockquote type="cite">state.<br></blockquote></blockquote></blockquote><blockquote type="cite" style="font-family: monospace; "><blockquote type="cite"><blockquote type="cite"><br></blockquote></blockquote></blockquote><blockquote type="cite" style="font-family: monospace; "><blockquote type="cite"><br></blockquote></blockquote><blockquote type="cite" style="font-family: monospace; "><blockquote type="cite">That's correct, so issues can arise if the node goes away and comes back within net_ticktime and TCP masks the fact. You can also run into problems with pid re-use in this way.<br></blockquote></blockquote><blockquote type="cite" style="font-family: monospace; "><blockquote type="cite"><br></blockquote></blockquote><blockquote type="cite" style="font-family: monospace; "><blockquote type="cite">But perhaps more important than this, if a node goes away and the runtime does notice, there's no guarantee that you'll see the 'DOWN' (or nodedown) message in any particular order with regard to other communications flying around the system. This can make it difficult to identify which messages might potentially need re-sending.<br></blockquote></blockquote><blockquote type="cite" style="font-family: monospace; "><br></blockquote><blockquote type="cite" style="font-family: monospace; ">Is this even true for net_kernel:monitor_nodes?  
The doc suggests that there is an ordering between messages and nodeup/nodedown notifications, at least in later releases.<br></blockquote><blockquote type="cite" style="font-family: monospace; "><br></blockquote><span class="Apple-style-span" style="font-family: monospace; "><br></span><span class="Apple-style-span" style="font-family: monospace; ">That ordering guarantee applies only to the process that called monitor_nodes, and as far as I can tell it won't apply unless you're using (what the runtime deems to be) a new connection. If you get hit by the situation where the runtime doesn't notice the node has gone down, that's a different matter - as Gleb mentioned, TCP can mask a disconnect/reconnect and net_ticktime isn't instantaneous (for obvious reasons).</span><span class="Apple-style-span" style="font-family: monospace; "><br></span><span class="Apple-style-span" style="font-family: monospace; "><br></span><blockquote type="cite" style="font-family: monospace; "><blockquote type="cite">There are a couple of papers worth reading in this space:<br></blockquote></blockquote><blockquote type="cite" style="font-family: monospace; "><blockquote type="cite"><br></blockquote></blockquote><blockquote type="cite" style="font-family: monospace; "><blockquote type="cite">[1] Programming Distributed Erlang Applications: Pitfalls and Recipes (Hans Svensson, Lars-Åke Fredlund)<br></blockquote></blockquote><blockquote type="cite" style="font-family: monospace; "><blockquote type="cite">[2] A Unified Semantics for Future Erlang (Hans Svensson, Lars-Åke Fredlund, Clara Benac Earle)<br></blockquote></blockquote><blockquote type="cite" style="font-family: monospace; "><blockquote type="cite"><br></blockquote></blockquote><blockquote type="cite" style="font-family: monospace; "><br></blockquote><blockquote type="cite" style="font-family: monospace; ">Thanks - I just skimmed the first paper.  
I don't have access to the second.<br></blockquote><span class="Apple-style-span" style="font-family: monospace; "><br></span><span class="Apple-style-span" style="font-family: monospace; ">You'll need an ACM (or similar) account. The first one's more relevant anyway.</span><span class="Apple-style-span" style="font-family: monospace; "><br></span><span class="Apple-style-span" style="font-family: monospace; "><br></span><span class="Apple-style-span" style="font-family: monospace; ">Cheers,</span><span class="Apple-style-span" style="font-family: monospace; "><br></span><span class="Apple-style-span" style="font-family: monospace; ">Tim</span></body></html>