[erlang-questions] can nodes fail/recover too fast to be seen?
Fri Jul 5 21:21:06 CEST 2013
On 5 Jul 2013, at 20:10, Jonathan Leivent wrote:
> On 07/05/2013 01:33 PM, Tim Watson wrote:
>> On 5 Jul 2013, at 18:00, Gleb Peregud wrote:
>>> If it was an intermittent network issue, TCP can mask the problem and
>>> Erlang would never know about it. And I believe Erlang depends on TCP
>>> and explicit pings to detect dead nodes. But if the failed remote node
>>> has been restarted in the meantime, Erlang will detect it as, IIRC, it
>>> maintains some kind of "node version" in its distribution protocol.
>> That's correct, so issues can arise if the node goes away and comes back within net_ticktime and TCP masks the fact. You can also run into problems with pid re-use in this way.
>> But perhaps more importantly, if a node goes away and the runtime does notice, there's no guarantee that you'll see the 'DOWN' (or nodedown) message in any particular order with regard to other communications flying around the system. This can make it difficult to identify which messages might need re-sending.
> Is this even true for net_kernel:monitor_nodes? The doc suggests that there is an ordering between messages and nodeup/nodedown notifications, at least in later releases.
That ordering guarantee only applies to the process that called monitor_nodes, and afaict it won't apply unless you're using (what the runtime deems to be) a new connection. If you get hit by the situation where the runtime doesn't notice the node has gone down, that's a different matter - as Gleb mentioned, TCP can mask a disconnect/reconnect and net_ticktime isn't instantaneous (for obvious reasons).
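A minimal sketch of the subscription being discussed, assuming a distributed node (started with -name or -sname). The module name node_watch and the start/0 helper are hypothetical and used only for illustration. It calls net_kernel:monitor_nodes/2 from inside the process that handles the notifications, since the subscription belongs to the calling process:

```erlang
-module(node_watch).
-export([start/0]).

%% Hypothetical helper: spawn a watcher that logs node transitions.
start() ->
    spawn(fun() ->
        %% Subscribe from inside the watcher itself; notifications are
        %% delivered to the process that makes this call.
        net_kernel:monitor_nodes(true, [nodedown_reason]),
        loop()
    end).

loop() ->
    receive
        %% With an option list, notifications arrive as 3-tuples.
        {nodeup, Node, _Info} ->
            io:format("nodeup ~p~n", [Node]),
            loop();
        {nodedown, Node, Info} ->
            Reason = proplists:get_value(nodedown_reason, Info),
            io:format("nodedown ~p (~p)~n", [Node, Reason]),
            loop()
    end.
```

Note that this only reports transitions the runtime itself observes. A disconnect/reconnect masked by TCP within net_ticktime produces no message here at all.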
>> There are a couple of papers worth reading in this space:
>>  Programming Distributed Erlang Applications: Pitfalls and Recipes (Hans Svensson, Lars-Åke Fredlund)
>>  A Unified Semantics for Future Erlang (Hans Svensson, Lars-Åke Fredlund, Clara Benac Earle)
> Thanks - I just skimmed the first paper. I don't have access to the second.
You'll need an ACM (or similar) account. The first one's more relevant anyway.