[erlang-questions] can nodes fail/recover too fast to be seen?

Fri Jul 5 19:33:23 CEST 2013

On 5 Jul 2013, at 18:00, Gleb Peregud wrote:

> If it was an intermittent network issue, TCP can mask the problem and
> Erlang would never know about it. And I believe Erlang depends on TCP
> and explicit pings to detect dead nodes. But if the failed remote node has
> been restarted in the meantime, Erlang will detect it as, IIRC, it
> maintains some kind of "node version" in its distribution protocol
> state.
> 

That's correct, so issues can arise if the node goes away and comes back within net_ticktime (60 seconds by default) and TCP masks the fact. You can also run into problems with pid re-use this way.

But perhaps more important than this: if a node goes away and the runtime does notice, there's no guarantee that you'll see the 'DOWN' (or nodedown) message in any particular order with regard to other messages flying around the system. This can make it difficult to identify which messages might need re-sending.
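
To make the re-sending problem a bit more concrete, here is a minimal sketch of one common way to cope with it: tag every request with the monitor reference and treat any request that has no matching reply when the 'DOWN' arrives as "outcome unknown". The module name tracked_call, the call/3 and demo/0 functions and the {request, ...} / {reply, ...} message shapes are made up for illustration; only erlang:monitor/2, erlang:demonitor/2 and the 'DOWN' message format come from OTP.

```erlang
-module(tracked_call).
-export([call/3, demo/0]).

%% Send Request to Pid and wait for a reply tagged with our monitor reference.
%% Returns {ok, Reply} if a reply arrives, or {unknown, Why} if the target
%% went down (or the timeout fired) before we saw one.
call(Pid, Request, Timeout) ->
    Ref = erlang:monitor(process, Pid),
    Pid ! {request, self(), Ref, Request},
    receive
        {reply, Ref, Reply} ->
            %% Drop the monitor and flush any 'DOWN' already in the mailbox.
            erlang:demonitor(Ref, [flush]),
            {ok, Reply};
        {'DOWN', Ref, process, Pid, Reason} ->
            %% No reply seen: the request may or may not have been handled.
            {unknown, Reason}
    after Timeout ->
        erlang:demonitor(Ref, [flush]),
        {unknown, timeout}
    end.

%% Hypothetical server used only by demo/0: echoes each request back.
echo_loop() ->
    receive
        {request, From, Ref, Msg} ->
            From ! {reply, Ref, Msg},
            echo_loop();
        stop ->
            ok
    end.

%% Local demonstration; no distribution is needed to run it.
demo() ->
    Echo = spawn(fun echo_loop/0),
    {ok, hello} = call(Echo, hello, 1000),
    Echo ! stop,
    %% The echo process exits on 'stop', so this call ends in a 'DOWN'.
    {unknown, _Reason} = call(Echo, hello, 1000),
    ok.
```

When the target lives on another node and the connection is lost, the reason in the 'DOWN' message is noconnection. The {unknown, ...} result is deliberately not called an error: the caller cannot tell whether the request was processed, so blindly re-sending is only safe if the request is idempotent or the receiver de-duplicates on some request id.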

There are a couple of papers worth reading in this space:

[1] Programming Distributed Erlang Applications: Pitfalls and Recipes (Hans Svensson, Lars-Åke Fredlund)
[2] A Unified Semantics for Future Erlang (Hans Svensson, Lars-Åke Fredlund, Clara Benac Earle)

Cheers,
Tim