[erlang-questions] can nodes fail/recover too fast to be seen?

Fri Jul 5 21:33:45 CEST 2013

Thanks for the clarifications, Per - that's cleared up a few things I was unaware of.

On 5 Jul 2013, at 20:23, Per Hedeland wrote:
>>> On Jul 5, 2013, at 5:22 PM, Tim Watson wrote:
>>> 
>>>> As I understand it, this can and does happen, because Erlang does automatic reconnect in order to provide reliable communications.
> 
> No.

So is the Svensson and Fredlund paper ([2] in my earlier post) incorrect in asserting that messages between nodes can be dropped in the face of rapid node reconnects?

>>>>> In Erlang, is it possible for a monitored node to fail and recover so quickly that nodes monitoring it won't detect the failure?
> 
> No. The TCP connection to the old node instance cannot be used for
> communication with the new node instance, i.e. there is no way that
> communication with the new node instance can be established without the
> local VM generating nodedown/'DOWN'/exit messages for the old instance.
> 

Just out of interest, is this enforced by epmd or internally? Also, it would be worth making this explicit in the documentation somewhere, since this question comes up frequently.
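
For anyone finding this thread later, here is a minimal sketch of the monitoring side as I understand Per's description. The module and function names (node_watch, watch/1, wait_down/1) are made up for illustration, and it assumes the local node is itself distributed (started with -sname or -name); otherwise erlang:monitor_node/2 raises an error.

```erlang
-module(node_watch).
-export([watch/1, wait_down/1]).

%% Hypothetical helper: ask the local VM to notify this process when
%% the connection to Node is lost, then block until that happens.
%% Assumes the local node is alive (distributed).
watch(Node) ->
    true = erlang:monitor_node(Node, true),
    wait_down(Node).

%% Block until a {nodedown, Node} message for this node arrives.
%% The message is also delivered if no connection to Node could be
%% established in the first place.
wait_down(Node) ->
    receive
        {nodedown, Node} ->
            down
    end.
```

If Per's explanation holds, a process sitting in node_watch:watch('b@host') should always get {nodedown, 'b@host'} for the old instance of b@host before anything can be exchanged with a restarted instance, however quickly the restart happens.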

>>>>> Or, is there some kind of internal persistent state that prevents this?
> 
> This is where it potentially gets interesting - i.e. assuming *no*
> monitoring or linking - and that's where the "creation" part of a node
> identifier comes into play. If a distributed node restarts, it will get
> a new "creation" value courtesy of epmd, and any pid() values
> referring to the old node instance will be invalid.
> 

Does this depend on epmd having stayed up and running the whole time, or does epmd now have some local persistent state?

Cheers,
Tim