[erlang-questions] can nodes fail/recover too fast to be seen?

Fri Jul 5 22:03:13 CEST 2013

>>>>> In Erlang, is it possible for a monitored node to fail and recover so quickly that nodes monitoring it won't detect the failure?
>
> No. The TCP connection to the old node instance cannot be used for
> communication with the new node instance, i.e. there is no way that
> communication with the new node instance can be established without the
> local VM generating nodedown/'DOWN'/exit messages for the old instance.

OK - so it is TCP that saves the day.  I'm not sure I want to be that 
dependent on TCP, though.  But, for now, it's a convenient solution.

>
>>>>>   Or, is there some kind of internal persistent state that prevents this?
>
> This is where it potentially gets interesting - i.e. assuming *no*
> monitoring or linking - and that's where the "creation" part of a node
> identifier comes into play. If a distributed node restarts, it will get
> a new "creation" value courtesy of epmd, and any pid() values
> referring to the old node instance will be invalid.
>
> However the "creation" is only 2 bits, so if a node restarts frequently,
> old pid() values may become "valid" once again, i.e. referring to a new,
> different process. Is it a problem in practice, outside academic
> research papers? I've never heard of such a case. If you really have
> this problem, you should probably look at why the node keeps
> restarting...
>
> --Per Hedeland
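
To make the wrap-around concrete, here is a toy model of a 2-bit "creation" counter. It is only a sketch: the module name creation_wrap and the exact cycle (1, 2, 3, then back to 1) are assumptions for illustration, not something confirmed in this thread. The point is just that a value this small repeats after a handful of restarts.

```erlang
-module(creation_wrap).
-export([next/1, after_restarts/2]).

%% Assumed model (hypothetical): the creation cycles 1 -> 2 -> 3 -> 1.
next(Creation) when Creation >= 1, Creation =< 3 ->
    (Creation rem 3) + 1.

%% Creation value seen after N restarts, starting from Creation.
after_restarts(Creation, 0) ->
    Creation;
after_restarts(Creation, N) when N > 0 ->
    after_restarts(next(Creation), N - 1).
```

Under that assumption, creation_wrap:after_restarts(1, 3) gives 1 again, so a pid() saved before three quick restarts would carry the same creation as the current node instance.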

That's interesting.  I'll make sure I monitor a node before relying on 
pid values from that node.

Thanks,
Jonathan