[erlang-questions] can nodes fail/recover too fast to be seen?
Fri Jul 5 22:07:05 CEST 2013
Tim Watson <watson.timothy@REDACTED> wrote:
>On 5 Jul 2013, at 20:23, Per Hedeland wrote:
>>>> On Jul 5, 2013, at 5:22 PM, Tim Watson wrote:
>>>>> As I understand it, this can and does happen, because Erlang does automatic reconnect in order to provide reliable communications.
>So is the Svensson and Fredlund paper (viz  from my earlier post) incorrect in its assertion that messages between nodes can be dropped in the face of rapid node reconnects?
No, of course they can be dropped - if the destination node goes down,
it's impossible to know whether a given, sent message was a) received at
the remote *host*, b) received by the remote Erlang node, c) received by
the remote Erlang process, d) processed by the remote Erlang process, or
e) none of the above. But if you monitor/link (e.g. use
gen_server:call()), you will know that "badness happened", and can take
corrective action. "Re-sending only messages that need to be re-sent" is
not possible in general, and this is not specific to Erlang distribution.
See also http://www.erlang.org/faq/academic.html#id58000, which Matthias
Lang was kind enough to write up in a nice form based on some ramblings
of mine in the distant past. It could probably use s/link/monitor/, but
the general principle holds.
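
To make the monitor point concrete, here is a minimal sketch of a monitored request/reply. The module name safe_call and the {request, ...} / {reply, ...} message shapes are made up for illustration; gen_server:call() follows roughly the same pattern, but this is not its actual code.

```erlang
-module(safe_call).
-export([call/3]).

%% Send Request to Pid and wait for a reply. The monitor guarantees that
%% we hear about it if the process dies or the connection to its node is
%% lost, instead of waiting forever for a reply that never comes.
%% The {request, ...} / {reply, ...} protocol is hypothetical.
call(Pid, Request, Timeout) ->
    Ref = erlang:monitor(process, Pid),
    Pid ! {request, self(), Ref, Request},
    receive
        {reply, Ref, Reply} ->
            erlang:demonitor(Ref, [flush]),
            {ok, Reply};
        {'DOWN', Ref, process, _Pid, Reason} ->
            %% Reason is noconnection when the node connection was lost.
            %% We know "badness happened", but not whether Request was
            %% received or processed before the failure.
            {error, {down, Reason}}
    after Timeout ->
        erlang:demonitor(Ref, [flush]),
        {error, timeout}
    end.
```

On {error, {down, _}} the caller has to decide for itself whether re-sending is safe, e.g. by making the request idempotent.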
>>>>>> In Erlang, is it possible for a monitored node to fail and recover so quickly that nodes monitoring it won't detect the failure?
>> No. The TCP connection to the old node instance cannot be used for
>> communication with the new node instance, i.e. there is no way that
>> communication with the new node instance can be established without the
>> local VM generating nodedown/'DOWN'/exit messages for the old instance.
>Just out of interest, is this enforced by epmd or internally?
epmd has no role in inter-node communication once the connection has
been established. TCP enforces "cannot be used for ...". The
VM/net_kernel will not make a new connection until it has decided that
the old one isn't working any more, and at that point it will generate
the nodedown/'DOWN'/exit messages.
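
As an illustration of the node-level notification, a small sketch using erlang:monitor_node/2, which delivers a {nodedown, Node} message when the connection to Node is lost. The module and function names (node_watch, watch/2) and the return atoms are invented for the example.

```erlang
-module(node_watch).
-export([watch/2]).

%% Wait up to Timeout ms for the connection to Node to go down.
%% Returns nodedown, still_up, or {error, not_distributed} when the
%% local node was not started as a distributed node.
watch(Node, Timeout) ->
    case is_alive() of
        false ->
            {error, not_distributed};
        true ->
            true = erlang:monitor_node(Node, true),
            receive
                {nodedown, Node} ->
                    nodedown
            after Timeout ->
                erlang:monitor_node(Node, false),
                %% Pick up a nodedown that raced with the timeout.
                receive
                    {nodedown, Node} -> nodedown
                after 0 ->
                    still_up
                end
            end
    end.
```

If Node restarts while watch/2 is waiting, the caller still gets nodedown for the old instance before any communication with the new one can take place.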
>>>>>> Or, is there some kind of internal persistent state that prevents this?
>> This is where it potentially gets interesting - i.e. assuming *no*
>> monitoring or linking - and that's where the "creation" part of a node
>> identifier comes into play. If a distributed node restarts, it will get
>> a new "creation" value courtesy of epmd, and any pid() values
>> referring to the old node instance will be invalid.
>Does this depend on epmd having stayed up and running the whole time, or does epmd now have some local persistent state?
Good point - it depends on epmd having stayed up and running, i.e. if
the *host* reboots, there is a 25% possibility of the new node instance
getting the same "creation" value. However, see the FAQ again - if your
communication is critical, you can't depend on "creation" - it won't
tell you about failures anyway. It's just a way to try to prevent
messages from being delivered to the wrong process.