[erlang-questions] can nodes fail/recover too fast to be seen?

Sun Jul 7 16:44:11 CEST 2013

On 5 Jul 2013, at 21:07, Per Hedeland wrote:
> Tim Watson <watson.timothy@REDACTED> wrote:
>> 
>> On 5 Jul 2013, at 20:23, Per Hedeland wrote:
>>>>> On Jul 5, 2013, at 5:22 PM, Tim Watson wrote:
>>>>> 
>>>>>> As I understand it, this can and does happen, because Erlang does automatic reconnect in order to provide reliable communications.
>>> 
>>> No.
>> 
>> So is the Svensson and Fredlund paper (viz [2] from my earlier post) incorrect in its assertion that messages between nodes can be dropped in the face of rapid node reconnects?
> 
> No, of course they can be dropped - if the destination node goes down,
> it's impossible to know whether a given, sent message was a) received at
> the remote *host*, b) received by the remote Erlang node, c) received by
> the remote Erlang process, d) processed by the remote Erlang process, or
> e) none of the above. But if you monitor/link (e.g. use
> gen_server:call()), you will know that "badness happened", and can take
> corrective action.

Right, that makes sense. The bit about "signal loss" that I was missing is that the monitor (or link) notification *is* guaranteed to be delivered, even when a node disconnects and reconnects quickly.

> "Re-sending only messages that need to be re-sent" is
> not possible in general, and this is not specific to Erlang distribution.
> 

That's not what I was talking about, since we know sending is meant to be asynchronous and never fail - I was thinking about both ends, not just sender or receiver. The point I was making is that even if monitor_nodes can guarantee that you'll see nodedown before any other traffic over a new connection, you can only guarantee that for a process that actually called monitor_nodes (or is monitoring the remote process). If you rely on some *other* process to do the monitoring, then there's no ordering guarantee to be had between its notification and the messages you receive. In other words (as the paper puts it) you have to monitor each interaction to be sure. And this holds, more importantly, for receipt as well. So if you aren't monitoring a remote process (or node) and you're holding on to a remote pid, then you've got to have your wits about you.
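
To make "monitor each interaction" concrete, here is a minimal sketch of the kind of thing I mean. It is roughly the pattern gen_server:call() follows, but the module name and the {request, ...} / {reply, ...} message shapes are made up for illustration, so treat it as a sketch under those assumptions rather than anything definitive:

```erlang
-module(monitored_call).
-export([call/3]).

%% Hypothetical protocol (an assumption, not from any library):
%%   client -> server : {request, From, MRef, Request}
%%   server -> client : {reply, MRef, Reply}
%%
%% The caller itself monitors Pid for the duration of this one
%% interaction, so the 'DOWN' message and the reply arrive in the
%% same mailbox and are matched against the same reference.
call(Pid, Request, Timeout) ->
    MRef = erlang:monitor(process, Pid),
    Pid ! {request, self(), MRef, Request},
    receive
        {reply, MRef, Reply} ->
            %% Got an answer; drop the monitor and flush any
            %% 'DOWN' message that may already be queued for it.
            erlang:demonitor(MRef, [flush]),
            {ok, Reply};
        {'DOWN', MRef, process, Pid, Reason} ->
            %% Reason is noconnection if the connection to the
            %% remote node was lost, noproc if Pid was already dead.
            {error, Reason}
    after Timeout ->
        erlang:demonitor(MRef, [flush]),
        {error, timeout}
    end.
```

The point of tagging the reply with MRef is that a late reply from before the disconnect can't be mistaken for the answer to a later call, since every call gets a fresh reference.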

Anyway, it's good to know that the monitoring guarantees are absolute.

Cheers,
Tim