[erlang-questions] monitor and failure detectors
Johan Montelius
johanmon@REDACTED
Mon Jun 7 10:09:14 CEST 2010
A DOWN message does not tell you very much about what happened, but the
absence of a DOWN message might tell you a lot.
I would like the semantics of Erlang to be such that:
If a sequence of messages from A to B is potentially lost and B is
monitoring A, then B will receive a DOWN-noconnection message. The DOWN
message is delivered to B after the last message that was reliably
delivered and before any message sent after the potentially lost sequence
of messages.
If this is the case then it does mean a lot. As long as B does not
receive a DOWN-noconnection message, it can live in a world where messages
are reliably delivered from A to B.
Is this the semantics of Erlang today?
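To make the question concrete, here is a minimal sketch of B's side. The
module name monitor_sketch and the {msg, Payload} message shape are
assumptions made up for illustration. The sketch only shows what B would
rely on if the ordering described above holds; it does not by itself show
that Erlang guarantees it.

```erlang
-module(monitor_sketch).
-export([watch/1, collect/2]).

%% Monitor Pid (process A) and collect its messages until a 'DOWN' arrives.
%% Returns {Delivered, Reason}: the payloads received before the 'DOWN'
%% message, in arrival order, and the reason carried by the 'DOWN' message.
watch(Pid) ->
    Ref = erlang:monitor(process, Pid),
    collect(Ref, []).

%% Hypothetical protocol: A sends {msg, Payload} tuples to B.
collect(Ref, Acc) ->
    receive
        {msg, Payload} ->
            collect(Ref, [Payload | Acc]);
        {'DOWN', Ref, process, _Pid, Reason} ->
            %% Reason is noconnection when the connection to A's node
            %% was lost; under the proposed semantics, everything in Acc
            %% would then have been delivered without gaps.
            {lists:reverse(Acc), Reason}
    end.
```

If B gets back {Delivered, noconnection}, the question is whether it may
treat Delivered as a gap-free prefix of what A sent.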
Johan
On Mon, 07 Jun 2010 09:10:56 +0200, Jayson Vantuyl <kagato@REDACTED>
wrote:
> In any case, there is a race condition between when the connection dies
> and the message is handled. Reading too much into the message is
> trouble.
>
> It's what I like to call a "volatile" message. It means only that a
> failure detector kicked in, and it doesn't actually contribute much
> information other than that a delay of a certain time existed at some
> point. As such, it doesn't really mean that the node is 'down' or not.
> It conveys no solid information about the node being down unless the
> node explicitly said that it went down (which I don't believe that it
> does anyways).
>
> I'm not sure what difference a 'DOWN' message would make if it were to
> come after a timeout. It certainly wouldn't have much more of an effect
> than just increasing the tick time via net_kernel by the same amount.
>
> On Jun 6, 2010, at 10:23 PM, Scott Lystig Fritchie wrote:
>
>> Johan Montelius <johanmon@REDACTED> wrote:
>>
>> jw> Ok, so it's a global parameter for all monitors. From an application
>> jw> level point of view it could be an advantage to have this as a
>> jw> per-monitor value.
>>
>> Hrm, I don't know if that's really feasible. It's the net_kernel's job
>> to keep track of inter-node timeouts. The timeout used is the same for
>> all connections to all other nodes. (You can have lots of "fun" with
>> asymmetric timeout behavior by using different kernel net_ticktime
>> values for different nodes, for very small values of "fun".)
>>
>> Assuming that you could have different timeout values between nodes,
>> once a net_kernel connection between two nodes is interrupted, the net
>> kernel (with help from the VM, IIRC) will immediately deliver all
>> monitor DOWN events.
>>
>> Delaying the delivery of those {'DOWN', ...} events doesn't seem to me
>> to have much useful value. If the TCP(*) connection between node A & B
>> is broken, but you delay {'DOWN', ...} events from being delivered on A
>> ... then some process on node A could happily assume that it could send
>> messages to B when it almost certainly cannot.
>>
>> If "delay of DOWN event delivery" means being more cautious about
>> whether or not a network partition has happened or if the remote node
>> really did crash, you can't get that distinction from the net_kernel.
>> You have to roll your own ... and prompt delivery of DOWN events is
>> likely your best choice in that case, also.
>>
>> -Scott
>>
>> (*) Unless you're using a custom distribution protocol such as SCTP
>> (I've never used it but it's alleged to exist?), it's either an
>> unencrypted TCP connection or an SSL-encrypted TCP connection.
>>
>
--
Associate Professor Johan Montelius
Royal Institute of Technology - KTH
School of Information and Communication Technology - ICT