[erlang-questions] monitor and failure detectors

Johan Montelius johanmon@REDACTED
Mon Jun 7 10:09:14 CEST 2010


A DOWN message does not give you very much information about what happened,  
but it might be that the absence of a DOWN message gives you a lot of  
information.

I would like it if the semantics of Erlang were such that:

  If a sequence of messages is potentially lost from A to B and B is  
monitoring A, then B will receive a DOWN-noconnection message. The DOWN  
message is delivered to B after the last message that was reliably  
delivered and before any message sent after the potentially lost sequence  
of messages.

If this is the case then it does mean a lot. As long as B does not  
receive a DOWN-noconnection message it can live in a world where messages  
are reliably delivered from A to B.
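
To make this concrete, here is a minimal sketch of how B could be written if the semantics above held. The module name, the function names, the {Sender, Msg} message shape and the maybe_lost / exited tags are all made up for illustration; only erlang:spawn_monitor/1 and the {'DOWN', Ref, process, Pid, Reason} message format are standard. Whether a noconnection DOWN really arrives at the position described above is exactly the open question, so the sketch assumes it rather than demonstrates it.

```erlang
-module(monitor_sketch).
-export([collect/2, demo/0]).

%% Gather every {Sender, Msg} message until the monitor identified by
%% Ref fires. Returns {Status, Messages}, with Messages in arrival order.
collect(Sender, Ref) ->
    loop(Sender, Ref, []).

loop(Sender, Ref, Acc) ->
    receive
        {Sender, Msg} ->
            loop(Sender, Ref, [Msg | Acc]);
        {'DOWN', Ref, process, Sender, noconnection} ->
            %% The connection to the sender's node was lost. Under the
            %% ASSUMED semantics, everything in Acc was reliably
            %% delivered and later messages may be missing.
            {maybe_lost, lists:reverse(Acc)};
        {'DOWN', Ref, process, Sender, Reason} ->
            %% The sender itself terminated (or never existed).
            {{exited, Reason}, lists:reverse(Acc)}
    end.

%% Local demonstration only: the sender runs on the same node, so the
%% noconnection clause is never reached here.
demo() ->
    Parent = self(),
    {Sender, Ref} = spawn_monitor(fun() ->
        [Parent ! {self(), N} || N <- [1, 2, 3]]
    end),
    collect(Sender, Ref).
```

On a single node, monitor_sketch:demo() returns {{exited, normal}, [1, 2, 3]}, because signals between one pair of processes arrive in the order they were sent. The question is whether the noconnection case gives B the same ordering.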

Is this the semantics of Erlang today?


  Johan






On Mon, 07 Jun 2010 09:10:56 +0200, Jayson Vantuyl <kagato@REDACTED>  
wrote:

> In any case, there is a race condition between when the connection dies  
> and the message is handled.  Reading too much into the message is  
> trouble.
>
> It's what I like to call a "volatile" message.  It means only that a  
> failure detector kicked in, and it doesn't actually contribute much  
> information other than that a delay of a certain time existed at some  
> point.  As such, it doesn't really mean that the node is 'down' or not.   
> It conveys no solid information about the node being down unless the  
> node explicitly said that it went down (which I don't believe that it  
> does anyways).
>
> I'm not sure what difference a 'DOWN' message would make if it were to  
> come after a timeout.  It certainly wouldn't have much more of an effect  
> than just increasing the tick time via net_kernel by the same amount.
>
> On Jun 6, 2010, at 10:23 PM, Scott Lystig Fritchie wrote:
>
>> Johan Montelius <johanmon@REDACTED> wrote:
>>
>> jw> Ok, so it's a global parameter for all monitors. From an application
>> jw> level point of view it could be an advantage to have this as a per
>> jw> monitor value.
>>
>> Hrm, I don't know if that's really feasible.  It's the net_kernel's job
>> to keep track of inter-node timeouts.  The timeout used is the same for
>> all connections to all other nodes.  (You can have lots of "fun" with
>> asymmetric timeout behavior by using different kernel net_ticktime
>> values for different nodes, for very small values of "fun".)
>>
>> Assuming that you could have different timeout values between nodes,
>> once a net_kernel connection between two nodes is interrupted, the net
>> kernel (with help from the VM, IIRC) will immediately deliver all
>> monitor DOWN events.
>>
>> Delaying the delivery of those {'DOWN', ...} events doesn't seem to me to
>> have much useful value.  If the TCP(*) connection between node A & B is
>> broken, but you delay {'DOWN', ...} events from being delivered on A
>> ... then some process on node A could happily assume that it could send
>> messages to B when it almost certainly cannot.
>>
>> If "delay of DOWN event delivery" means being more cautious about
>> whether or not a network partition has happened or if the remote node
>> really did crash, you can't get that distinction from the net_kernel.
>> You have to roll your own ... and prompt delivery of DOWN events is
>> likely your best choice in that case, also.
>>
>> -Scott
>>
>> (*) Unless you're using a custom distribution protocol such as SCTP
>> (I've never used it but it's alleged to exist?), it's either an
>> unencrypted TCP connection or an SSL-encrypted TCP connection.
>>
>


-- 
Associate Professor Johan Montelius
Royal Institute of Technology - KTH
School of Information and Communication Technology - ICT

