[erlang-questions] monitor and failure detectors

Mon Jun 7 09:10:56 CEST 2010

In any case, there is a race between the moment the connection dies and the moment the message is handled.  Reading too much into the message is asking for trouble.

It's what I like to call a "volatile" message.  It means only that a failure detector kicked in, and it tells you little beyond the fact that a delay of a certain length existed at some point.  As such, it doesn't really tell you whether the node is 'down' or not.  It conveys no solid information about the node being down unless the node explicitly said that it was going down (which I don't believe it does anyway).
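
To make the point concrete, here is a minimal sketch of a receiver that treats the message as a hint rather than a verdict.  It relies on the fact that a monitor fires with reason `noconnection` when the link to the remote node is lost.  The module name `down_hint`, the functions `classify/1` and `watch/2`, and the labels `suspected`, `not_running` and `exited` are all made up for this illustration.

```erlang
-module(down_hint).
-export([classify/1, watch/2]).

%% Hypothetical helper: turn a monitor message into a cautious verdict.
%% `noconnection` only says that the failure detector fired, so the
%% remote process is merely suspected.  Any other reason was reported
%% by the runtime that actually hosted (or looked up) the process.
classify({'DOWN', _Ref, process, _Pid, noconnection}) ->
    suspected;
classify({'DOWN', _Ref, process, _Pid, noproc}) ->
    not_running;
classify({'DOWN', _Ref, process, _Pid, Reason}) ->
    {exited, Reason}.

%% Monitor Pid and wait up to Timeout milliseconds for its 'DOWN'
%% message.  Returns `no_event` if nothing arrives in time.
watch(Pid, Timeout) ->
    Ref = erlang:monitor(process, Pid),
    receive
        {'DOWN', Ref, process, _, _} = Msg ->
            classify(Msg)
    after Timeout ->
        %% Drop the monitor and flush a 'DOWN' that may have raced in.
        erlang:demonitor(Ref, [flush]),
        no_event
    end.
```

A caller that gets `suspected` back knows only that the connection timed out or broke at some point.  Whether the remote node crashed, was partitioned, or was just slow has to be decided by other means.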

I'm not sure what difference a 'DOWN' message would make if it were to come after a timeout.  It certainly wouldn't have much more of an effect than just increasing the tick time via net_kernel by the same amount.
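
For reference, the tick time is the kernel application's `net_ticktime` parameter, which defaults to 60 seconds.  A minimal sketch of raising it in a `sys.config` file follows; the value 120 is picked purely as an example, not as a recommendation.

```erlang
%% sys.config (example value only): raise the distribution tick time
%% from the default of 60 seconds to 120 seconds.
[
 {kernel, [
   {net_ticktime, 120}
 ]}
].
```

The same value can also be changed on a running node with `net_kernel:set_net_ticktime/1`.  An unresponsive node is detected somewhere between 0.75 and 1.25 times the tick time, and the value should be the same on every node in the cluster.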

On Jun 6, 2010, at 10:23 PM, Scott Lystig Fritchie wrote:

> Johan Montelius <johanmon@REDACTED> wrote:
> 
> jw> Ok, so it's a global parameter for all monitors. From an application
> jw> level point of view it could be an advantage to have this as a per
> jw> monitor value.
> 
> Hrm, I don't know if that's really feasible.  It's the net_kernel's job
> to keep track of inter-node timeouts.  The timeout used is the same for
> all connections to all other nodes.  (You can have lots of "fun" with
> asymmetric timeout behavior by using different kernel net_ticktime
> values for different nodes, for very small values of "fun".)
> 
> Assuming that you could have different timeout values between nodes,
> once a net_kernel connection between two nodes is interrupted, the net
> kernel (with help from the VM, IIRC) will immediately deliver all
> monitor DOWN events.
> 
> Delaying the delivery of those {'DOWN', ...} events doesn't seem to me to
> have much useful value.  If the TCP(*) connection between node A & B is
> broken, but you delay {'DOWN', ...} events from being delivered on A
> ... then some process on node A could happily assume that it could send
> messages to B when it almost certainly cannot.
> 
> If "delay of DOWN event delivery" means being more cautious about
> whether or not a network partition has happened or if the remote node
> really did crash, you can't get that distinction from the net_kernel.
> You have to roll your own ... and prompt delivery of DOWN events is
> likely your best choice in that case, also.
> 
> -Scott
> 
> (*) Unless you're using a custom distribution protocol such as SCTP
> (I've never used it but it's alleged to exist?), it's either an
> unencrypted TCP connection or an SSL-encrypted TCP connection.

-- 
Jayson Vantuyl
417-207-6962 (mobile)
kagato@REDACTED