[erlang-questions] monitor and failure detectors

Mon Jun 7 07:23:15 CEST 2010

Johan Montelius <johanmon@REDACTED> wrote:

jw> Ok, so it's a global parameter for all monitors. From an application
jw> level point of view it could be an advantage to have this as a
jw> per-monitor value.

Hrm, I don't know if that's really feasible.  It's the net_kernel's job
to keep track of inter-node timeouts.  The timeout used is the same for
all connections to all other nodes.  (You can have lots of "fun" with
asymmetric timeout behavior by using different kernel net_ticktime
values for different nodes, for very small values of "fun".)
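
For reference, a sketch of how the tick time is usually configured.
The value 30 is only an illustrative assumption; the default is 60
seconds, and with tick time T an unresponsive node is detected after
roughly T plus or minus 25%.

```erlang
%% sys.config fragment. The value 30 is a made-up example.
%% Use the same value on every node to avoid asymmetric timeouts.
[{kernel, [{net_ticktime, 30}]}].
```

The same setting can be given on the command line with
"erl -kernel net_ticktime 30", and a running node reports its current
value via net_kernel:get_net_ticktime().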

Even assuming that you could have different timeout values between
nodes, once the connection between two nodes is interrupted, the
net_kernel (with help from the VM, IIRC) will immediately deliver the
{'DOWN', ...} messages for every monitor on a process on the other node.
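
For illustration, this is roughly what the receiving side looks like.
The module and function names are made up for this sketch; the
noconnection and noproc reasons are the ones erlang:monitor/2
documents.

```erlang
-module(down_demo).
-export([watch/1, classify/1]).

%% Monitor Pid and block until its 'DOWN' message arrives.
watch(Pid) ->
    Ref = erlang:monitor(process, Pid),
    receive
        {'DOWN', Ref, process, Pid, Reason} ->
            classify(Reason)
    end.

%% noconnection: the connection to the node hosting Pid is gone.
classify(noconnection) -> connection_lost;
%% noproc: the process did not exist when the monitor was set up.
classify(noproc) -> not_running;
%% Anything else is the exit reason of the monitored process.
classify(Reason) -> {exited, Reason}.
```

Note that connection_lost says nothing about whether the remote process
is still alive; it only says this node can no longer reach it.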

Delaying the delivery of those {'DOWN', ...} events doesn't seem to me to
have much useful value.  If the TCP(*) connection between node A & B is
broken, but you delay {'DOWN', ...} events from being delivered on A
... then some process on node A could happily assume that it could send
messages to B when it almost certainly cannot.

If "delay of DOWN event delivery" really means being more careful about
telling a network partition apart from a genuine crash of the remote
node, you can't get that distinction from the net_kernel.  You have to
roll your own ... and prompt delivery of DOWN events is likely your best
choice in that case, also.
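
A minimal sketch of one way to "roll your own", under a purely
hypothetical policy: after a prompt noconnection DOWN, probe the node a
few times before the application treats it as gone.  The module name,
function name, and the one-second pause are all assumptions; only
net_adm:ping/1 is a real OTP call.

```erlang
-module(suspect_demo).
-export([await_verdict/2]).

%% Probe Node up to Retries times, pausing one second after each
%% failed probe. Call this after receiving a noconnection 'DOWN'.
await_verdict(Node, Retries) when Retries > 0 ->
    case net_adm:ping(Node) of
        pong -> reachable_again;
        pang ->
            timer:sleep(1000),
            await_verdict(Node, Retries - 1)
    end;
await_verdict(_Node, 0) ->
    unreachable.
```

This still cannot prove whether the node crashed or was partitioned
away; it only separates a brief hiccup from a lasting outage.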

-Scott

(*) Unless you're using a custom distribution protocol such as SCTP
(I've never used it but it's alleged to exist?), it's either an
unencrypted TCP connection or an SSL-encrypted TCP connection.