[erlang-questions] Monitor process on remote node

Fri Jun 5 13:21:10 CEST 2015

On Friday, 5 June 2015 13:53:40 Alex Gunin wrote:
> We have two processes: P1 on node N1 and P2 on node N2.
> P1 is monitoring P2 and P2 is monitoring P1.
> Is it possible, after some network failures/problems, that P1 receives a {'DOWN',…} message but P2 doesn't? Both processes stay alive all this time.

From the perspective of either P1 or P2 there is no difference between a network failure and a process crash: either process becoming unavailable for whatever reason is a failure that generates a 'DOWN' message.
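
To make that concrete, here is a minimal sketch of what the monitoring side might look like. The module and function names (monitor_demo, watch/1, await_down/2, local_demo/0) are made up for illustration. The only visible difference between the two failure modes is the reason term: a lost connection to the remote node is reported with the reason noconnection, while a crash carries the process's exit reason.

```erlang
-module(monitor_demo).
-export([watch/1, local_demo/0]).

%% Monitor Pid (local or remote) and block until a 'DOWN' message arrives.
watch(Pid) when is_pid(Pid) ->
    Ref = erlang:monitor(process, Pid),
    await_down(Ref, Pid).

%% Hypothetical helper: wait for the 'DOWN' message matching this monitor.
await_down(Ref, Pid) ->
    receive
        {'DOWN', Ref, process, Pid, noconnection} ->
            %% The connection to the node hosting Pid was lost. The process
            %% itself may well still be alive on the other side.
            {down, noconnection};
        {'DOWN', Ref, process, Pid, Reason} ->
            %% The process exited (or did not exist: Reason = noproc).
            {down, Reason}
    end.

%% Single-node demonstration: the monitor is set up before the process
%% is told to exit, so the exit reason is what gets reported.
local_demo() ->
    Pid = spawn(fun() -> receive stop -> exit(boom) end end),
    Ref = erlang:monitor(process, Pid),
    Pid ! stop,
    await_down(Ref, Pid).
```

On a single node, monitor_demo:local_demo() returns {down, boom}. Calling watch/1 on a pid that lives on another node behaves the same way, except that a dropped connection shows up as {down, noconnection}.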

There may be some amazing edge case where the gap between N1 and N2 recognizing the network problem is significant, but that is part of why synchronous messaging is built on top of asynchronous messaging (which is true of both OTP 'call' and TCP).

Now that I've mentioned TCP, are there cases where one side of the connection doesn't recognize that the connection on the other end has been dropped? Of course -- temporarily. There are brief periods of this that must exist in distributed Erlang as well, but that's part of what timeouts are for.
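
As a rough illustration of a synchronous call layered on asynchronous messages with a timeout, here is a sketch. It is only in the spirit of what OTP's 'call' does, not its actual implementation, and the module name, the function names and the {request, ...}/{reply, ...} message shapes are all invented for this example.

```erlang
-module(call_demo).
-export([call/3, echo_demo/0]).

%% Send Request to Pid and wait for a reply, giving up after Timeout ms.
%% The monitor reference doubles as a unique tag for matching the reply.
call(Pid, Request, Timeout) ->
    Ref = erlang:monitor(process, Pid),
    Pid ! {request, self(), Ref, Request},
    receive
        {reply, Ref, Reply} ->
            %% Drop the monitor and flush any 'DOWN' already in the mailbox.
            erlang:demonitor(Ref, [flush]),
            {ok, Reply};
        {'DOWN', Ref, process, Pid, Reason} ->
            %% Process crash or lost connection (Reason = noconnection).
            {error, Reason}
    after Timeout ->
        %% Neither a reply nor a 'DOWN' arrived in time; this covers the
        %% window in which the failure has not been detected yet.
        erlang:demonitor(Ref, [flush]),
        {error, timeout}
    end.

%% Single-node demonstration with a one-shot echo process.
echo_demo() ->
    Pid = spawn(fun() ->
        receive
            {request, From, Ref, Msg} -> From ! {reply, Ref, Msg}
        end
    end),
    call(Pid, hello, 1000).
```

call_demo:echo_demo() returns {ok, hello}. The point of the after clause is that the caller never depends on both sides noticing the failure at the same moment: whichever of reply, 'DOWN' or timeout comes first ends the wait.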

In any case, the runtime's support for monitors and links has been so robust that I've never encountered a situation where a node dropping off introduced unexpected hangs in my system. That is, unless I have let some network fallacies creep into my code... especially when I initially code assuming everything is running within a single node and start cheating here and there for performance. (So far that has always turned out to be optimization I never needed anyway! Ugh!)

It would be very interesting to learn about the exact mechanism underlying Erlang's distributed monitor/link functionality -- but at the moment I'm too busy trying to solve customer problems to care beyond the fact that it works in a remarkably reliable way.

-Craig