[erlang-questions] What causes nodes to become disconnected/reconnected?

David Mercer dmercer@REDACTED
Thu May 24 18:23:49 CEST 2012


(Yes, I am still working on my issue with a distributed application that
keeps losing its connection to the other node.)

The problem is not limited to running two nodes on the same host.  I was
running the nodes on different hosts last night, and by this morning the
failover node had lost its connection to the main node and so had started its
own instance of the application.  Calling nodes() on the failover returned [].

Then I started a new node on the same host as the main node (to see whether it
would restore the connections), and, yes, it did.  After I started the third
node, nodes() on the failover node returned a list of two nodes, the two on the
main host.  However, the application on the failover node did not shut down,
so it is still running on both the main and failover nodes.

To summarize:

1. Distributed application running on a node on host A ("main@REDACTED"),
   with failover to a node on host B ("failover@REDACTED").

2. At some point, failover@REDACTED becomes disconnected from main@REDACTED, and
   the application starts on failover@REDACTED.  Now there are two instances of
   the application running.

3. From a network point of view, there is still (or again) a valid network
   connection between hosts A and B.  I can't say for sure whether some
   network, firewall, or other issue caused a temporary disconnect, but I can
   say that by the time I got in this morning, when the application was
   running on both nodes, there was a firm network connection between the two
   hosts.

4. Calling nodes() on failover@REDACTED (host B) returns [].

5. A new (failover) node was started on host A ("failover@REDACTED").  It does
   not start the application (which is correct, since it is already running,
   albeit on both other nodes instead of just one).

6. Calling nodes() on failover@REDACTED (host B) now returns
   [failover@REDACTED,main@REDACTED], the two nodes on host A.

7. The application, however, is still running on failover@REDACTED (host B),
   despite the fact that nodes/0 reports a connection to main@REDACTED.
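
For anyone trying to reproduce this, connection changes like the one in step 2
could be logged with something along these lines.  This is only a minimal
sketch, not what I am actually running: the module name node_watch and its
functions are made up for illustration, and it assumes that
net_kernel:monitor_nodes(true) delivers {nodeup, Node} and {nodedown, Node}
messages to the calling process.

```erlang
-module(node_watch).
-export([start/0]).

%% Hypothetical helper (not part of the application described above).
%% Spawns a process that subscribes to node status changes and logs each
%% one together with the current result of nodes().
start() ->
    spawn(fun() ->
                  ok = net_kernel:monitor_nodes(true),
                  loop()
          end).

%% Log every nodeup/nodedown message until a stop message arrives.
loop() ->
    receive
        {nodeup, Node} ->
            error_logger:info_msg("nodeup ~p, nodes() = ~p~n",
                                  [Node, nodes()]),
            loop();
        {nodedown, Node} ->
            error_logger:info_msg("nodedown ~p, nodes() = ~p~n",
                                  [Node, nodes()]),
            loop();
        stop ->
            ok
    end.
```

Calling node_watch:start() on both the main and the failover node would at
least record when each side saw the other go down and come back.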

I don't need someone to diagnose this for me.  If someone could just educate
me a little on how the connections work, how netsplits are detected, how
nodes get disconnected, etc., I might be able to take it from there.  Does
anyone know enough, and have the time, to type out a little blurb?  I can read
source code, but having a little background knowledge would help put it into
context for me.

Thank you!

Cheers,

David
