[erlang-questions] What causes nodes to become disconnected/reconnected?

Martynas Pumputis martynasp@REDACTED
Fri May 25 11:24:07 CEST 2012


Erlang doesn't detect net splits by itself. You could start by looking at
net_kernel:set_net_ticktime/2 (try increasing this value if your node is
suffering from high load/traffic) and kernel/src/dist_util.erl to get a
grasp of how Erlang handles node connections.
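
To see when (and why) a connection drops, one option is to subscribe to
node status messages with net_kernel:monitor_nodes/2. The following is only
a minimal sketch: the module name node_watch and its start/0 function are
made up for illustration and are not part of OTP or of your application.

```erlang
-module(node_watch).
-export([start/0]).

%% Hypothetical helper: spawn a process that logs every nodeup/nodedown
%% event seen by the local node.
start() ->
    spawn(fun() ->
        %% Ask net_kernel for node status messages; the nodedown_reason
        %% option adds the reason to the info list of nodedown messages.
        ok = net_kernel:monitor_nodes(true, [nodedown_reason]),
        loop()
    end).

loop() ->
    receive
        {nodeup, Node, _Info} ->
            io:format("nodeup: ~p~n", [Node]),
            loop();
        {nodedown, Node, Info} ->
            Reason = proplists:get_value(nodedown_reason, Info),
            io:format("nodedown: ~p (reason: ~p)~n", [Node, Reason]),
            loop()
    end.
```

Calling node_watch:start() on the failover node should then print a line each
time main comes or goes. net_kernel:get_net_ticktime/0 shows the tick time
currently in use (the default is 60 seconds), and it can also be set at boot
with "erl -kernel net_ticktime 120"; normally all nodes should use the same
value.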

Martynas

On Thu, May 24, 2012 at 6:23 PM, David Mercer <dmercer@REDACTED> wrote:

> (Yes, I am still working on my issue with a distributed application that
> keeps losing its connection to the other node.)
>
> The problem is not limited to running 2 nodes on the *same* host.  I was
> running the nodes on *different* hosts last night, and this morning the
> failover node had lost its connection to the main and so had started its
> own instance of the application.  Calling nodes() on the failover
> returned [].
>
> Then I started a new node on the same host as the main (to see if it would
> restore the connections), and, yes, it did.  After starting the third node,
> nodes() on the failover node now returns a list of two nodes, the two on the
> main host.  However, the application on the failover node did not shut down,
> and so it is still running on both the main and failover nodes.
>
> To summarize:
>
> 1. Distributed application running on a node on host A (“main@REDACTED”),
>    failover on node on host B (“failover@REDACTED”).
>
> 2. At some point, *failover@REDACTED* becomes disconnected from
>    *main@REDACTED*, and the application starts on *failover@REDACTED*.  Now
>    there are two instances of the application running.
>
> 3. From a network point of view, there is still (or again) a valid network
>    connection between hosts A and B.  I can’t say for sure if some
>    network/firewall/other issue caused a temporary disconnect, but I can
>    say that by the time I got in this morning, when the application was
>    running on both nodes, there was a firm network connection between the
>    two hosts.
>
> 4. Calling nodes() on *failover@REDACTED* returns [].
>
> 5. A new (failover) node was started on host A (“failover@REDACTED”).  It
>    does not start the application (which is correct, since it is already
>    running, albeit on *both* other nodes instead of just one).
>
> 6. Calling nodes() on *failover@REDACTED* now returns
>    [failover@REDACTED,main@REDACTED].
>
> 7. The application, however, is still running on *failover@REDACTED*,
>    despite the fact that nodes/0 reports a connection to *main@REDACTED*.
>
> I don’t need someone to diagnose this for me.  If someone could just
> educate me a little on how the connections work, how net splits are
> detected and nodes disconnected, etc., I might be able to take it from
> there.  Anyone know enough and have the time to type out a little blurb?  I
> can read source code, but having a little background knowledge would help
> put it into context for me.
>
> Thank-you!
>
> Cheers,
>
> David
>
> _______________________________________________
> erlang-questions mailing list
> erlang-questions@REDACTED
> http://erlang.org/mailman/listinfo/erlang-questions
>
>