[erlang-questions] What causes nodes to become disconnected/reconnected?
Martynas Pumputis
martynasp@REDACTED
Fri May 25 11:24:07 CEST 2012
Erlang doesn't detect net splits by itself. You could start by looking at
net_kernel:set_net_ticktime/2 (try increasing this value if your node is
suffering from high load/traffic) and at kernel/src/dist_util.erl to get a
grasp of how Erlang handles node connections.
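
To see when and why a connection drops, something like the following might
help. It is only a sketch: the module name node_watch and the helper
describe/1 are made up for illustration, while net_kernel:monitor_nodes/2
with the nodedown_reason option is part of the kernel application.

```erlang
-module(node_watch).
-export([start/0, describe/1]).

%% Spawn a process that logs every nodeup/nodedown event seen by this node.
start() ->
    spawn(fun() ->
        %% Ask net_kernel to send nodeup/nodedown messages to this process,
        %% including the reason a connection went down.
        case net_kernel:monitor_nodes(true, [nodedown_reason]) of
            ok -> loop();
            Other -> exit({monitor_nodes_failed, Other})
        end
    end).

loop() ->
    receive
        Event ->
            error_logger:info_msg("~p~n", [describe(Event)]),
            loop()
    end.

%% Turn a monitor_nodes message into a compact term for logging
%% (hypothetical helper, not part of OTP).
describe({nodeup, Node, _Info}) ->
    {up, Node};
describe({nodedown, Node, Info}) ->
    {down, Node, proplists:get_value(nodedown_reason, Info)};
describe(Other) ->
    {unexpected, Other}.
```

Run node_watch:start() on the failover node and check the log after the next
disconnect; a reason of net_tick_timeout would point at the tick time. In that
case you could try net_kernel:set_net_ticktime(120, 60) on every node. The
value 120 is just an example; the default net_ticktime is 60 seconds, and all
connected nodes should use the same value.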
Martynas
On Thu, May 24, 2012 at 6:23 PM, David Mercer <dmercer@REDACTED> wrote:
> (Yes, I am still working on my issue with a distributed application that
> keeps losing its connection to the other node.)
>
> The problem is not limited to having 2 nodes on the *same* host. I was
> running the nodes on *different* hosts last night, and this morning the
> failover node had lost its connection to the main and so had started its
> own instance of the application. Calling nodes() on the failover
> returned [].
>
> Then I started a new node on the same host as the main (to see if it would
> restore the connections), and, yes, it did. After starting the third node,
> nodes() on the failover node now returns a list of two nodes, the two on the
> main host. However, the application on the failover node did not shut down,
> and so it is still running on both the main and failover nodes.
>
> To summarize:
>
> 1. Distributed application running on a node on host A (“main@REDACTED”),
>    failover on node on host B (“failover@REDACTED”).
>
> 2. At some point, *failover@REDACTED* becomes disconnected from
>    *main@REDACTED*, and the application starts on *failover@REDACTED*. Now
>    there are two instances of the application running.
>
> 3. From a network point of view, there is still (or again) a valid network
>    connection between hosts A and B. I can’t say for sure if some
>    network/firewall/other issue caused a temporary disconnect, but I can
>    say that by the time I got in this morning, when the application was
>    running on both nodes, there was a firm network connection between the
>    two hosts.
>
> 4. Calling nodes() on *failover@REDACTED* returns [].
>
> 5. A new (failover) node was started on host A (“failover@REDACTED”).
>    It does not start the application (which is correct, since it is already
>    running, albeit on *both* other nodes instead of just one).
>
> 6. Calling nodes() on *failover@REDACTED* now returns
>    [failover@REDACTED,main@REDACTED].
>
> 7. The application, however, is still running on *failover@REDACTED*,
>    despite the fact that nodes/0 reports a connection to *main@REDACTED*.
>
> I don’t need someone to diagnose this for me. If someone could just
> educate me a little on how the connections work, how net splits are
> detected and nodes disconnected, etc., I might be able to take it from
> there. Does anyone know enough and have the time to type out a little blurb?
> I can read source code, but having a little background knowledge would help
> put it into context for me.
>
> Thank you!
>
> Cheers,
>
> David