[erlang-questions] Issue with failover/takeover

Ulf Wiger ulf@REDACTED
Fri Oct 10 22:54:29 CEST 2014

On 10 Oct 2014, at 04:28, Akash Chowdhury <achowdhury918@REDACTED> wrote:

> =ERROR REPORT==== ...
> ** Node <secondary node> not responding **
> ** Removing (timedout) connection **

This message doesn’t in itself positively indicate a netsplit, but since you observe the app running on both nodes, this is what has happened.

Specifically, the secondary node failed to send ticks (or the ticks got stuck/lost along the way) for four consecutive tick intervals. Each tick interval is a quarter of the net tick time, which is set using:

-kernel net_ticktime Seconds

The default is 60 seconds, i.e. one tick every 15 seconds. I would not set it to a value less than 10 seconds, except after rigorous testing under realistic conditions.
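
As a minimal sketch, the same setting can also be placed in a sys.config file instead of on the command line. The value 60 below is just the default repeated for illustration, not a recommendation:

```erlang
%% sys.config (illustrative): same effect as "-kernel net_ticktime 60"
[
 {kernel, [
   {net_ticktime, 60}   %% seconds; this is the default value
 ]}
].
```

On a running node, net_kernel:get_net_ticktime() should report the value in effect. All nodes in the cluster should normally use the same value.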

There could be many reasons for the connection timing out, but the Erlang VM is usually not that unresponsive, if it has sufficient resources. You should investigate whether anything in your local environment blocks the VM schedulers or otherwise renders the VM unresponsive.

If the network is flaky, or you’re pushing a lot of data, you might want to play with the

+zdbbl size

flag, which controls the buffer size (in kilobytes) for outgoing messages on Distributed Erlang. If this buffer becomes full, the VM will suspend any process that tries to send a message, and if this process is the ‘tick’ process, you have a problem. See the ‘erl’ man page for details.
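
One way to check whether processes are actually being suspended on a full distribution buffer is a system monitor with the busy_dist_port option. The following is a rough sketch, not a production tool; the module name dist_busy_watch is made up for illustration. Note that a node has only one system monitor, so this replaces any monitor already installed.

```erlang
-module(dist_busy_watch).
-export([start/0]).

%% Install a system monitor that logs every process suspended because
%% a distribution port buffer is full. Also returns the current busy
%% limit in bytes, as reported by the VM.
start() ->
    Pid = spawn(fun loop/0),
    erlang:system_monitor(Pid, [busy_dist_port]),
    LimitBytes = erlang:system_info(dist_buf_busy_limit),
    {ok, Pid, LimitBytes}.

loop() ->
    receive
        {monitor, SuspendedPid, busy_dist_port, Port} ->
            error_logger:warning_msg(
              "~p suspended on busy dist port ~p~n",
              [SuspendedPid, Port]),
            loop();
        _Other ->
            loop()
    end.
```

If warnings show up regularly under normal load, that is a hint that the +zdbbl limit is too small for the traffic, or that the link cannot keep up.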

The problem of determining whether to start the application or not when you get a ‘nodedown’ indication is a hard one. Ideally, you should designate a third party: if you can talk to the third party, and it tells you you’re ok, you’re good to go. If the third party is able to deconflict between two nodes pinging it at the same time, great.
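
To make the third-party idea concrete, here is a minimal sketch of the asking side. The arbiter module and its grant/1 function are hypothetical placeholders for whatever service you designate, assumed here to return ok to exactly one requester; only rpc:call/5 is a real OTP call.

```erlang
-module(takeover_guard).
-export([may_start/1, interpret/1]).

%% Ask a third-party node whether this node may start the application.
%% 'arbiter' and grant/1 are hypothetical; they stand in for a service
%% that grants permission to at most one node at a time.
may_start(ArbiterNode) ->
    interpret(rpc:call(ArbiterNode, arbiter, grant, [node()], 5000)).

%% Only an explicit ok from the arbiter allows a start.
interpret(ok)          -> true;
interpret({badrpc, _}) -> false;  % arbiter unreachable or call failed
interpret(_Denied)     -> false.  % arbiter answered, but did not grant
```

Treating an unreachable arbiter as "do not start" is a deliberate choice in this sketch: a node that cannot reach the third party is the one more likely to be isolated.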

A particular Erlang flavor of this problem is that the nodes may auto-connect again, which may, or may not, actually make things worse. One way to deal with that is to set -kernel dist_auto_connect once, which means that a node will only auto-connect with another node once in its lifetime; it needs to restart before it can auto-connect again. If you combine this with e.g. a UDP ping between the nodes, you can detect whether the nodes have lost contact, but are both still alive. If this occurs, you can decide (e.g. using the UDP messages) to restart one of the nodes.
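
A UDP ping of this kind could look roughly like the sketch below, using gen_udp from OTP. The module name udp_alive, the payloads and the port number are all assumptions made for illustration; a real deployment would also want some authentication of the messages.

```erlang
-module(udp_alive).
-export([start_responder/1, probe/3]).

%% Start a process that answers every datagram on Port with "alive".
start_responder(Port) ->
    Parent = self(),
    Pid = spawn(fun() ->
                    {ok, Sock} = gen_udp:open(Port, [binary, {active, false}]),
                    Parent ! {self(), ready},
                    respond(Sock)
                end),
    receive
        {Pid, ready} -> {ok, Pid}
    after 5000 ->
        {error, timeout}
    end.

respond(Sock) ->
    case gen_udp:recv(Sock, 0) of
        {ok, {Addr, FromPort, _Data}} ->
            ok = gen_udp:send(Sock, Addr, FromPort, <<"alive">>),
            respond(Sock);
        {error, _Reason} ->
            ok
    end.

%% Send one probe to Host:Port; returns alive | no_reply.
probe(Host, Port, TimeoutMs) ->
    {ok, Sock} = gen_udp:open(0, [binary, {active, false}]),
    ok = gen_udp:send(Sock, Host, Port, <<"ping">>),
    Result = case gen_udp:recv(Sock, 0, TimeoutMs) of
                 {ok, {_Addr, _Port, <<"alive">>}} -> alive;
                 _ -> no_reply
             end,
    gen_udp:close(Sock),
    Result.
```

If you get a nodedown for the peer (e.g. after subscribing with net_kernel:monitor_nodes(true)) but probe/3 still returns alive, both nodes are up and only the distribution link is gone. That is the case where one of the two should restart, chosen by some fixed rule that both sides agree on.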

Ulf W

Ulf Wiger, Co-founder & Developer Advocate, Feuerlabs Inc.
