[erlang-questions] Issue with failover/takeover

Fri Oct 10 23:20:42 CEST 2014

Hi Ulf,

Good to know the effect of +zdbbl. I'm not sure a third party will help
deconflict in all cases, though. A good example is the Ctrl-Z test:
suspending the Erlang VM (or any OS-level process) for a reasonable period,
say 5 or 15 seconds, then bringing it back to the foreground (e.g. fg on
Linux) whilst under a reasonable production load is a great way to find
issues here.
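
One way to see what the rest of the cluster observes during such a test is
to log node status events on a peer node while the other VM is suspended.
The following is a minimal sketch, not a tested tool: the module name
node_watch is hypothetical, while net_kernel:monitor_nodes/2 and its
nodedown_reason option are standard kernel calls.

```erlang
-module(node_watch).
-export([start/0]).

%% Spawn a process that prints every nodeup/nodedown event seen locally.
start() ->
    spawn(fun() ->
        %% Subscribe to node status messages, including the nodedown reason.
        case net_kernel:monitor_nodes(true, [nodedown_reason]) of
            ok -> loop();
            Other -> io:format("cannot monitor nodes: ~p~n", [Other])
        end
    end).

loop() ->
    receive
        {nodeup, Node, _Info} ->
            io:format("~p nodeup ~p~n", [calendar:local_time(), Node]),
            loop();
        {nodedown, Node, Info} ->
            Reason = proplists:get_value(nodedown_reason, Info),
            io:format("~p nodedown ~p (~p)~n",
                      [calendar:local_time(), Node, Reason]),
            loop()
    end.
```

Running node_watch:start() on the surviving node before suspending the
other one should show whether, and when, the connection is dropped and
re-established.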

Many failure detectors in commercial and open source projects fail this
simple test. It sounds like distributed Erlang, by actively suspending the
tick process, is also less than ideal, but with careful tuning to load it
sounds reasonably manageable in practice.

Cheers,

Darach.

On Fri, Oct 10, 2014 at 9:54 PM, Ulf Wiger <ulf@REDACTED> wrote:

>
> On 10 Oct 2014, at 04:28, Akash Chowdhury <achowdhury918@REDACTED> wrote:
>
>
> >>> =ERROR REPORT==== ...
> >>> ** Node <secondary node> not responding **
> >>> ** Removing (timedout) connection **
>
>
> This message doesn’t in itself positively indicate a netsplit, but since
> you observe the app running on both nodes, this is what has happened.
>
> Specifically, the secondary node failed to send ticks (or the ticks got
> stuck/lost along the way) for about four consecutive tick intervals. The
> overall timeout, which is four tick intervals, is set using:
>
> -kernel net_ticktime Seconds
>
> Default is 60 seconds, i.e. one tick every 15 seconds. I would not set it
> to a value less than 10 seconds, except after rigorous testing under
> realistic conditions.
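
To make the arithmetic concrete: per the kernel application documentation,
a tick is sent every net_ticktime/4 seconds, and a silent node is declared
down somewhere between 0.75 and 1.25 times net_ticktime. A small sketch of
that calculation (the module and function names are hypothetical):

```erlang
-module(tick_window).
-export([window/1]).

%% Given net_ticktime in seconds, return
%% {TickInterval, EarliestDetection, LatestDetection}, all in seconds.
window(TickTime) when is_number(TickTime), TickTime > 0 ->
    Interval = TickTime / 4,
    {Interval, TickTime - Interval, TickTime + Interval}.
```

With the default, tick_window:window(60) gives {15.0, 45.0, 75.0}, so a
5 or 15 second stall should normally be survivable, whereas with a setting
of 10 the earliest detection is already at 7.5 seconds. The value in
effect on a running node can be read with net_kernel:get_net_ticktime().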
>
> There could be many reasons for the connection timing out, but the Erlang
> VM is usually not that unresponsive, if it has sufficient resources. You
> should investigate whether anything in your local environment blocks the VM
> schedulers or otherwise renders the VM unresponsive.
>
> If the network is flaky, or you’re pushing a lot of data, you might want
> to play with the
>
> +zdbbl size
>
> flag, which controls the buffer size for outgoing messages on Distributed
> Erlang. If this buffer becomes full, the VM will suspend any process that
> tries to send a message, and if this process is the ‘tick’ process, you
> have a problem. See the ‘erl’ man page for details.
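
Before tuning it, it may help to check the limit currently in effect and
to see whether processes are in fact being suspended on a busy
distribution port. A rough sketch, assuming a hypothetical module name
dist_buf; erlang:system_info(dist_buf_busy_limit) and the busy_dist_port
option of erlang:system_monitor/2 are standard BIFs:

```erlang
-module(dist_buf).
-export([limit_kb/0, watch/0]).

%% The limit is reported in bytes; +zdbbl takes kilobytes.
limit_kb() ->
    erlang:system_info(dist_buf_busy_limit) div 1024.

%% Start a process that logs each suspension on a busy distribution port.
%% Note: this replaces any system monitor already installed on the node.
watch() ->
    Pid = spawn(fun loop/0),
    erlang:system_monitor(Pid, [busy_dist_port]),
    Pid.

loop() ->
    receive
        {monitor, SusPid, busy_dist_port, Port} ->
            io:format("~p suspended on busy dist port ~p~n",
                      [SusPid, Port]),
            loop()
    end.
```

If suspensions show up under load, starting the node with a larger value,
for example erl +zdbbl 32768 (the figure is only an illustration), is one
thing to try.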
>
> The problem of determining whether to start the application or not when
> you get a ‘nodedown’ indication is a hard one. Ideally, you should
> designate a third party: if you can talk to the third party, and it tells
> you you’re ok, you’re good to go. If the third party is able to deconflict
> between two nodes pinging it at the same time, great.
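
As an illustration of the third-party idea only, here is a deliberately
naive arbiter process that would run on a third node. Everything in it
(module name, message format, the 5 second timeout) is an assumption made
for the sketch; it has no lease expiry or release, which a real system
would need.

```erlang
-module(arbiter).
-export([start/0, request/2]).

%% Start the arbiter with no current owner of the "active" role.
start() ->
    spawn(fun() -> loop(none) end).

%% Ask the arbiter whether Node may run the application.
%% No answer is treated as "do not start".
request(Arbiter, Node) ->
    Ref = make_ref(),
    Arbiter ! {request, self(), Ref, Node},
    receive
        {Ref, Reply} -> Reply
    after 5000 ->
        no_answer
    end.

%% Requests are handled one at a time, so two nodes asking at the same
%% moment are deconflicted by mailbox order.
loop(Owner) ->
    receive
        {request, From, Ref, Node} when Owner =:= none; Owner =:= Node ->
            From ! {Ref, granted},
            loop(Node);
        {request, From, Ref, _Node} ->
            From ! {Ref, denied},
            loop(Owner)
    end.
```

A node receiving a ‘nodedown’ would call arbiter:request/2 and start the
application only on granted.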
>
> A particular Erlang flavor of this problem is that the nodes may
> auto-connect again, which may, or may not, actually make things worse. One
> way to deal with that is to set -kernel dist_auto_connect once, which means
> that a node will only auto-connect with another node once in its lifetime;
> it needs to restart before it can auto-connect again. If you combine this
> with e.g. a UDP ping between the nodes, you can detect whether the nodes
> have lost contact, but are both still alive. If this occurs, you can decide
> (e.g. using the UDP messages) to restart one of the nodes.
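
A rough sketch of how such a UDP ping might look, assuming both nodes were
started with -kernel dist_auto_connect once. The module name, the packet
format (the sender's node name as a binary), the one second interval and
the tie-break rule (the node with the greater name restarts) are all
assumptions for illustration, not an established protocol.

```erlang
-module(udp_ping).
-export([start/3, decide/3]).

-define(INTERVAL_MS, 1000).

%% Open a UDP socket and start pinging the peer at PeerHost:PeerPort.
start(LocalPort, PeerHost, PeerPort) ->
    spawn(fun() ->
        {ok, Sock} = gen_udp:open(LocalPort, [binary, {active, true}]),
        self() ! tick,
        loop(Sock, PeerHost, PeerPort)
    end).

loop(Sock, PeerHost, PeerPort) ->
    receive
        tick ->
            %% Send our node name; send errors are ignored in this sketch.
            _ = gen_udp:send(Sock, PeerHost, PeerPort,
                             atom_to_binary(node(), utf8)),
            erlang:send_after(?INTERVAL_MS, self(), tick),
            loop(Sock, PeerHost, PeerPort);
        {udp, Sock, _Ip, _Port, Peer} ->
            %% The peer is alive; check whether it is still connected.
            Connected = [atom_to_binary(N, utf8) || N <- nodes()],
            Self = atom_to_binary(node(), utf8),
            case decide(Self, Peer, Connected) of
                connected -> ok;
                %% A real system would trigger its restart procedure here.
                Action -> io:format("split from ~s: ~p~n", [Peer, Action])
            end,
            loop(Sock, PeerHost, PeerPort)
    end.

%% Pure decision: alive but not connected means a split; exactly one
%% side (the greater name) is chosen to restart.
decide(Self, Peer, Connected) ->
    case lists:member(Peer, Connected) of
        true -> connected;
        false when Self > Peer -> restart_self;
        false -> wait_for_peer
    end.
```

Note that this sketch also reports a split for a peer that has simply not
connected yet, so a real implementation would need to track whether the
nodes had ever been connected.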
>
> BR,
> Ulf W
>
> Ulf Wiger, Co-founder & Developer Advocate, Feuerlabs Inc.
> http://feuerlabs.com
>