[erlang-questions] Issue with failover/takeover

Fri Oct 10 12:18:38 CEST 2014

Hi guys,

To my knowledge there is no obvious way to tune the sensitivity of the
failure detector in distributed Erlang, or to replace the strategy used to
determine when a node is considered down, and hence when a takeover may or
should not occur.

If the strategy is based on timeouts then something as innocuous as a page
fault could trigger a transparent huge page fault, which in turn will
trigger a synchronous compaction, all because of a little GC or memory
churn in your IO-intensive application. That can manifest as a timeout,
which it seems could hit you as a distributed Erlang netsplit.
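
As a rough illustration of why a fixed timeout is fragile, here is a small
Python sketch. The function name, the heartbeat timestamps and the 4 second
timeout are all made up for the example; this is not how the Erlang
distribution layer is implemented, only the general shape of a
timeout-based detector. The peer never dies, it merely stalls for a while,
yet the detector declares it down.

```python
def fixed_timeout_verdicts(heartbeat_times, timeout, check_times):
    """For each check time, report whether a fixed-timeout detector
    would consider the peer down (no heartbeat seen within `timeout`)."""
    verdicts = []
    for now in check_times:
        # Heartbeats that have already arrived at time `now`.
        seen = [t for t in heartbeat_times if t <= now]
        last = max(seen) if seen else None
        down = last is None or (now - last) > timeout
        verdicts.append((now, down))
    return verdicts


# Hypothetical peer: heartbeats every second, but stalls between t=3 and
# t=8 (think of a long compaction pause), then carries on as normal.
heartbeats = [0, 1, 2, 3, 8, 9, 10]
print(fixed_timeout_verdicts(heartbeats, timeout=4, check_times=[3.5, 7.5, 8.5]))
```

The check at t=7.5 reports the peer as down even though it resumes at t=8,
which is exactly the window in which a standby could take over while the
primary is still alive.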

You could turn off transparent huge pages. But netsplit detection is only as
good as the failure detection strategy, its tuning and the quality of the
environment it's running in... So your mileage will vary.

You can successfully argue many things at this point:
* A timeout occurred, it's a netsplit, deal with it
  -> Hanging by a longer rope is still hanging...
* The timeout is set too keen; widen it and try to avoid THP-induced issues
  by disabling THP.
  -> Cures the cold of a man condemned to death by hanging
* The failure detector is broken.
  -> Can't tune it. Can't replace it. We have a problem.

So, Akash has a very valid concern.

Perhaps one of the gen_leader revival projects would work well here:

https://github.com/KirinDave/gen_leader_revival

Or even riak_core if a masterless strategy is preferred:

https://github.com/basho/riak_core

If you need active/active hot/hot or hot/warm or active/passive hot/cold
standby then variations on the OpenCall SS7 Fault Tolerance Controller have
been my starting point for HA pairs for years (careful to avoid the patented
bits):

http://www.hpl.hp.com/hpjournal/97aug/aug97a8.pdf

There are also recent adaptive failure detectors, such as phi-accrual in
Akka clustering:

http://ddg.jaist.ac.jp/pub/HDY+04.pdf

http://letitcrash.com/post/43480488964/the-new-cluster-metrics-aware-adaptive-load-balancing
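
To make the accrual idea concrete, here is a minimal Python sketch in the
spirit of the phi-accrual paper above. It assumes heartbeat inter-arrival
times are roughly normally distributed. The class name, the window size and
the min_std floor are my own illustrative choices, not Akka's
implementation. Instead of a yes/no verdict it outputs a suspicion level
phi = -log10(P(a heartbeat still arrives later than the silence so far)),
and the application picks the threshold.

```python
import math
from collections import deque


class PhiAccrualDetector:
    """Illustrative phi-accrual style detector (names are hypothetical)."""

    def __init__(self, window_size=100, min_std=0.1):
        self.intervals = deque(maxlen=window_size)  # recent inter-arrival times
        self.last_heartbeat = None
        self.min_std = min_std  # floor so perfectly regular beats don't give std=0

    def heartbeat(self, now):
        """Record a heartbeat that arrived at time `now` (seconds)."""
        if self.last_heartbeat is not None:
            self.intervals.append(now - self.last_heartbeat)
        self.last_heartbeat = now

    def phi(self, now):
        """Suspicion level at time `now`; higher means more likely down."""
        if not self.intervals:
            return 0.0  # not enough history to suspect anything
        n = len(self.intervals)
        mean = sum(self.intervals) / n
        var = sum((x - mean) ** 2 for x in self.intervals) / n
        std = max(math.sqrt(var), self.min_std)
        elapsed = now - self.last_heartbeat
        # Tail probability of a normal distribution: P(interval > elapsed).
        p_later = 0.5 * math.erfc((elapsed - mean) / (std * math.sqrt(2)))
        p_later = max(p_later, 1e-300)  # avoid log10(0) on very long silences
        return -math.log10(p_later)


detector = PhiAccrualDetector()
for t in range(11):          # heartbeats at t = 0, 1, ..., 10
    detector.heartbeat(float(t))

for now in (11.0, 11.2, 12.0):
    print(now, round(detector.phi(now), 2))
```

Suspicion rises smoothly as the silence grows, and the spread of the
observed intervals widens or narrows the tolerance automatically. A jittery
environment earns more slack than a quiet one, which is the tuning knob a
fixed timeout does not give you.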

In a nutshell, you don't need to live with bad (not tunable) or badly tuned
failure detectors.

But, unfortunately, you won't find a perfect fit. There isn't one. HA is
simply hard. And even carrier-grade, battle-hardened algorithms such as
those used in SS7 need considerable tuning to work well in a real
environment.

Back to Akash's original issue. I would have concerns with using
distributed Erlang in a production setting. On the one hand, the
failover/takeover model is simple and attractive. On the other, the fact
that TCP can be swapped out for another transport in ERTS to drive
distributed Erlang differently... well, that seems like a lot of work for
plugging in an alternate failure detector...

Cheers,

Darach.

On Fri, Oct 10, 2014 at 7:48 AM, Graham Hay <grahamrhay@REDACTED> wrote:

> http://kellabyte.com/2013/11/04/the-network-partitions-are-rare-fallacy/
>
> On 10 October 2014 06:52, Dmitry Kolesnikov <dmkolesnikov@REDACTED>
> wrote:
>
>> Hello,
>>
>> The error message explicitly says that the Erlang distribution experienced
>> a network split. There might be many reasons for that. Hard to say w/o
>> knowing your env.
>>
>> Best Regards,
>> Dmitry >-|-|-(*>
>>
>>
>> On 10.10.2014, at 5.28, Akash Chowdhury <achowdhury918@REDACTED> wrote:
>>
>> I am using the failover/takeover feature of distributed Erlang. I have a
>> primary and a secondary node in a group. Most of the time, my app is running
>> on the primary node and the secondary node is inactive. But sometimes, I am
>> seeing that my app is running on both nodes simultaneously, which is not
>> expected behavior. I know this can happen when there is a netsplit (network
>> disconnection) between two nodes. But that didn't happen in my case. From
>> system stats, it was confirmed that the network connection was intact. What
>> can be other causes for this? I see the following error message in the
>> primary node log when this issue happened:
>>
>> =ERROR REPORT==== ...
>> ** Node <secondary node> not responding **
>> ** Removing (timedout) connection **
>>
>> Any information/help regarding this will be highly appreciated.
>>
>> Thanks.
>>
>> _______________________________________________
>> erlang-questions mailing list
>> erlang-questions@REDACTED
>> http://erlang.org/mailman/listinfo/erlang-questions
>>