[erlang-questions] Issue with failover/takeover

T Ty tty.erlang@REDACTED
Fri Oct 10 14:09:09 CEST 2014


Here is a possible non-code solution: network-bonding your server-to-server
link, with a dose of multihoming at the end-points, would deal with some of
this.



On Fri, Oct 10, 2014 at 11:18 AM, Darach Ennis <darach@REDACTED> wrote:

> Hi guys,
>
> To my knowledge there is no obvious way to tune the sensitivity of the
> failure detector in distributed Erlang, or to replace the strategy used
> to determine when a node is considered down, and hence when a takeover
> may or may not occur.
>
> If the strategy is based on timeouts, then something as innocuous as a
> page fault could trigger a transparent huge page (THP) fault, which in
> turn can trigger a synchronous compaction, all because of a little GC or
> memory churn in your IO-intensive application. That can manifest as a
> timeout, which it seems could hit you as a distributed Erlang netsplit.
>
> You could turn off transparent huge pages. But netsplit detection is only
> as good as the failure detection strategy, the tuning, and the quality of
> the environment it's running in... so your mileage will vary.
>
> You can argue successfully many things at this point:
> * A timeout occurred, it's a netsplit, deal with it
>   -> Hanging by a longer rope is still hanging...
> * The timeout is set too keen; widen it, and try to avoid THP-induced
>   issues by disabling THP.
>   -> Cures the cold of a man condemned to death by hanging
> * The failure detector is broken.
>   -> Can't tune it. Can't replace it. We have a problem.
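>
> For what it's worth, the one coarse knob that does exist is the kernel
> net_ticktime parameter: with the default of 60 seconds, a silent peer is
> declared down after roughly 45-75 seconds. Widening it, per the second
> bullet, might look like this (a sketch only; the value 120 is purely
> illustrative):

```erlang
%% sys.config sketch: widen the distributed-Erlang tick timeout.
%% With net_ticktime = T, a silent peer is declared down after
%% roughly 0.75*T to 1.25*T seconds (default T is 60).
[
 {kernel, [
   {net_ticktime, 120}
 ]}
].
```

> It can also be changed on a running node with
> net_kernel:set_net_ticktime/1, though that only moves the threshold; it
> doesn't change the detector itself.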
>
> So, Akash has a very valid concern.
>
> Perhaps one of the gen_leader revival projects would work well here:
>
> https://github.com/KirinDave/gen_leader_revival
>
> Or even riak_core, if a masterless strategy is preferable:
>
> https://github.com/basho/riak_core
>
> If you need active/active (hot/hot or hot/warm) or active/passive
> (hot/cold) standby, then variations on the OpenCall SS7 Fault Tolerance
> Controller have been my starting point for HA pairs for years (careful to
> avoid the patented bits):
>
> http://www.hpl.hp.com/hpjournal/97aug/aug97a8.pdf
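>
> The basic shape of the active/passive idea in Erlang terms -- a standby
> that watches the active node and promotes itself -- might be sketched
> like this. This is NOT the OpenCall controller; the module and the
> promote/0 hook are made-up names for illustration:

```erlang
%% Minimal active/passive standby sketch. The standby asks the runtime
%% to monitor the active node and promotes itself when that node is
%% reported down. Note the caveat in the comment: "down" here is the
%% same timeout-based verdict discussed above.
-module(ha_standby).
-export([start/1]).

start(ActiveNode) ->
    spawn(fun() ->
        %% Deliver {nodedown, ActiveNode} to this process when the
        %% distribution layer declares the peer gone.
        erlang:monitor_node(ActiveNode, true),
        standby_loop(ActiveNode)
    end).

standby_loop(ActiveNode) ->
    receive
        {nodedown, ActiveNode} ->
            %% Beware: this may be a netsplit, not a dead node --
            %% promoting here is exactly how you end up active/active.
            promote();
        _Other ->
            standby_loop(ActiveNode)
    end.

promote() ->
    %% Application-specific takeover logic would go here.
    ok.
```

> The hard part, as the SS7 paper discusses, is everything this sketch
> omits: fencing, arbitration, and deciding who wins after a split heals.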
>
> Recent adaptive failure detectors, such as phi-accrual in Akka clustering:
>
> http://ddg.jaist.ac.jp/pub/HDY+04.pdf
>
>
> http://letitcrash.com/post/43480488964/the-new-cluster-metrics-aware-adaptive-load-balancing
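>
> The phi-accrual idea in a nutshell: instead of a binary up/down verdict
> at a fixed timeout, compute a continuously growing suspicion value from
> the history of heartbeat arrivals. A toy version (using an exponential
> inter-arrival model rather than the paper's normal distribution; module
> and function names are made up):

```erlang
%% Toy phi-accrual suspicion value. Phi is defined as
%%   phi(t) = -log10(P(silence of length t | node alive)),
%% and under an exponential inter-arrival model with mean M,
%%   P(silence > t) = exp(-t/M), so phi = t / (M * ln 10).
%% Phi grows smoothly with silence instead of flipping at a timeout.
-module(phi_accrual).
-export([phi/2]).

%% TimeSinceLast and MeanInterval in the same unit (e.g. ms).
phi(TimeSinceLast, MeanInterval) when MeanInterval > 0 ->
    TimeSinceLast / (MeanInterval * math:log(10.0)).
```

> The application then picks a threshold to taste: phi >= 8 means the
> model gives about a 1-in-10^8 chance that a live node would have stayed
> silent this long, so sensitivity becomes a tunable policy rather than a
> hard-coded timeout.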
>
> In a nutshell, you don't need to live with bad (non-tunable) or badly
> tuned failure detectors.
>
> But, unfortunately, you won't find a perfect fit. There isn't one. HA is
> simply hard. And even carrier-grade, battle-hardened algorithms such as
> SS7's need considerable tuning to work well in a real environment.
>
> Back to Akash's original issue. I would have concerns about using
> distributed Erlang in a production setting. On the one hand, the
> failover/takeover model is simple and attractive. On the other, the fact
> that TCP can be swapped out for another transport in ERTS to drive
> distributed Erlang differently... well, that seems like a lot of work
> just to plug in an alternate failure detector...
>
> Cheers,
>
> Darach.
>
> On Fri, Oct 10, 2014 at 7:48 AM, Graham Hay <grahamrhay@REDACTED> wrote:
>
>> http://kellabyte.com/2013/11/04/the-network-partitions-are-rare-fallacy/
>>
>> On 10 October 2014 06:52, Dmitry Kolesnikov <dmkolesnikov@REDACTED>
>> wrote:
>>
>>> Hello,
>>>
>>> The error message explicitly says that the Erlang distribution
>>> experienced a network split. There might be many reasons for that. Hard
>>> to say w/o knowing your env.
>>>
>>> Best Regards,
>>> Dmitry >-|-|-(*>
>>>
>>>
>>> On 10.10.2014, at 5.28, Akash Chowdhury <achowdhury918@REDACTED> wrote:
>>>
>>> I am using the failover/takeover feature of distributed Erlang. I have a
>>> primary and a secondary node in a group. Most of the time, my app runs
>>> on the primary node and the secondary node is inactive. But sometimes I
>>> see my app running on both nodes simultaneously, which is not the
>>> expected behavior. I know this can happen when there is a netsplit
>>> (network disconnection) between the two nodes. But that didn't happen in
>>> my case. From system stats, it was confirmed that the network connection
>>> was intact. What else could cause this? I see the following error message
>>> in the primary node's log when this issue happens:
>>>
>>> =ERROR REPORT==== ...
>>> ** Node <secondary node> not responding **
>>> ** Removing (timedout) connection **
>>>
>>> Any information/help regarding this will be highly appreciated.
>>>
>>> Thanks.
>>>
>>> _______________________________________________
>>> erlang-questions mailing list
>>> erlang-questions@REDACTED
>>> http://erlang.org/mailman/listinfo/erlang-questions
>>>
>>>
>>
>

