[erlang-questions] Issue with failover/takeover
Fri Oct 10 12:18:38 CEST 2014
To my knowledge there is no obvious way to tune the sensitivity of the
failure detector in distributed Erlang, or to replace the strategy used to
determine when a node is considered down, and hence when a takeover may or
should not occur.
If the strategy is based on timeouts then something as innocuous as a page
fault could trigger a transparent huge page fault, which in turn will
trigger synchronous compaction, all because of a little GC or memory churn
in your IO intensive application. That can manifest as a timeout, which
could hit you as a distributed Erlang netsplit.
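
To make that failure mode concrete, here is a minimal Python sketch of a
fixed-timeout detector. It is not how distributed Erlang implements its
tick mechanism; the class name, the 4 second timeout and the simulated
stall are assumptions chosen purely for illustration.

```python
class FixedTimeoutDetector:
    """Suspects a peer once no heartbeat has arrived within timeout_s."""

    def __init__(self, timeout_s):
        self.timeout_s = timeout_s
        self.last_heartbeat = None

    def heartbeat(self, now):
        # Record the arrival time of the latest heartbeat.
        self.last_heartbeat = now

    def is_suspected(self, now):
        # No heartbeat seen yet: nothing to judge.
        if self.last_heartbeat is None:
            return False
        return (now - self.last_heartbeat) > self.timeout_s


def simulate_stall():
    """Peer sends heartbeats every second, then stalls (hypothetical
    compaction pause) and resumes at t=8.0 without ever having died."""
    detector = FixedTimeoutDetector(timeout_s=4.0)
    for t in (0.0, 1.0, 2.0, 3.0):
        detector.heartbeat(t)
    suspected_during_stall = detector.is_suspected(7.5)
    detector.heartbeat(8.0)
    suspected_after_resume = detector.is_suspected(8.5)
    return suspected_during_stall, suspected_after_resume
```

In this sketch the peer is suspected during the stall and trusted again
once it resumes. A standby that acted on the first answer would already
have taken over, which is the "app running on both nodes" symptom.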
You could turn off transparent huge pages. But netsplit detection is only
as good as the failure detection strategy, the tuning and the quality of
the environment it's running in... So your mileage will vary.
You can successfully argue many things at this point:
* A timeout occurred, it's a netsplit, deal with it
-> Hanging by a longer rope is still hanging...
* The timeout is set too keen, widen it and try to avoid THP induced issues
by disabling it.
-> Cures the cold of a man condemned to death by hanging
* The failure detector is broken.
-> Can't tune it. Can't replace it. We have a problem.
So, Akash has a very valid concern.
Perhaps one of the gen_leader revival projects would work well here:
Or even riak_core if a masterless strategy is preferable:
If you need active/active hot/hot or hot/warm or active/passive hot/cold
standby then variations on the OpenCall SS7 Fault Tolerance Controller
have been my starting point for HA pairs for years (careful to avoid the
patented bits):
Recent adaptive failure detectors such as phi-accrual in akka clustering:
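
As a rough illustration of what an accrual detector does differently, here
is a small Python sketch of the phi-accrual idea: instead of a yes/no
timeout it outputs a suspicion level that grows the longer a heartbeat is
overdue relative to the observed arrival history. This is not Akka's code.
It assumes heartbeat intervals are roughly normally distributed, and the
class name, window size and minimum standard deviation are hypothetical
choices.

```python
import math
from collections import deque


class PhiAccrualDetector:
    """Suspicion level phi = -log10(P(next heartbeat arrives even later))."""

    def __init__(self, window=100, min_std_s=0.1):
        self.intervals = deque(maxlen=window)  # recent inter-arrival times
        self.min_std_s = min_std_s             # floor to avoid a zero spread
        self.last = None

    def heartbeat(self, now):
        if self.last is not None:
            self.intervals.append(now - self.last)
        self.last = now

    def phi(self, now):
        # Without any interval history there is nothing to compare against.
        if not self.intervals:
            return 0.0
        n = len(self.intervals)
        mean = sum(self.intervals) / n
        var = sum((x - mean) ** 2 for x in self.intervals) / n
        std = max(math.sqrt(var), self.min_std_s)
        elapsed = now - self.last
        # Upper tail of the normal distribution at `elapsed`.
        p_later = 0.5 * math.erfc((elapsed - mean) / (std * math.sqrt(2)))
        if p_later <= 0.0:
            return math.inf
        return -math.log10(p_later)
```

The caller picks a threshold on phi (the value to use is a tuning decision,
not something this sketch prescribes), so the same detector adapts to a
jittery link or a pause-prone host by widening its own estimate of what
"late" means.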
In a nutshell, you don't need to live with bad (not tunable) or badly tuned
failure detectors.
But, unfortunately, you won't find a perfect fit. There isn't any. HA is
simply hard. And even carrier grade and battle hardened algorithms such as
SS7 need considerable tuning to work well in a real environment.
Back to Akash's original issue. I would have concerns with using this
in a production setting. On the one hand, the failover/takeover model is
attractive. The fact that TCP can be swapped out for another transport in
order to drive distributed Erlang differently... well, that seems like a
lot of work for plugging in an alternate failure detector...
On Fri, Oct 10, 2014 at 7:48 AM, Graham Hay <grahamrhay@REDACTED> wrote:
> On 10 October 2014 06:52, Dmitry Kolesnikov <dmkolesnikov@REDACTED> wrote:
>> The error message explicitly says that Erlang distribution experienced
>> network split. There might be many reasons for that. Hard to say w/o
>> knowing your env.
>> Best Regards,
>> Dmitry >-|-|-(*>
>> On 10.10.2014, at 5.28, Akash Chowdhury <achowdhury918@REDACTED> wrote:
>> I am using the failover/takeover feature of distributed Erlang. I have a
>> primary and a secondary node in a group. Most of the time, my app is
>> running on the primary node and the secondary node is inactive. But
>> sometimes I am seeing that my app is running on both nodes simultaneously,
>> which is not expected behavior. I know this can happen when there is a
>> netsplit (network disconnection) between two nodes. But that didn't happen
>> in my case. From system stats, it was confirmed that the network
>> connection was intact. What can be other causes for this? I see the
>> following error message in the primary node log when this issue happened:
>> =ERROR REPORT==== ...
>> ** Node <secondary node> not responding **
>> ** Removing (timedout) connection **
>> Any information/help regarding this will be highly appreciated.
>> erlang-questions mailing list