[erlang-questions] Distributed application and netsplit

Tue Nov 18 22:38:34 CET 2014

It's possible that the takeover logic wasn't implemented on network
partition heal because there's no obviously right thing to do in the
generic case when two nodes believing themselves to be masters in a
distributed system discover that they have independently been making
progress owing to a network partition.  On the other hand, the master/slave
system creates data loss anyway on node failure.  So I would personally
call what you have found a bug, and try to make a minimum example case and
see if anyone from OTP is paying attention.

Pragmatically, if somewhat hackily, it would be a few hours' work to
implement a watchdog gen_server that calls nodes() periodically and, on any
change, broadcasts info messages to all affected parties for appropriate
resolution.  You could even try to reconcile state between a slave and a
master in that scenario if you wanted to be extra clever.
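
A minimal sketch of that kind of watchdog, in case it helps.  The module
name node_watchdog, the one-second poll interval, and the
{nodes_changed, Up, Down} message shape are all made up for illustration;
what the subscribers do with the message (stop the less important copy of
the application, reconcile state, and so on) is left out entirely.

```erlang
-module(node_watchdog).
-behaviour(gen_server).

-export([start_link/1]).
-export([init/1, handle_call/3, handle_cast/2, handle_info/2]).

%% Poll interval in milliseconds (arbitrary value for this sketch).
-define(POLL_MS, 1000).

%% Subscribers is a list of pids that want to hear about membership changes.
start_link(Subscribers) ->
    gen_server:start_link({local, ?MODULE}, ?MODULE, Subscribers, []).

init(Subscribers) ->
    erlang:send_after(?POLL_MS, self(), poll),
    %% State is {KnownNodes, Subscribers}.
    {ok, {lists:sort(nodes()), Subscribers}}.

handle_call(_Request, _From, State) ->
    {reply, ok, State}.

handle_cast(_Msg, State) ->
    {noreply, State}.

handle_info(poll, {Known, Subscribers}) ->
    Current = lists:sort(nodes()),
    Up = Current -- Known,      % nodes that appeared since the last poll
    Down = Known -- Current,    % nodes that disappeared since the last poll
    case {Up, Down} of
        {[], []} ->
            ok;
        _ ->
            %% Plain messages, so a gen_server subscriber sees them
            %% in its own handle_info/2.
            lists:foreach(
              fun(Pid) -> Pid ! {nodes_changed, Up, Down} end,
              Subscribers)
    end,
    erlang:send_after(?POLL_MS, self(), poll),
    {noreply, {Current, Subscribers}};
handle_info(_Other, State) ->
    {noreply, State}.
```

A non-empty Up list after a partition is the hint that a heal may have
happened and that two copies of the application could be running.  If
polling feels too crude, net_kernel:monitor_nodes(true) delivers
{nodeup, Node} and {nodedown, Node} messages to the calling process
instead, which would replace the timer in the sketch above.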

On Tue, Nov 18, 2014 at 1:23 PM, Karolis Petrauskas <k.petrauskas@REDACTED>
wrote:

> I have read LYSE, and its section on distributed applications in
> particular. I have configured my nodes so that I can determine which
> node is more important than the others. The problem is how to stop the
> application on the less important node. This problem only occurs after
> recovery from a netsplit. Ferd's page
> (http://learnyousomeerlang.com/distributed-otp-applications) covers
> netsplits only by the following note:
>
> Note: In terms of distributed programming fallacies, distributed OTP
> applications assume that when there is a failure, it is likely due to
> a hardware failure, and not a netsplit. If you deem netsplits more
> likely than hardware failures, then you have to be aware of the
> possibility that the application is running both as a backup and main
> one, and that funny things could happen when the network issue is
> resolved. Maybe distributed OTP applications aren't the right
> mechanism for you in these cases.
>
> Now I have those "funny things".
>
> Karolis
>
> On Tue, Nov 18, 2014 at 11:04 PM, Felix Gallo <felixgallo@REDACTED>
> wrote:
> > No, it's a specific question having to do with the failover/takeover
> > mechanisms.
> >
> > If I'm understanding the problem correctly, ferd's page on the topic
> > may be handy -- specifically the bottom part talks about how to
> > configure different nodes to recognize themselves as being less
> > important than others.
> >
> > http://learnyousomeerlang.com/distributed-otp-applications
> >
> > On Tue, Nov 18, 2014 at 1:00 PM, Raoul Duke <raould@REDACTED> wrote:
> >>
> >> hello? isn't this like a canonical systems engineer question?!
> >>
> >> https://www.google.com/search?q=split+brain+interview+question+networking+storage
> >> _______________________________________________
> >> erlang-questions mailing list
> >> erlang-questions@REDACTED
> >> http://erlang.org/mailman/listinfo/erlang-questions
> >