[erlang-questions] Split brain in distributed Erlang?

Tom Samplonius tom@REDACTED
Sun Apr 29 05:54:00 CEST 2007


----- "Jim Larson" <jim@REDACTED> wrote:
> The first thing to do when thinking about "split brain" issues
> is to ask yourself if you really need to handle the condition,
> and if so, what sort of service do you need to provide during
> the outage.  If your goal is just to keep availability as high
> as possible during the split, a quorum-based election system
> can handle failover and keep your application running - for the
> majority fragment of the split.

  Yes, the network infrastructure between nodes could fail in strange ways, though not so strange that there haven't been some recorded incidents.  Usually they are short-lived, like an STP reconvergence (45 to 90 seconds; a longer STP outage is a bit pathological).

> Of course, it's possible that you could have a multi-way split
> where no single fragment has a majority.  Even if the standard
> Erlang distribution infrastructure can provide service during
> the split and recover afterward, you've got a tough job of deciding
> what to do at the higher level, independent of whatever your
> communication infrastructure is.

  Yes, a multi-way split is possible, but for my application it is better for the nodes to stop processing and wait for a quorum to return than to risk processing the same message twice.
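
  To make the "stop and wait for a quorum" rule concrete, here is a minimal sketch.  It is in Python for brevity, not Erlang, and it is not taken from any library: the function name has_quorum and the node names are hypothetical, and it assumes the full cluster membership is fixed and known to every node in advance.

```python
def has_quorum(cluster_nodes, reachable_nodes):
    """Return True if this fragment holds a strict majority of the cluster.

    cluster_nodes:   the full, statically configured membership (assumption).
    reachable_nodes: nodes this node can currently see, INCLUDING itself.
    """
    cluster = set(cluster_nodes)
    # Ignore anything that is not a configured member of the cluster.
    visible = cluster & set(reachable_nodes)
    # Strict majority: an even split (e.g. 2 of 4) does not count as a quorum.
    return 2 * len(visible) > len(cluster)


cluster = ["a", "b", "c", "d", "e"]

# 3/2 split: only the three-node fragment may keep processing.
print(has_quorum(cluster, ["a", "b", "c"]))  # True
print(has_quorum(cluster, ["d", "e"]))       # False

# 2/2/1 split: no fragment has a majority, so every node stops and waits.
print(has_quorum(cluster, ["e"]))            # False
```

  Because the majority is strict, at most one fragment can ever pass the check, which is what prevents the same message from being processed twice.  The cost is that an all-minority split halts every node until the network heals.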

> When you've got a multi-way, all-minority split, you can make one
> of the following tough choices:
> 
> 	- You can just reason that this is an unlikely circumstance
> 	and accept an outage in this situation.
> 
> 	- You can provide a degraded level of service, disallowing
> 	some but not all operations, e.g. bringing nodes in or out,
> 	creating new top-level entities of whatever sort, etc.
> 
> 	- Lastly, you could relax consistency and let the fragments
> 	continue to function independently without coordination,
> 	and attempt to repair inconsistencies when the split
> 	brain heals.

  Yes, these are all good points.  I also wonder how Mnesia copes if two nodes holding the same table are partitioned and then reconnect.  The docs are not clear on whether updates performed during the partition are integrated afterward.  Does anyone know more about Mnesia in split-brain scenarios?

> Good luck!
> 
> Jim Larson

Tom


