[erlang-questions] RFC: mnesia majority checking

Ulf Wiger ulf.wiger@REDACTED
Fri Dec 10 08:44:28 CET 2010


On 9 Dec 2010, at 20:38, Morten Krogh wrote:

> Okay, but without Paxos or something similar, there will be some failure modes where the system becomes inconsistent. 

The failure mode where mnesia becomes inconsistent is split brain.
Mnesia doesn't bring a table copy online before it has been fully
synchronised with the copies that are already active. Currently, that
is done by simply copying the entire table from one of the active nodes.

After a split-brain, mnesia will detect the condition and refuse to 
sync the tables. There are ways to reconcile, and this was the 
purpose of my 'unsplit' library (http://github.com/esl/unsplit).

The main method I have used so far with unsplit is to compare 
vector clocks. This works best for asserting that there is no
inconsistency, or for merging objects where the vector clocks make it
clear which one is newer. It doesn't address cases where reconciliation
cannot be automatic.
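
For illustration, the comparison involved looks roughly like this
(a sketch only, with vector clocks represented as [{Node, Counter}]
lists; this is not unsplit's actual callback API):

compare_vclocks(A, B) ->
    Nodes = lists:usort([N || {N, _} <- A ++ B]),
    Cmp = [begin
               Ca = proplists:get_value(N, A, 0),
               Cb = proplists:get_value(N, B, 0),
               if Ca > Cb -> gt; Ca < Cb -> lt; true -> eq end
           end || N <- Nodes],
    case {lists:member(gt, Cmp), lists:member(lt, Cmp)} of
        {true, false}  -> left_newer;   %% safe to keep A
        {false, true}  -> right_newer;  %% safe to keep B
        {false, false} -> equal;        %% no inconsistency
        {true, true}   -> conflict      %% cannot reconcile automatically
    end.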

The idea with majority checking is obviously to allow for a more
controlled merge, instructing mnesia not to allow updates under
circumstances where a subsequent split-brain merge could otherwise
become ambiguous.


> When do you roll back? After the commit? I was talking about a failure after the commit decision. Rollback after commit doesn't make sense??

The rollback is done if there is any failure during the 'prepare'
phase of the commit. If all participants respond favourably to
the prepare request, it means that they have logged all the data
needed to complete the commit and are standing by to hear whether
they should go ahead.

The underlying assumption is that failures just after commit in 
mnesia will not be partial. If a node fails, it needs to come back
online by way of comparing decision logs and synchronising 
tables.
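
Schematically, the exchange has the familiar two-phase shape below
(an illustrative sketch of the general idea only, not mnesia's actual
asym_trans code):

commit(Participants, Data) ->
    %% phase 1: ask every participant to prepare (log the data)
    [P ! {prepare, self(), Data} || P <- Participants],
    case collect_votes(Participants) of
        ok ->
            %% decision reached: tell everyone to go ahead
            [P ! commit || P <- Participants],
            committed;
        {error, _} = Err ->
            %% any failure before the decision: roll back everywhere
            [P ! rollback || P <- Participants],
            Err
    end.

collect_votes([]) -> ok;
collect_votes([P | Ps]) ->
    receive
        {prepared, P} -> collect_votes(Ps);
        {refused, P}  -> {error, {refused, P}}
    after 5000 ->
            {error, timeout}
    end.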

Really, I didn't invent this protocol, and I've taken great care not
to alter it, as I've assumed it is robust. It's been in mnesia for years.
It's the same protocol that's used for schema updates in mnesia,
more heavyweight and pessimistic than the 'normal' commit
protocol. If you wish to dissect the asym_trans protocol looking for
nasty corner cases, I think that's great, but then perhaps one of the
mnesia maintainers, Dan or Håkan (still counting Håkan as such),
should join the discussion.

What I've done is make the commits even more pessimistic in the 
presence of the 'majority' flag by inserting a precondition.
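
In spirit, the precondition amounts to a check along these lines
(only a sketch of its shape, not the code from the patch):

have_majority(Tab) ->
    %% all replicas of the table, regardless of storage type
    All    = lists:append(
               [mnesia:table_info(Tab, C) ||
                   C <- [ram_copies, disc_copies, disc_only_copies]]),
    %% replicas currently reachable for writes from this node
    Active = mnesia:table_info(Tab, where_to_write),
    length(Active) * 2 > length(All).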

BR,
Ulf W


> 
> 
> Cheers,
> 
> Morten.
> 
> 
> 
> 
> On Dec 9, 2010, at 7:47 PM, Ulf Wiger wrote:
> 
>> 
>> On 9 Dec 2010, at 19:11, Morten Krogh wrote:
>> 
>>> Hi Ulf
>>> 
>>> Did you consider using the Paxos algorithm?
>> 
>> My intention for now was not to do any major surgery to
>> the mnesia transaction handler, but rather extend the existing
>> semantics with something useful.
>> 
>> So as a first step, I wanted to add the 'majority' option, since 
>> I thought that would be a simple way to add quorum-style 
>> safety and fencing in mnesia.
>> 
>>> How do you cope with node failure after the commit process has decided to commit but before the messages have arrived at the other nodes.
>> 
>> Actually, the asym_trans commit protocol in mnesia does
>> this already. This protocol is used whenever the transaction
>> contains schema updates or asymmetric replication patterns.
>> It is more heavyweight than the 'sym_trans' protocol precisely
>> because it deals with failures in the commit phase.
>> 
>> Specifically, the way it deals with failures in the commit phase 
>> is that it rolls back the transaction.
>> 
>> BR,
>> Ulf W
>> 
>> 
>>> 
>>> Morten.
>>> 
>>> 
>>> On 12/9/10 6:25 PM, Ulf Wiger wrote:
>>>> I added majority checking in the mnesia_locker as well.
>>>> The main reason for doing so (except aborting earlier),
>>>> was to enable majority checking on reads.
>>>> 
>>>> The way it works now is that majority checking is done on
>>>> reads that use a write lock (e.g. mnesia:wread/1).
>>>> A normal read, with a read lock, will succeed even in a
>>>> minority. This is probably a pretty good thing.
>>>> 
>>>> https://github.com/uwiger/otp/commit/650f8e30d205bc1130f37c819f920f901358b937
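>>>>
>>>> In application code the distinction looks like this (sketch only,
>>>> assuming a table 'pool' that has the majority flag set):
>>>>
>>>>   read_examples(Key) ->
>>>>       mnesia:transaction(
>>>>         fun() ->
>>>>                 %% plain read (read lock): succeeds even if this
>>>>                 %% node is in a minority partition
>>>>                 _Stale = mnesia:read(pool, Key),
>>>>                 %% read with a write lock: subject to the majority
>>>>                 %% check, so it aborts in a minority partition
>>>>                 mnesia:wread({pool, Key})
>>>>         end).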
>>>> 
>>>> Comments still most welcome. Monologues are fun too, but
>>>> I can follow Dan North's advice and get a rubber duck for that.
>>>> 
>>>> If you are unsure whether this is at all needed, please chime in.
>>>> It is most definitely not a stupid question.
>>>> 
>>>> BR,
>>>> Ulf W
>>>> 
>>>> On 9 Dec 2010, at 15:26, Ulf Wiger wrote:
>>>> 
>>>>> git fetch git://github.com/uwiger/otp mnesia-majority
>>>>> 
>>>>> https://github.com/uwiger/otp/commit/d97ae7d4329d9342e576f3cdd893de6865449e42
>>>>> 
>>>>> This is a first stab at a function that I believe could be useful in
>>>>> high-availability applications using mnesia.
>>>>> 
>>>>> At this stage, I'd love to have some comments, and suggestions,
>>>>> if someone thinks of a better way to do it.
>>>>> 
>>>>> From the commit message:
>>>>> 
>>>>> "Add {majority, boolean()} per-table option.
>>>>> 
>>>>> With {majority, true} set for a table, write transactions will
>>>>> abort if they cannot commit to a majority of the nodes that
>>>>> have a copy of the table. Currently, the implementation hooks
>>>>> into the prepare_commit, and forces an asymmetric transaction
>>>>> if the commit set affects any table with the majority flag set.
>>>>> In the commit itself, the transaction will abort if it cannot
>>>>> satisfy the majority requirement for all tables involved in the
>>>>> transaction.
>>>>> 
>>>>> A future optimization might be to abort as soon as a write
>>>>> lock is attempted on such a table (or object) and the lock cannot
>>>>> be set on enough nodes.
>>>>> 
>>>>> This functionality makes it possible to automatically
>>>>> "fence off" a table in the presence of failures.
>>>>> 
>>>>> This is a first implementation. Only basic tests have been
>>>>> performed."
>>>>> 
>>>>> One particular use of this functionality would be to have a "global
>>>>> resource pool" in one table with {majority, true}, and periodically
>>>>> check out resources into a local buffer. If there is a failure condition,
>>>>> you can use the local buffer, but not check out more resources, unless
>>>>> you happen to still be in contact with more than half of the replicas.
>>>>> 
>>>>> This should allow for a well-defined merge after a network split.
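>>>>>
>>>>> A checkout could then be a write transaction against the majority
>>>>> table (sketch only; the table and record layout are made up):
>>>>>
>>>>>   checkout(Id) ->
>>>>>       mnesia:transaction(
>>>>>         fun() ->
>>>>>                 case mnesia:wread({resource_pool, Id}) of
>>>>>                     [{resource_pool, Id, free}] ->
>>>>>                         %% aborts if a majority of the replicas
>>>>>                         %% cannot be reached, so a minority island
>>>>>                         %% can only use what it already holds
>>>>>                         mnesia:write({resource_pool, Id, taken});
>>>>>                     _ ->
>>>>>                         mnesia:abort(not_free)
>>>>>                 end
>>>>>         end).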
>>>>> 
>>>>> BR,
>>>>> Ulf W
>>>>> 
>>>>> Ulf Wiger, CTO, Erlang Solutions, Ltd.
>>>>> http://erlang-solutions.com
>>>>> 
>>>>> 
>>>>> 
>>>> Ulf Wiger, CTO, Erlang Solutions, Ltd.
>>>> http://erlang-solutions.com
>>>> 
>>>> 
>>>> 
>>>> 
>>> 
>>> 
>> 
>> Ulf Wiger, CTO, Erlang Solutions, Ltd.
>> http://erlang-solutions.com
>> 
>> 
>> 
>> 
> 

Ulf Wiger, CTO, Erlang Solutions, Ltd.
http://erlang-solutions.com




