[erlang-questions] mnesia dirty writes & race conditions

Mon May 4 17:30:42 CEST 2015

On 05/04, Jesper Louis Andersen wrote:
>But chances are you don't need linearizability for the operation in the
>first place. And then, you can avoid having to coordinate as you may be
>able to put yourself in AP instead. AP is typically much faster due to the
>lack of coordination, but do see the work of e.g., Neha Narula (et.al.) for
>counterexamples to this.

This is an interesting point, also made in "Highly Available
Transactions: Virtues and Limitations" by Peter Bailis et al.
(http://www.bailis.org/papers/hat-vldb2014.pdf) (see section 3 on page
3):

> Accordingly, to increase concurrency, database systems offer a range 
> of ACID properties weaker than serializability: the host of so-called 
> weak isolation models describe varying restrictions on the space of 
> schedules that are allowable by the system. None of these weak 
> isolation models guarantees serializability, but, as we see below, 
> their benefits are often considered to outweigh costs of possible 
> consistency anomalies that might arise from their use.

Specifically, table 2 (http://i.imgur.com/7Lw9lBd.png) shows that
databases such as MySQL, Postgres, and Oracle may offer serializability
as an option, but by default provide much weaker guarantees (repeatable
read or read committed), both of which can be provided by
highly available transactions.

Repeatable Read (RR) is defined as follows, which I believe is pretty
much what MVCC gives you:

> the ANSI standardized implementation-agnostic definition
> is achievable and directly captures the spirit of the term: if a 
> transaction reads the same data more than once, it sees the same value 
> each time (preventing “Fuzzy Read”).
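
A rough sketch of that definition, in Python. Everything here is made up
for illustration (`CachingTxn`, the plain dict standing in for committed
state); it is not how any particular database implements RR. The
transaction simply remembers the first value it read for each key and
keeps returning it:

```python
class CachingTxn:
    """Hypothetical client-side transaction that remembers what it has read."""

    def __init__(self, store):
        self.store = store      # shared dict standing in for committed state
        self.seen = {}          # key -> value first read by this transaction

    def read(self, key):
        # Only go to the store the first time a key is read.
        if key not in self.seen:
            self.seen[key] = self.store[key]
        return self.seen[key]


def fuzzy_read_demo():
    store = {"x": 1}
    txn = CachingTxn(store)

    plain_first = store["x"]        # uncached read
    cached_first = txn.read("x")    # cached read

    store["x"] = 2                  # another transaction commits a new value

    plain_second = store["x"]       # uncached re-read sees the change (fuzzy read)
    cached_second = txn.read("x")   # cached re-read still sees the old value
    return (plain_first, plain_second), (cached_first, cached_second)
```

Here `fuzzy_read_demo()` returns `((1, 2), (1, 1))`: the uncached reads
disagree with each other, the cached ones do not. Note that nothing in
this needs coordination with other nodes.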

Read Committed (RC) is defined as:

> Under Read Committed, transactions should not access uncommitted or
> intermediate versions of data items. This prohibits both “Dirty
> Writes” [...] and also “Dirty Reads”

And that's about it. This tells you multiple transactions could happen 
at the same time and result in a non-linearizable history. Two RC 
transactions could both operate at once, and through some interleavings 
of read and write locks across transactions, give you results that would 
not make sense without the specific concurrent interleaving they have 
seen. They would not be linearizable or serializable.
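
To make that concrete, here is a minimal sketch in Python of one such
interleaving (a lost update). The store and its `begin`/`read`/`write`/
`commit` functions are hypothetical names of mine, and I am assuming
buffered writes rather than actual read and write locks. Both
transactions only ever see committed data, so neither does a dirty read
or a dirty write, yet the result matches no serial order:

```python
class ToyStore:
    """Hypothetical in-memory store: reads only ever see committed values."""

    def __init__(self, data):
        self.committed = dict(data)

    def begin(self):
        # Per-transaction buffer of uncommitted writes.
        return {"writes": {}}

    def read(self, txn, key):
        # A transaction sees its own writes, otherwise committed state only.
        return txn["writes"].get(key, self.committed[key])

    def write(self, txn, key, value):
        # Buffered: invisible to other transactions until commit.
        txn["writes"][key] = value

    def commit(self, txn):
        self.committed.update(txn["writes"])


def run_interleaved():
    store = ToyStore({"counter": 0})
    t1, t2 = store.begin(), store.begin()
    v1 = store.read(t1, "counter")      # T1 reads 0
    v2 = store.read(t2, "counter")      # T2 reads 0, also committed state
    store.write(t1, "counter", v1 + 1)
    store.write(t2, "counter", v2 + 1)
    store.commit(t1)
    store.commit(t2)                    # overwrites T1's increment
    return store.committed["counter"]


def run_serial():
    store = ToyStore({"counter": 0})
    for _ in range(2):
        t = store.begin()
        store.write(t, "counter", store.read(t, "counter") + 1)
        store.commit(t)
    return store.committed["counter"]
```

`run_interleaved()` returns 1 while `run_serial()` returns 2: two
increments ran, but only one of them survived.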

An important note here is that a highly available transaction (HAT) is
defined as a transaction that eventually commits if it can contact at
least one replica for each of the data items it attempts to touch. This
is slightly different from the usual "can I write to this row on any
given node", but it does mean that some transactions could still work
under RR or RC despite multiple failures (even of a majority of nodes).
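
As a sketch of that condition in Python (the function names, node names
and replica layout are all invented for the example, and this only
captures the reachability part, not the "eventually commits" part):

```python
def hat_can_proceed(items, replicas, reachable):
    """True if every item touched has at least one reachable replica.

    items:     keys the transaction attempts to touch
    replicas:  dict mapping key -> set of nodes holding a copy
    reachable: set of nodes the client can currently contact
    """
    return all(
        any(node in reachable for node in replicas[item])
        for item in items
    )


def has_majority(reachable, all_nodes):
    """What a majority-quorum scheme would need instead, for comparison."""
    return len(reachable & all_nodes) > len(all_nodes) // 2


# Hypothetical five-node cluster with two data items.
ALL_NODES = {"n1", "n2", "n3", "n4", "n5"}
REPLICAS = {
    "x": {"n1", "n2", "n3"},
    "y": {"n3", "n4", "n5"},
}
```

With only `n3` reachable, `hat_can_proceed(["x", "y"], REPLICAS, {"n3"})`
is true even though four of the five nodes are gone and
`has_majority({"n3"}, ALL_NODES)` is false. With only `n1` and `n2`
reachable, a transaction touching `y` cannot proceed, but one touching
only `x` still can.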

> As shown in Table 2, only three out of 18 databases provided 
> serializability by default, and eight did not provide serializability 
> as an option at all. [...] Given that these weak transactional models 
> are frequently used, our inability to provide serializability in 
> arbitrary HATs appears non-fatal for practical applications.

It is not explained outright, but I'm guessing the reason why many of
these transactions are *not* made highly available in common RDBMSs is
that they're seen more as optimizations for speed, that high
availability wouldn't fit their model very well in the large, or that
people want the ability to add serializability as a safety guarantee
without tearing down their whole infrastructure.

Many DBs' default transaction mechanisms have semantics that could lend
themselves to higher availability, but their implementations just don't
appear to support it.