unsplit - resolving mnesia inconsistencies

Thu Feb 4 11:54:06 CET 2010

I have had some modest success with a prototype to
resolve netsplits in Mnesia.

I pushed the code to Github:

http://github.com/uwiger/unsplit

I think there are still some nasty corner cases that need
resolving, and I'm not convinced that they can be solved
without some added functionality in Mnesia.

Approach:

An application, unsplit, starts and subscribes to mnesia
system events.
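
As a rough illustration of the subscription step (not the
actual unsplit_server code), a gen_server could subscribe in
init/1 and pick out the inconsistency event in handle_info/2.
The module name unsplit_sketch_server and the helper
classify/1 are made up for this sketch; mnesia:subscribe/1 and
the {mnesia_system_event, Event} message format are the
documented Mnesia API.

```erlang
-module(unsplit_sketch_server).
-behaviour(gen_server).

-export([start_link/0, classify/1]).
-export([init/1, handle_call/3, handle_cast/2, handle_info/2]).

start_link() ->
    gen_server:start_link({local, ?MODULE}, ?MODULE, [], []).

init([]) ->
    %% Mnesia must already be running on this node.
    {ok, _Node} = mnesia:subscribe(system),
    {ok, []}.

handle_call(_Request, _From, State) ->
    {reply, ok, State}.

handle_cast(_Msg, State) ->
    {noreply, State}.

%% System events arrive as {mnesia_system_event, Event}.
handle_info({mnesia_system_event, Event}, State) ->
    case classify(Event) of
        {inconsistent, Context, Node} ->
            %% A real server would start the resolution here.
            error_logger:info_msg("inconsistent database (~p) with ~p~n",
                                  [Context, Node]);
        ignore ->
            ok
    end,
    {noreply, State};
handle_info(_Other, State) ->
    {noreply, State}.

%% Hypothetical helper: separates the event we care about
%% from all other system events.
classify({inconsistent_database, Context, Node}) ->
    {inconsistent, Context, Node};
classify(_Event) ->
    ignore.
```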

When the unsplit_server receives an event signaling that
the database is inconsistent ({inconsistent_database, Context, Node}),
it enters a critical section (using global:trans/3) and tries to
resolve the inconsistency. The critical section is needed since
both sides will detect the condition at roughly the same time*.
Note that at this time, the two erlang nodes are in contact, but
mnesia has aborted the attempt to merge the two nodes, so each
mnesia instance considers the other side down.

* I have on occasion seen only one printout of the "partitioned
network" message, and once or twice the condition has gone
undetected when I disconnected and reconnected very quickly.
This bothers me, but I have not been able to reproduce or
analyse it.

When entering the critical section, it checks whether the other
node is in mnesia's 'running_db_nodes' list. If so, the other
side has already resolved the inconsistency and nothing more
needs to be done. Otherwise, it checks which tables have copies
on both nodes and proceeds to merge them.
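
The check described above could be sketched roughly as
follows. This is an assumption-laden outline, not the code in
the repository: the module name, the lock name
unsplit_sketch_lock, the MergeFun argument and the exclusion
of the schema table are all invented here, and the step that
connects the two mnesia instances is left out. The calls to
global:trans/3, mnesia:system_info/1 and mnesia:table_info/2
are the documented ones.

```erlang
-module(unsplit_sketch_resolve).
-export([maybe_resolve/2, shared_tables/4]).

%% MergeFun is a fun(Tab) supplied by the caller.
%% Returns already_resolved | {merged, Tabs} | aborted.
maybe_resolve(Other, MergeFun) ->
    LockId = {unsplit_sketch_lock, self()},
    global:trans(LockId,
                 fun() -> resolve(Other, MergeFun) end,
                 [node(), Other]).

resolve(Other, MergeFun) ->
    case lists:member(Other, mnesia:system_info(running_db_nodes)) of
        true ->
            %% The other side got here first.
            already_resolved;
        false ->
            %% Assumption: the schema table is left out of the merge.
            Tabs = mnesia:system_info(tables) -- [schema],
            Shared = shared_tables(Tabs, node(), Other,
                                   fun copy_holders/1),
            lists:foreach(MergeFun, Shared),
            {merged, Shared}
    end.

%% All nodes holding a copy of Tab, whatever the storage type.
copy_holders(Tab) ->
    lists:append([mnesia:table_info(Tab, Kind)
                  || Kind <- [ram_copies, disc_copies,
                              disc_only_copies]]).

%% Tables for which both NodeA and NodeB hold a copy.
shared_tables(Tabs, NodeA, NodeB, HoldersFun) ->
    [T || T <- Tabs, on_both(HoldersFun(T), NodeA, NodeB)].

on_both(Holders, NodeA, NodeB) ->
    lists:member(NodeA, Holders) andalso lists:member(NodeB, Holders).
```

shared_tables/4 takes the lookup as a fun so that it can be
tried out without a running mnesia.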

Table locks are taken on all affected tables, but the actual
reading and writing of data are dirty. The reasons for this are:

1. Obviously, locks must be taken, since we don't want others
    to write to the tables during the process.
2. When reading the data, we have to read the two separate copies.
    A transactional read will not allow this, so we have to read
    the remote copy through rpc:call(), which means the read has
    to be dirty.
3. I was hoping to have a proxy process on the other side that
    locks its copy, then does the reading and writing before
    we connect the two mnesia nodes. However, mnesia didn't
    like this at all, when told to connect the nodes and commit.

The actual connection of the two sides is accomplished using
the (undocumented) mnesia_monitor:connect_nodes([OtherNode]).
Currently, I do this before I lock the tables; I believe this
creates a small race condition, but I haven't found any other
way to do it.

For each table, an 'unsplit_method' can be defined. The
default is {unsplit_lib, no_action, []}, which is
probably seldom a good choice. I have written a method
called {unsplit_lib, last_modified, []}, which is equivalent
to {unsplit_lib, last_version, [modified]}. It simply compares
the value of the 'modified' attribute of each pair of records
with the same key (this particular method won't work for
bags) and picks the latest one - aborting if the objects
are different, but the 'modified' values are the same.
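
The comparison rule itself is small enough to show as a pure
function. This is only a sketch of the rule as described
above; the module name, pick_latest/3 and its return values
are invented for illustration and do not reflect the actual
callback signature in unsplit_lib. Pos is the tuple position
of the attribute to compare, e.g. #item.modified for a record
-record(item, {key, value, modified}).

```erlang
-module(unsplit_sketch_pick).
-export([pick_latest/3]).

%% Identical objects: nothing to resolve.
pick_latest(_Pos, Same, Same) ->
    {ok, Same};
%% Otherwise keep the object with the larger attribute value.
pick_latest(Pos, A, B) ->
    Ta = element(Pos, A),
    Tb = element(Pos, B),
    if
        Ta > Tb -> {ok, A};
        Tb > Ta -> {ok, B};
        %% Different objects, same 'modified' value: give up.
        true    -> {error, {conflict, A, B}}
    end.
```

For example, pick_latest(4, {item, k, v1, 10}, {item, k, v2, 20})
picks the second object, while two different objects that both
carry 10 in position 4 yield an error tuple.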

For more sophisticated callbacks, I suggest scavenging
the riak source code for Lamport clocks, Merkle trees
etc. Obviously, you also need to plan the record
representation, e.g. if you want to be able to hold
multiple versions of the same object, like riak does.
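
For reference, the core of a Lamport clock is tiny. The sketch
below is generic textbook logic, not taken from riak, and the
module and function names are made up. A counter like this
could serve as the version attribute compared by a method such
as last_version. Note that a plain Lamport clock only orders
versions; it cannot tell that two updates were concurrent.

```erlang
-module(unsplit_sketch_clock).
-export([new/0, tick/1, merge/2]).

%% A fresh clock.
new() -> 0.

%% Advance the clock before each local update.
tick(Clock) -> Clock + 1.

%% On seeing a remote clock value, jump past both values.
merge(Local, Remote) -> erlang:max(Local, Remote) + 1.
```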

Some test commands are in the file commands.txt.

I don't think the kernel setting 'dist_auto_connect once'
(i.e. starting erl with -kernel dist_auto_connect once)
should be needed, but it sure makes things easier during
testing.

Everything is still very rough around the edges, but I would
like some feedback before continuing - not least from the
Mnesia maintainers.

BR,
Ulf W
-- 
Ulf Wiger
CTO, Erlang Solutions Ltd, formerly Erlang Training & Consulting Ltd
http://www.erlang-solutions.com