[erlang-questions] unsplit - resolving mnesia inconsistencies

Ulf Wiger ulf.wiger@REDACTED
Thu Feb 4 17:59:33 CET 2010


...no longer using an un-patched mnesia. :-/

http://github.com/uwiger/otp/tree/mnesia_merge

I made a fairly small change to mnesia_schema.erl
and mnesia_monitor.erl, so that I could pass a
wrapper to the schema_merge() function and give as
extra argument to it which tables besides 'schema'
that it should lock.

mnesia_controller:connect_nodes(
   [NodeB],
   fun(MergeF) ->
     case MergeF(TabsAndNodes) of
       {merged,_,_} = Res ->
           show_locks(NodeB),
           %% For now, assume that we have merged with the right
           %% node, and not with others that could also be
           %% consistent (mnesia gurus, how does this work?)
           io:fwrite("stitching: ~p~n", [TabMethods]),
           stitch_tabs(TabMethods, NodeB),
                  Res;
              Other ->
                  Other
     end
   end).

MergeF is the fun given by mnesia_schema:

merge_schema(UserFun) ->
     schema_transaction(fun() ->
          UserFun(fun(Arg) ->
             do_merge_schema(Arg)
         end) end).

Running this, I of course discovered that you still have to
use dirty updates of the separate copies, since replication
is not yet enabled for the nodes being merged.

Here's an extract from my small test run:

+ + + + +
inconsistency. Context = running_partitioned_network; Node = n2@REDACTED
have lock...
nodes_of(test) = [n1@REDACTED,n2@REDACTED]
Affected tabs = [test]
Methods = [{test,{unsplit_lib,last_modified,[]}}]

Held locks = {[[{{test,'______WHOLETABLE_____'},write,{tid,68,<0.117.0>}},
 
{{schema,'______WHOLETABLE_____'},write,{tid,68,<0.117.0>}}],
                [{{test,'______WHOLETABLE_____'},write,{tid,68,<0.117.0>}},
 
{{schema,'______WHOLETABLE_____'},write,{tid,68,<0.117.0>}}]],
               []}
stitching: [{test,{unsplit_lib,last_modified,[]}}]
do_stitch({test,{unsplit_lib,last_modified,[]}}, n2@REDACTED).
Calling unsplit_lib:last_modified(init, 
[test,[key,modified,value]])Starting merge of test ([key,modified,value])
  -> {ok,{test,3}}
Calling unsplit_lib:last_modified([{test,1,{1265,301600,551309},a}], 
[{test,1,{1265,301600,551309},a}], {test,3})
last_version_entry({[{test,1,{1265,301600,551309},a}],
                     [{test,1,{1265,301600,551309},a}]})
ModA = {1265,301600,551309}, ModB = {1265,301600,551309}
  -> {ok,[{write,{test,1,{1265,301600,551309},a}}],same,{test,3}}
+ + + + +

In other words, tables are locked on both nodes before
table deconflict begins.

I have not tried to prepare the mnesia_merge branch as
an official patch (I haven't run the test suites. Then
again, the mnesia test suites are not on github...)

I'm by no means done. :) I'm only testing a very trivial
scenario. What other scenarios would be really interesting
to try?

BR,
Ulf W

Ulf Wiger wrote:
> 
> I have had some modest success with a prototype to
> resolve netsplits in Mnesia.
> 
> I pushed the code to Github:
> 
> http://github.com/uwiger/unsplit
> 
> I think there are still some nasty corner cases that need
> resolving, and I'm not convinced that they can be solved
> without some added functionality in Mnesia.
> 
> Approach:
> 
> An application, unsplit, starts and subscribes to mnesia
> system events.
> 
> When the unsplit_server receives an event signaling that
> the database is inconsistent ({inconsistent_database, Context, Node}),
> it enters a critical section (using global:trans/3) and tries to
> resolve the inconsistency. The critical section is needed since
> both sides will detect the condition at roughly the same time*.
> Note that at this time, the two erlang nodes are in contact, but
> mnesia has aborted the attempt to merge the two nodes, so each
> mnesia instance considers the other side down.
> 
> * I have on occasion seen only one printout of the "partitioned
> network" message, and once or twice the contition has gone
> undetected if I disconnect and reconnect very quickly. This
> bothers me, but I have not been able to reproduce or analyse it.
> 
> When entering the critical section, it checks whether the other
> node is in mnesia's 'running_db_nodes' list. If so, the other
> side has already resolved the inconsistency and nothing more
> needs to be done. Otherwise, it checks which tables have copies
> on both nodes and proceeds to merge them.
> 
> Table locks are taken on all affected tables, but the actual
> reading and writing of data are dirty. The reasons for this are:
> 
> 1. Obviously, locks must be taken, since we don't want others
>    to write to the tables during the process.
> 2. When reading the data, we have to read the two separate copies.
>    The transaction read will not allow this, so we have to read
>    the remote copy through rpc:call(), which means it has to be
>    dirty.
> 3. I was hoping to have a proxy process on the other side that
>    locks its copy, then does the reading and writing before
>    we connect the two mnesia nodes. However, mnesia didn't
>    like this at all, when told to connect the nodes and commit.
> 
> The actual connect of the two sides is accomplished using
> (undocumented) mnesia_monitor:connect_nodes([OtherNode]).
> Currently, I do this before I lock the tables; I believe this
> creates a small race condition, but haven't found any other
> way to do it.
> 
> For each table, an 'unsplit_method' can be defined. The
> default is {unsplit_lib, no_action, []}, which is
> probably seldom a good choice. I have written a method
> called {unsplit_lib, last_modified, []}, which is equivalent
> to {unsplit_lib, last_version, [modified]}, simply comparing
> the value of the 'modified' attribute of each two records
> with the same key (this particular method won't work for
> bags) and picking the latest one - aborting if the objects
> are different, but the 'modified' value are the same.
> 
> For more sophisticated callbacks, I suggest scavenging
> the riak source code for Lamport clocks, Merkle trees
> etc. Obviously, you also need to plan the record
> representation, e.g. if you want to be able to hold
> multiple versions of the same object, like riak does.
> 
> Some test commands are in the file commands.txt.
> 
> I don't think the dist_auto_connect once setting should be
> needed, but it sure makes it easier during testing.
> 
> Everything is still very rough at the edges, but I would
> like some feedback before continuing - not least from the
> Mnesia maintainers.
> 
> BR,
> Ulf W


-- 
Ulf Wiger
CTO, Erlang Solutions Ltd, formerly Erlang Training & Consulting Ltd
http://www.erlang-solutions.com


More information about the erlang-questions mailing list