[erlang-questions] Mnesia does not detect netsplit

Jonas Boberg <>
Thu Sep 29 10:55:02 CEST 2011


Hi,

We found a case where mnesia does not detect a netsplit.

Let's say we are running two mnesia nodes, A and B.
At startup, node A can't connect to node B (specified in the mnesia
config parameter extra_db_nodes). Node B is actually running, but
net_kernel:connect fails because of a temporary network issue or
because node B is heavily loaded. When nodes A and B are eventually
connected (for example because a non-mnesia process sends a message
between the nodes), mnesia does not detect the split, and the two
islands continue to run separately.

Note that when we say that mnesia does not detect the netsplit, we
mean that mnesia does not generate any 'inconsistent_database' event.
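
To make clear what we mean by "event": a subscriber along the following
lines is what we would expect to be notified. This is only a minimal
sketch. The module name split_listener and the function classify/1 are
made up for illustration; mnesia:subscribe(system) and the shape of the
inconsistent_database event are the ones shown in the shell output
further down. It assumes mnesia is already started on the node.

```erlang
-module(split_listener).
-export([start/0, classify/1]).

%% Spawn a process that subscribes to mnesia system events.
%% Assumption: mnesia is already running on this node.
start() ->
    spawn(fun() ->
                  {ok, _Node} = mnesia:subscribe(system),
                  loop()
          end).

%% Receive events forever and log the ones that report a split.
loop() ->
    receive
        Event ->
            case classify(Event) of
                {split, Context, Node} ->
                    error_logger:warning_msg(
                      "mnesia reports split: ~p with ~p~n",
                      [Context, Node]);
                other ->
                    ok
            end,
            loop()
    end.

%% Hypothetical helper: pick out inconsistent_database events.
classify({mnesia_system_event,
          {inconsistent_database, Context, Node}}) ->
    {split, Context, Node};
classify(_Other) ->
    other.
```

In the problematic scenario described here, such a process never sees
the {split, ...} case, because mnesia never sends the event.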

How to reproduce.
* In this example we simulate a network problem (net_kernel:connect
failure) by having the two nodes use different cookies.
------------------
$ erl -name  -mnesia schema_location ram -mnesia extra_db_nodes "['']" -setcookie a
()1> application:start(mnesia), mnesia:subscribe(system), mnesia:create_table(my_table, []).
$ erl -name  -mnesia schema_location ram -mnesia extra_db_nodes "['']" -setcookie b
()1> application:start(mnesia), mnesia:subscribe(system), mnesia:create_table(my_other_table, []).
%% Connect nodes
()2> erlang:set_cookie(node(), b), net_kernel:connect('').
()3> nodes().
['']
()4> mnesia:info().
...
running db nodes   = ['']
stopped db nodes   = ['']
...

------------------
Expected behaviour: subscriber gets an 'inconsistent_database' event.
Actual behaviour: subscriber does not get any event.

Compare this to the following case, where mnesia correctly detects an inconsistent database:
------------------
$ erl -name  -mnesia schema_location ram -mnesia extra_db_nodes "['']" -setcookie a
()1> application:start(mnesia), mnesia:subscribe(system), mnesia:create_table(my_table, []).
$ erl -name   -mnesia schema_location ram -mnesia extra_db_nodes "['']" -setcookie a
()1> application:start(mnesia), mnesia:subscribe(system), mnesia:create_table(my_other_table, []).
()2> net_kernel:disconnect('').
()3> net_kernel:connect('').
()4> flush().
Shell got {mnesia_system_event,{mnesia_down,''}}
Shell got {mnesia_system_event,
             {inconsistent_database,running_partitioned_network,
                 ''}}

We found that the mnesia code that detects netsplits is in
mnesia_monitor. It uses net_kernel:monitor_nodes(true) to monitor
nodes going up and down. In the problematic scenario, when
mnesia_monitor gets the 'nodeup', it seems to ignore it, because
mnesia_recover:has_mnesia_down/1 returns false: no mnesia_down has
been seen for that node.
Trace:
(<0.53.0>) call mnesia_monitor:handle_info({nodeup,''},{state,<0.52.0>,[],[],true,[],undefined,[]})
(<0.53.0>) call mnesia_recover:has_mnesia_down('')
(<0.53.0>) returned from mnesia_recover:has_mnesia_down/1 -> false
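
The condition we would want to detect ourselves can be sketched roughly
as follows. This is not mnesia code, only a minimal sketch: the module
name split_check and the function suspect_nodes are hypothetical, and it
assumes that any node listed in extra_db_nodes which is connected at the
distribution level should also show up in running_db_nodes. It only
uses nodes() and mnesia:system_info/1.

```erlang
-module(split_check).
-export([suspect_nodes/0, suspect_nodes/3]).

%% Nodes we are connected to and expect to share a database with,
%% but which mnesia does not report as running db nodes.
%% Assumption: mnesia is started on the local node.
suspect_nodes() ->
    suspect_nodes(nodes(),
                  mnesia:system_info(extra_db_nodes),
                  mnesia:system_info(running_db_nodes)).

%% Pure version, so the rule can be checked without a live cluster.
suspect_nodes(Connected, Expected, Running) ->
    [N || N <- Connected,
          lists:member(N, Expected),
          not lists:member(N, Running)].
```

Calling split_check:suspect_nodes() periodically, or from a process that
has itself called net_kernel:monitor_nodes(true), would return a
non-empty list in the scenario above. It does not repair anything.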

Does anyone have an idea about how we could work around this issue? If
we detected the split ourselves, is there any way we could get
mnesia to reconnect the nodes?

Regards
Jonas


