[erlang-questions] Mnesia does not detect netsplit

Joseph Norton <>
Thu Sep 29 11:03:52 CEST 2011


FYI.  I posted a suggestion on the mailing list for a network partition detector application. 

http://erlang.org/pipermail/erlang-questions/2011-August/060702.html

If you have any questions, please send them to me off-list.

thanks,

Joseph Norton




On Sep 29, 2011, at 5:55 PM, Jonas Boberg wrote:

> Hi,
> 
> We found a case where mnesia does not detect a netsplit.
> 
> Let's say we are running two mnesia nodes, A and B:
> At startup, node A can't connect to node B (specified in the mnesia
> config parameter extra_db_nodes). In this case node B is actually
> running, but because of a temporary network issue, or node B being
> heavily loaded, net_kernel:connect fails. When node A and B eventually
> are connected (for example due to a non-mnesia process sending a
> message between the nodes), mnesia does not detect the split, and the
> two isles continue to run separately.
> 
> Note that when we say that mnesia does not detect the netsplit, we
> mean that mnesia does not generate any 'inconsistent_database' event.
> 
> How to reproduce.
> * In this example we simulate a network problem (net_kernel:connect
> failure) by having the two nodes use different cookies.
> ------------------
> $ erl -name  -mnesia schema_location ram -mnesia
> extra_db_nodes "['']" -setcookie a
> ()1> application:start(mnesia),
> mnesia:subscribe(system), mnesia:create_table(my_table, []).
> $ erl -name  -mnesia schema_location ram -mnesia
> extra_db_nodes "['']" -setcookie b
> ()1> application:start(mnesia),
> mnesia:subscribe(system), mnesia:create_table(my_other_table, []).
> %% Connect nodes
> ()2> erlang:set_cookie(node(), b),
> net_kernel:connect('').
> ()3> nodes().
> ['']
> ()4> mnesia:info().
> ...
> running db nodes   = ['']
> stopped db nodes   = ['']
> ...
> 
> ------------------
> Expected behaviour: subscriber gets an 'inconsistent_database' event
> Actual behaviour: subscriber does not get any event.
> 
> Compare to this case, where mnesia correctly detects an inconsistent database:
> ------------------
> $ erl -name  -mnesia schema_location ram -mnesia
> extra_db_nodes "['']" -setcookie a
> ()1> application:start(mnesia),
> mnesia:subscribe(system), mnesia:create_table(my_table, []).
> $ erl -name   -mnesia schema_location ram
> -mnesia extra_db_nodes "['']" -setcookie a
> ()1> application:start(mnesia),
> mnesia:subscribe(system), mnesia:create_table(my_other_table, []).
> ()2> net_kernel:disconnect('').
> ()3> net_kernel:connect('').
> ()4> flush().
> Shell got {mnesia_system_event,{mnesia_down,''}}
> Shell got {mnesia_system_event,
>             {inconsistent_database,running_partitioned_network,
>                 ''}}
> ------------------
> 
> We found that the mnesia code that detects netsplits is in
> mnesia_monitor. It uses net_kernel:monitor_nodes(true), to monitor
> nodes going up and down. In the problematic scenario, when the
> mnesia_monitor gets the 'nodeup', it seems to ignore it since a
> node down has not been seen.
> Trace:
> (<0.53.0>) call
> mnesia_monitor:handle_info({nodeup,''},{state,<0.52.0>,[],[],true,[],undefined,[]})
> (<0.53.0>) call mnesia_recover:has_mnesia_down('')
> (<0.53.0>) returned from mnesia_recover:has_mnesia_down/1 -> false
> 
> Does anyone have an idea about how we could work around this issue? If
> we detected the split ourselves, is there any way we could get
> mnesia to reconnect the nodes?
> 
> Regards
> Jonas
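
On the "detect the split ourselves" question above, here is a minimal
sketch of an application-level check. It is not taken from mnesia or from
the partition detector proposal linked at the top. The module name
split_check, the 5 second delay, and the choice of extra_db_nodes as the
set of nodes we expect to share a database with are all assumptions made
for illustration. It also assumes mnesia is already running on the local
node. The process listens for 'nodeup' itself and reports a node that is
connected and runs mnesia but is missing from running_db_nodes.

```erlang
-module(split_check).
-export([start/0, unmerged/2]).

%% Pure helper: expected db nodes that mnesia is not running with.
unmerged(Expected, Running) ->
    [N || N <- Expected, not lists:member(N, Running)].

%% Spawn a watcher that subscribes to nodeup/nodedown messages.
start() ->
    spawn(fun() ->
                  ok = net_kernel:monitor_nodes(true),
                  loop()
          end).

loop() ->
    receive
        {nodeup, Node} ->
            %% Assumption: wait 5 s so mnesia can connect on its own first.
            erlang:send_after(5000, self(), {check, Node}),
            loop();
        {check, Node} ->
            check(Node),
            loop();
        _Other ->
            %% nodedown and anything else is ignored in this sketch.
            loop()
    end.

check(Node) ->
    %% Assumption: extra_db_nodes lists the nodes we expect to merge with.
    Expected = mnesia:system_info(extra_db_nodes),
    Running = mnesia:system_info(running_db_nodes),
    case lists:member(Node, unmerged(Expected, Running)) of
        true ->
            %% Only report if mnesia is actually running on the other side.
            case rpc:call(Node, mnesia, system_info, [is_running]) of
                yes ->
                    error_logger:warning_msg(
                      "suspected mnesia split with ~p~n", [Node]);
                _ ->
                    ok
            end;
        false ->
            ok
    end.
```

This only reports the condition. What to do next is a separate decision.
The mnesia documentation for 'inconsistent_database' mentions restarting
mnesia on one side, optionally after mnesia:set_master_nodes/1, but whether
that is acceptable depends on which island's data may be discarded.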



