[erlang-questions] Mnesia does not detect netsplit
Joseph Norton
Thu Sep 29 11:03:52 CEST 2011
FYI, I posted a suggestion for a network partition detector application on this mailing list:
http://erlang.org/pipermail/erlang-questions/2011-August/060702.html
If you have any questions, please send them to me off list.
Thanks,
Joseph Norton
On Sep 29, 2011, at 5:55 PM, Jonas Boberg wrote:
> Hi,
>
> We found a case where mnesia does not detect a netsplit.
>
> Let's say we are running two mnesia nodes, A and B:
> At startup, node A can't connect to node B (specified in the mnesia
> config parameter extra_db_nodes). In this case node B is actually
> running, but because of a temporary network issue, or node B being
> heavily loaded, net_kernel:connect fails. When node A and B eventually
> are connected (for example due to a non-mnesia process sending a
> message between the nodes), mnesia does not detect the split, and the
> two isles continue to run separately.
>
> Note that when we say that mnesia does not detect the netsplit, we
> mean that mnesia does not generate any 'inconsistent_database' event.
>
> How to reproduce.
> * In this example we simulate a network problem (net_kernel:connect
> failure) by having the two nodes use different cookies.
> ------------------
> $ erl -name [...] -mnesia schema_location ram -mnesia
> extra_db_nodes "['[...]']" -setcookie a
> ([...])1> application:start(mnesia),
> mnesia:subscribe(system), mnesia:create_table(my_table, []).
> $ erl -name [...] -mnesia schema_location ram -mnesia
> extra_db_nodes "['[...]']" -setcookie b
> ([...])1> application:start(mnesia),
> mnesia:subscribe(system), mnesia:create_table(my_other_table, []).
> %% Connect nodes
> ([...])2> erlang:set_cookie(node(), b),
> net_kernel:connect('[...]').
> ([...])3> nodes().
> ['[...]']
> ([...])4> mnesia:info().
> ...
> running db nodes = ['[...]']
> stopped db nodes = ['[...]']
> ...
>
> ------------------
> Expected behaviour: the subscriber gets an 'inconsistent_database' event.
> Actual behaviour: the subscriber does not get any event.
>
> Compare this to the following case, where mnesia correctly detects an
> inconsistent database:
> ------------------
> $ erl -name [...] -mnesia schema_location ram -mnesia
> extra_db_nodes "['[...]']" -setcookie a
> ([...])1> application:start(mnesia),
> mnesia:subscribe(system), mnesia:create_table(my_table, []).
> $ erl -name [...] -mnesia schema_location ram
> -mnesia extra_db_nodes "['[...]']" -setcookie a
> ([...])1> application:start(mnesia),
> mnesia:subscribe(system), mnesia:create_table(my_other_table, []).
> ([...])2> net_kernel:disconnect('[...]').
> ([...])3> net_kernel:connect('[...]').
> ([...])4> flush().
> Shell got {mnesia_system_event,{mnesia_down,'[...]'}}
> Shell got {mnesia_system_event,
> {inconsistent_database,running_partitioned_network,
> '[...]'}}
>
> We found that the mnesia code that detects netsplits is in
> mnesia_monitor. It uses net_kernel:monitor_nodes(true) to monitor
> nodes going up and down. In the problematic scenario, when
> mnesia_monitor gets the 'nodeup', it seems to ignore it because no
> node down has been seen for that node.
> Trace:
> (<0.53.0>) call
> mnesia_monitor:handle_info({nodeup,'[...]'},{state,<0.52.0>,[],[],true,[],undefined,[]})
> (<0.53.0>) call mnesia_recover:has_mnesia_down('[...]')
> (<0.53.0>) returned from mnesia_recover:has_mnesia_down/1 -> false
>
> Does anyone have an idea about how we could work around this issue? If
> we were to detect the split ourselves, is there any way we could get
> mnesia to reconnect the nodes?
>
> Regards
> Jonas
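
The idea of detecting the split ourselves, raised in the question above, could be sketched roughly as follows. This is a minimal sketch under stated assumptions, not mnesia's own mechanism: the module name split_watch, the function looks_split/1 and the 5 second grace period are all hypothetical. It uses the same net_kernel:monitor_nodes(true) subscription that mnesia_monitor uses and, some time after a 'nodeup', reports a node that runs mnesia but is missing from the local running_db_nodes.

```erlang
%% Illustrative sketch only: the module name, function names and the
%% delay are assumptions, not part of mnesia or of the original post.
-module(split_watch).
-export([start/0, init/0, looks_split/1]).

-define(SETTLE_MS, 5000).  % assumed grace period for mnesia's own merge

start() ->
    spawn(?MODULE, init, []).

init() ->
    ok = net_kernel:monitor_nodes(true),
    loop().

loop() ->
    receive
        {nodeup, Node} ->
            %% Give mnesia a moment to connect on its own before checking.
            erlang:send_after(?SETTLE_MS, self(), {check, Node}),
            loop();
        {check, Node} ->
            case looks_split(Node) of
                true ->
                    error_logger:warning_msg(
                      "mnesia is running on ~p but is not merged with ~p~n",
                      [Node, node()]);
                false ->
                    ok
            end,
            loop();
        {nodedown, _Node} ->
            loop();
        stop ->
            ok
    end.

%% True when both nodes run mnesia and are connected at the distribution
%% level, yet Node is missing from the local running_db_nodes.
looks_split(Node) ->
    mnesia:system_info(is_running) =:= yes
        andalso lists:member(Node, nodes())
        andalso not lists:member(Node, mnesia:system_info(running_db_nodes))
        andalso rpc:call(Node, mnesia, system_info, [is_running]) =:= yes.
```

Whether the nodes can then be reconnected is a separate question. mnesia:change_config(extra_db_nodes, [Node]) asks mnesia to connect to additional nodes at runtime and may be worth trying from the point where the warning is logged, but that has not been verified against the scenario described here.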