[erlang-questions] mnesia bug?

Mon Mar 3 14:00:14 CET 2008

It is harmless in this case indeed, however, in my setup I actually had 
two connected nodes (NodeA and NodeC) with disc_copies tables, and NodeB 
happened to be connected through an unreliable link with mnesia having a 
single ram_copies table (different from tables on NodeA and NodeC).  The 
end-result of a partitioned network between NodeA&NodeB and NodeC is 
that mnesia reports an error and stops replicating tables between NodeA 
and NodeB, even though node NodeC had nothing to do with NodeA and NodeB 
except for sharing the schema.  Recovering requires restarting all three 
nodes (or restarting the mnesia application on each of them), which is 
not a desirable property, as NodeA and NodeB shouldn't have been 
impacted by an unreliable network link to NodeC.
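
For reference, the error mnesia reports can be observed by subscribing to 
its system events.  The following is only a minimal sketch: the module 
name partition_log and the logging call are my own illustration, and it 
assumes mnesia is already running on the node where start/0 is called.

```erlang
-module(partition_log).
-export([start/0, classify/1]).

%% Spawn a process that logs partitioned-network events.
%% Assumes mnesia is already started on this node.
start() ->
    spawn(fun() ->
        {ok, _} = mnesia:subscribe(system),
        loop()
    end).

loop() ->
    receive
        Msg ->
            case classify(Msg) of
                {partitioned, Context, Node} ->
                    error_logger:warning_msg(
                      "mnesia partition (~p) with ~p~n", [Context, Node]);
                ignore ->
                    ok
            end,
            loop()
    end.

%% Pick out the event mnesia emits when it detects a partition.
classify({mnesia_system_event, {inconsistent_database, Context, Node}}) ->
    {partitioned, Context, Node};
classify(_) ->
    ignore.
```

The event names only the node mnesia considers inconsistent; it says 
nothing about which tables, if any, are shared with that node.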

Perhaps before reporting a network partition condition, the database 
could check if the schema is different on all nodes?

The only solution is to follow a variation of Uffe's advice (*) on 
recovering mnesia from a partitioned network: run NodeC with 
{dist_auto_connect, once}; upon detecting a nodedown event on NodeC, 
enable a user-level heartbeat; and upon receiving a response from NodeA 
or NodeB, restart NodeC.
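
A rough sketch of that watchdog on NodeC might look like the following.  
The module name partition_watchdog, the Probe fun and the 5 second retry 
interval are assumptions for illustration only; Probe stands in for 
whatever user-level ping is used, and it must not go through Erlang 
distribution, since {dist_auto_connect, once} prevents reconnecting.  
The node is assumed to be started with "erl -kernel dist_auto_connect once".

```erlang
-module(partition_watchdog).
-export([start/2, any_peer_alive/2]).

%% Peers: nodes to watch, e.g. ['a@host1', 'b@host2'] (hypothetical names).
%% Probe: fun(Node) -> boolean(), a user-level ping supplied by the caller.
start(Peers, Probe) ->
    spawn(fun() ->
        ok = net_kernel:monitor_nodes(true),
        connected(Peers, Probe)
    end).

%% Normal operation: wait for one of the watched peers to go down.
connected(Peers, Probe) ->
    receive
        {nodedown, Node} ->
            case lists:member(Node, Peers) of
                true  -> heartbeat(Peers, Probe);
                false -> connected(Peers, Probe)
            end;
        _Other ->
            connected(Peers, Probe)
    end.

%% Partitioned: probe the peers until one answers, then restart this node.
heartbeat(Peers, Probe) ->
    case any_peer_alive(Peers, Probe) of
        true ->
            init:restart();
        false ->
            timer:sleep(5000),
            heartbeat(Peers, Probe)
    end.

any_peer_alive(Peers, Probe) ->
    lists:any(Probe, Peers).
```

Restarting the whole node (rather than just mnesia) also resets the 
"once" bookkeeping, so the peers can connect to it again.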

This actually becomes even more complex if NodeC is behind a firewall 
that only allows connections to NodeC from NodeA or NodeB but not the 
reverse.  In this case NodeC has to be running with {dist_auto_connect, 
never}, and upon receiving a nodedown event has to shut down mnesia and 
turn on a UDP/TCP listener that would echo user-level pings coming from 
NodeA/NodeB.  When an echoed ping gets back to NodeA or NodeB, those 
nodes would do net_kernel:connect(NodeC).  The firewall would allow this 
connection through, and NodeC would detect a nodeup event and start 
the mnesia application.
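
The NodeC side of this could be sketched roughly as below.  The module 
name nodec_guard and the choice of a UDP echo on a caller-supplied port 
are assumptions for illustration.  For simplicity the sketch reacts to 
any nodedown/nodeup event, whereas a real setup would probably track 
NodeA and NodeB individually.

```erlang
-module(nodec_guard).
-export([start/1, echo_once/1]).

%% Port: UDP port for the user-level echo (hypothetical, caller's choice).
start(Port) ->
    spawn(fun() ->
        ok = net_kernel:monitor_nodes(true),
        up(Port)
    end).

%% Connected: on nodedown, stop mnesia and start echoing pings.
up(Port) ->
    receive
        {nodedown, _Node} ->
            mnesia:stop(),
            {ok, Socket} = gen_udp:open(Port, [binary, {active, false}]),
            down(Port, Socket);
        _Other ->
            up(Port)
    end.

%% Disconnected: echo pings until a peer connects back in.
down(Port, Socket) ->
    receive
        {nodeup, _Node} ->
            gen_udp:close(Socket),
            mnesia:start(),
            up(Port);
        _Other ->
            down(Port, Socket)
    after 0 ->
        echo_once(Socket),
        down(Port, Socket)
    end.

%% Wait up to one second for a datagram and send it back to its sender.
echo_once(Socket) ->
    case gen_udp:recv(Socket, 0, 1000) of
        {ok, {Addr, SrcPort, Packet}} ->
            gen_udp:send(Socket, Addr, SrcPort, Packet);
        {error, timeout} ->
            ok
    end.
```

NodeA and NodeB would send datagrams to that port periodically and call 
net_kernel:connect(NodeC) once one of them comes back.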

This sounds a bit more complicated than one would desire for running 
clusters of nodes with independent mnesia tables.

Serge

(*) search the list for "mnesia", "partitioned network".

Dan Gudmundsson wrote:
> Intentional (but harmless in your case), the schema table is still 
> shared between the nodes.
> 
> The partitioned network means that two or more nodes have been up
> while the network link between some of them has been down; it 
> doesn't check or care which tables reside where.
> 
> /Dan
> 
> Serge Aleynikov wrote:
>> I recently ran into this behavior of mnesia that seemed odd.
>>
>> 1. Mnesia is running on node A and node B, and node B has 
>> extra_db_nodes environment set to [A].
>> 2. Node A has a disc_copies table T.
>> 3. Node B has no replica of table T and accesses it remotely.
>> 4. Node A and node B start up and B calls
>>     mnesia:wait_for_tables([T], Timeout).
>> 5. At some point later network access between nodes A and B is lost 
>> and restored (e.g. via calls to net_kernel:disconnect/1, 
>> net_kernel:connect_node/1).
>>
>> 6. Mnesia on node A reports a *partitioned network* event.
>>
>> This seems strange as node B has no ram or disk copies of any table 
>> and node A should not be reporting this event as its tables are still 
>> consistent.
>>
>> Can anyone comment on whether it's a bug or intended design?
>>
>> Serge