[erlang-bugs] Mnesia table load problem

Thu Oct 17 10:36:07 CEST 2013

Hi,

I have found a strange, very easy to reproduce problem with Mnesia table
loading:
- Take 3 nodes and a disc_copies table replicated on them.
- Stop the first node (important to stop the _alphabetically_ first node)
- Let the remaining two nodes write to the table (transaction/dirty
context doesn't matter)
- Kill the remaining two nodes at the same time (e.g. "pkill beam")
- Restart all 3 nodes

At this point I would expect the changes made after the first node's stop
to be present in the database (durability). However, Mnesia decides to
load the table from the alphabetically first node, which happens to have
an obviously outdated copy, and replicate it on the rest of the cluster.

The problem is in mnesia_controller:orphan_tables/5:

    %% We're last up and the other nodes have not
    %% loaded the table. Lets load it if we are
    %% the smallest node.
    case lists:min(DiscCopyHolders) of
        Min when Min == node() ->

This algorithm simply doesn't rule out DiscCopyHolders that we know cannot
have the latest copy of the table, because some other node has seen them
going down.

I ran into this problem on R16B, but according to the git history,
these lines haven't changed since at least R13B03.

I was thinking about writing a patch too, but it turns out to be a tricky
one. Seeing a mnesia_down message defines a partial ordering between the
nodes. So it would make sense to look for the greatest elements of this
set and load the table from one of them. If there's only one such element
(e.g. a node that saw all other nodes with disc_copies going down) the
choice is trivial (in fact, this scenario already works well in Mnesia).
But if there are several such nodes (none of which saw the others going
down), we must make a decision (e.g. picking the smallest node).
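
To illustrate the idea, here is a rough sketch of the selection rule. It is not a patch against mnesia_controller: the module name, the function names and the {Observer, DownNode} representation of "Observer saw DownNode going down" are all made up for this example, and it assumes that this information is still available when the choice is made.

```erlang
-module(orphan_choice_sketch).
-export([candidates/2, choose/2]).

%% Holders:  nodes holding a disc_copies replica of the table.
%% SeenDown: hypothetical list of {Observer, DownNode} pairs, meaning
%%           Observer received a mnesia_down for DownNode.

%% Holders that no other holder has seen going down, i.e. the
%% greatest elements of the partial order.
candidates(Holders, SeenDown) ->
    Stale = [Down || {Observer, Down} <- SeenDown,
                     lists:member(Observer, Holders)],
    [N || N <- Holders, not lists:member(N, Stale)].

%% Pick the smallest candidate as the tie-breaker. If every holder
%% was seen going down by another holder, no safe choice exists.
choose(Holders, SeenDown) ->
    case candidates(Holders, SeenDown) of
        [] -> undecided;
        Fresh -> {ok, lists:min(Fresh)}
    end.
```

For the scenario at the top of this mail, with holders a@h, b@h and c@h where b@h and c@h both saw a@h going down, choose/2 returns {ok, b@h}, whereas lists:min/1 on the holders alone returns a@h.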

The problem is that the mnesia_down messages are currently discarded by
mnesia_recovery once a node rejoins the cluster, and this happens before
running the orphan_tables checks. Furthermore, for correct behaviour we
would have to track on a per-table basis whether a node has received the
latest copy of the data. Suppose A is stopped first, then B and C. If we
restart A and B, they cannot load table X, which is distributed on all
three nodes, but they can load table Y, which is not replicated to C. If
we then stop B and then A, B still has a fresher copy of table X than A,
but for table Y the copy on A is the latest.
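
The per-table bookkeeping could look roughly like the sketch below. Again, everything here is hypothetical: the event log, its {down, ...} and {synced, ...} entries and the function names do not exist in Mnesia, and I assume a "down" entry is only recorded for tables that the observer had loaded at that time.

```erlang
-module(table_freshness_sketch).
-export([seen_down/1, load_from/3]).

%% Hypothetical log entries, oldest first:
%%   {down, Tab, Observer, DownNode}  Observer saw DownNode going down
%%                                    while Tab was loaded on Observer
%%   {synced, Tab, Node}              Node received the latest copy of Tab

%% Reduce the log to the {Tab, Observer, DownNode} facts still valid.
seen_down(Log) ->
    lists:foldl(fun apply_event/2, [], Log).

apply_event({down, Tab, Observer, Down}, Facts) ->
    [{Tab, Observer, Down} | Facts];
apply_event({synced, Tab, Node}, Facts) ->
    %% A fresh copy cancels older "Node went down" facts for this table.
    [F || {T, _, D} = F <- Facts, not (T =:= Tab andalso D =:= Node)].

%% Choose a node to load Tab from: the smallest holder that nobody
%% has seen going down since it last had the latest copy.
load_from(Tab, Holders, Log) ->
    Stale = [D || {T, _, D} <- seen_down(Log), T =:= Tab],
    case [N || N <- Holders, not lists:member(N, Stale)] of
        [] -> undecided;
        Fresh -> {ok, lists:min(Fresh)}
    end.
```

With the example above encoded as [{down,x,b@h,a@h}, {down,x,c@h,a@h}, {down,y,b@h,a@h}, {synced,y,a@h}, {down,y,a@h,b@h}], load_from(x, [a@h,b@h,c@h], Log) gives {ok, b@h} and load_from(y, [a@h,b@h], Log) gives {ok, a@h}.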

Implementing this logic is not a trivial fix for the problem. It might
even introduce new logged events, or affect the inconsistency detection.
So I would like to hear your opinion about the problem or any other  
solution proposed before attempting to write any code. (I already have a  
test case for reproducing the issue, if you are interested in it.)

Regards,
Daniel