[erlang-bugs] Mnesia table load problem
Thu Oct 17 16:30:21 CEST 2013
Ouch, looks like a missing test case in Mnesia.
I don't really know how to solve it either.
I can say that Mnesia prefers consistency over durability, but in this case
it sounds like it fails on both counts. And most of our customers prefer
fast loading over consistency between tables, i.e. they don't like to wait
on stopped nodes.
If you have some ideas, a patch would be nice. But that code is pretty
sensitive to changes, and has been patched many times over the years.
The consistency problem might be the hardest to solve; maybe the conclusion
is that if you want a relational database, you should use one and not Mnesia.
On Thu, Oct 17, 2013 at 10:36 AM, Szoboszlay Dániel wrote:
> I have found a strange, very easy to reproduce problem with Mnesia table
> loading:
> - Take 3 nodes and a disc_copies table replicated on them.
> - Stop the first node (important to stop the _alphabetically_ first node)
> - Let the remaining two nodes write to the table (transaction/dirty
> context doesn't matter)
> - Kill the remaining two nodes at the same time (e.g. "pkill beam")
> - Restart all 3 nodes
> At this point I would expect the changes made after the first node's stop
> to be present in the database (durability). However, Mnesia decides to
> load the table from the alphabetically first node, which happens to have
> an obviously outdated copy, and replicate it on the rest of the cluster.
> The problem is in mnesia_controller:orphan_tables/5:
>     %% We're last up and the other nodes have not
>     %% loaded the table. Lets load it if we are
>     %% the smallest node.
>     case lists:min(DiscCopyHolders) of
>         Min when Min == node() ->
> This algorithm simply doesn't rule out DiscCopyHolders that we know
> cannot have the latest copy of the table, because someone has seen them
> going down.
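
A minimal sketch of the selection rule quoted above, written as a standalone model rather than the real mnesia_controller code. The module name, the function names and the `SawDown` list of `{Observer, DownNode}` pairs are hypothetical and only serve the illustration; node names are plain atoms.

```erlang
-module(min_holder_demo).
-export([current_choice/1, seen_down_by_peer/3]).

%% Models the quoted rule: the node with the smallest name loads the table.
%% No information about who saw whom going down is consulted.
current_choice(DiscCopyHolders) ->
    lists:min(DiscCopyHolders).

%% Hypothetical helper: true if some other holder recorded Node going down.
%% SawDown is an assumed list of {Observer, DownNode} pairs.
seen_down_by_peer(Node, DiscCopyHolders, SawDown) ->
    lists:any(fun({Observer, Down}) ->
                      Down =:= Node
                          andalso Observer =/= Node
                          andalso lists:member(Observer, DiscCopyHolders)
              end, SawDown).
```

With holders `[a, b, c]` and `SawDown = [{b, a}, {c, a}]` (b and c both saw a stop), `current_choice/1` still returns `a`, while `seen_down_by_peer(a, ...)` is true, which is the situation described in the report.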
> This problem occurred to me on R16B, but according to the git history,
> these lines haven't changed since at least R13B03.
> I was thinking about writing a patch too, but it turns out to be a tricky
> one. Seeing a mnesia_down message defines a partial ordering between the
> nodes. So it would make sense to look for the greatest elements of this
> set and load the table from one of them. If there's only one such element
> (e.g. a node that saw all other nodes with disc_copies going down) the
> choice is trivial (in fact, this scenario already works well in Mnesia).
> But if we have multiple equal nodes, we must make a decision (e.g. picking
> the smallest node).
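
The idea in the paragraph above can be sketched as a pure function, assuming the mnesia_down observations were still available as a list of `{Observer, DownNode}` pairs covering only the final shutdown sequence. This is not existing Mnesia code; the module and function names are made up for the illustration.

```erlang
-module(load_candidate).
-export([maximal/2, pick/2]).

%% Holders that no other holder saw going down: the maximal elements
%% of the "saw going down" ordering.
maximal(Holders, SawDown) ->
    [N || N <- Holders,
          not lists:any(fun({Observer, Down}) ->
                                Down =:= N
                                    andalso Observer =/= N
                                    andalso lists:member(Observer, Holders)
                        end, SawDown)].

%% Pick the node to load from. Ties between equally fresh nodes are
%% broken by the smallest name, as suggested in the text.
pick(Holders, SawDown) ->
    case maximal(Holders, SawDown) of
        [] -> {error, no_candidate};  % contradictory records, no safe choice
        Candidates -> {ok, lists:min(Candidates)}
    end.
```

For the scenario in the report, `pick([a, b, c], [{b, a}, {c, a}])` yields `{ok, b}`: a is excluded because b and c saw it stop, and b wins the tie against c by name.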
> The problem is that the mnesia_down messages are currently discarded by
> mnesia_recovery once a node rejoins the cluster. And this happens before
> running the orphan_tables checks. Furthermore, for correct behaviour we
> would have to track on a per-table basis whether a node has received the
> latest copy of the data. Consider A is stopped first, then B and C. If we
> restart A and B, they cannot load table X distributed on all three nodes,
> but they can load table Y that is not replicated to C. If we stop B then
> A, then regarding table X, B still has a fresher copy than A, but regarding
> table Y the copy of A is the latest.
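
The A/B/C example above can be replayed with a small per-table bookkeeping model. Everything here is an assumption made for illustration: the module, the map-based state, and the rule that a down event is recorded only for tables the observer currently has loaded and is cleared for a table once the node has loaded a fresh copy of it. The replay also assumes that B stopped before C.

```erlang
-module(per_table_down).
-export([new/0, record_down/4, mark_synced/3, fresher/4]).

%% State maps each table to a list of {Observer, DownNode} pairs.
new() -> #{}.

%% Observer saw Down stop; record it only for the tables Observer has loaded.
record_down(Observer, Down, LoadedTabs, State) ->
    lists:foldl(
      fun(Tab, Acc) ->
              Pairs = maps:get(Tab, Acc, []),
              Acc#{Tab => lists:usort([{Observer, Down} | Pairs])}
      end, State, LoadedTabs).

%% Node has loaded the latest copy of Tab, so it is no longer stale for it.
mark_synced(Node, Tab, State) ->
    Pairs = maps:get(Tab, State, []),
    State#{Tab => [P || {_, Down} = P <- Pairs, Down =/= Node]}.

%% True if N1 is known to hold a fresher copy of Tab than N2.
fresher(State, Tab, N1, N2) ->
    lists:member({N1, N2}, maps:get(Tab, State, [])).
```

Replaying the example: record `{b, a}` for x and y, `{c, a}` and `{c, b}` for x, call `mark_synced(a, y, ...)` when A and B restart and load Y, then record `{a, b}` for y only. Afterwards `fresher(S, x, b, a)` and `fresher(S, y, a, b)` both hold, matching the two conclusions in the text.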
> Implementing this logic is not a trivial fix for the problem. It might
> even introduce new logged events, or affect the inconsistency detection.
> So I would like to hear your opinion about the problem or any other
> solution proposed before attempting to write any code. (I already have a
> test case for reproducing the issue, if you are interested in it.)