<div dir="ltr">Ouch, looks like a missing test case in mnesia.<div><br></div><div>I don't really know how to solve it either.</div><div><br></div><div>I can say that mnesia prefers consistency over durability, but in this case it sounds like it fails</div>
<div>on both counts. And most of our customers prefer fast loading over consistency</div><div>between tables, i.e. they don't like to wait on stopped nodes.</div><div><br></div><div>If you have some ideas, a patch would be nice. But that code is pretty vulnerable to changes, and has been patched many times over the years.</div>
<div><br></div><div>The consistency problem might be the hardest to solve; maybe the conclusion is that if you want</div><div>a relational database, you should use one and not mnesia.</div><div><br></div><div>BR</div><div>
Dan</div></div><div class="gmail_extra"><br><br><div class="gmail_quote">On Thu, Oct 17, 2013 at 10:36 AM, Szoboszlay Dániel <span dir="ltr"><<a href="mailto:dszoboszlay@gmail.com" target="_blank">dszoboszlay@gmail.com</a>></span> wrote:<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">Hi,<br>
<br>
I have found a strange, very easy to reproduce problem with Mnesia table<br>
loading:<br>
- Take 3 nodes and a disc_copies table replicated on them.<br>
- Stop the first node (important to stop the _alphabetically_ first node)<br>
- Let the remaining two nodes write to the table (transaction/dirty<br>
context doesn't matter)<br>
- Kill the remaining two nodes at the same time (e.g. "pkill beam")<br>
- Restart all 3 nodes<br>
<br>
At this point I would expect the changes made after the first node's stop<br>
to be present in the database (durability). However, Mnesia decides to<br>
load the table from the alphabetically first node, which happens to have<br>
an obviously outdated copy, and replicate it on the rest of the cluster.<br>
<br>
The problem is in mnesia_controller:orphan_tables/5:<br>
<br>
1423 %% We're last up and the other nodes have not<br>
1424 %% loaded the table. Lets load it if we are<br>
1425 %% the smallest node.<br>
1426 case lists:min(DiscCopyHolders) of<br>
1427 Min when Min == node() -><br>
<br>
This algorithm simply doesn't rule out DiscCopyHolders that we know<br>
cannot have the latest copy of the table, because someone has seen them going<br>
down.<br>
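To make the failure mode concrete, here is a toy model of that selection in Python (node names are made up; this is a model of the logic, not Mnesia code):

```python
# Toy model of the selection above. a@host was stopped first; b@host
# and c@host kept writing and were then killed simultaneously, so only
# a@host's copy is guaranteed to be outdated.
disc_copy_holders = ["a@host", "b@host", "c@host"]

# Equivalent of lists:min(DiscCopyHolders): pick the alphabetically
# smallest node, ignoring everything we know about the stop order.
loader = min(disc_copy_holders)
print(loader)  # a@host -- exactly the node with the stale copy
```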
<br>
I ran into this problem on R16B, but according to the git history,<br>
these lines haven't changed since at least R13B03.<br>
<br>
I was thinking about writing a patch too, but it turns out to be a tricky<br>
one. Seeing a mnesia_down message defines a partial ordering on the<br>
nodes. So it would make sense to look for the maximal elements of this<br>
ordering and load the table from one of them. If there's only one such element<br>
(e.g. a node that saw all other nodes with disc_copies going down) the<br>
choice is trivial (in fact, this scenario already works well in Mnesia).<br>
But if there are multiple, incomparable maximal nodes, we must make an<br>
arbitrary but deterministic decision (e.g. pick the smallest node).<br>
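A sketch of this selection in Python (the `saw_down` map is a made-up stand-in for the recorded mnesia_down events, not an actual Mnesia structure):

```python
def pick_loader(holders, saw_down):
    """Pick the node to load a table from.

    holders  -- the disc copy holders of the table
    saw_down -- made-up stand-in for the recorded mnesia_down events:
                maps a node to the set of nodes it saw going down,
                i.e. the nodes whose copy it is known to supersede.
    """
    holders = set(holders)
    # Rule out every holder that some other holder saw going down.
    stale = set()
    for node in holders:
        stale |= saw_down.get(node, set()) & holders
    maximal = holders - stale
    # One maximal element: the trivial case that already works today.
    # Several incomparable ones: decide deterministically, e.g. by the
    # smallest name. (Fall back to all holders if sightings are cyclic.)
    return min(maximal or holders)

# a@host stopped first; b@host and c@host were killed simultaneously,
# so neither saw the other going down.
print(pick_loader(["a@host", "b@host", "c@host"],
                  {"b@host": {"a@host"}, "c@host": {"a@host"}}))
# picks b@host, whereas taking the minimum of all holders would pick
# the stale a@host
```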
<br>
The problem is that the mnesia_down messages are currently discarded by<br>
mnesia_recover once a node rejoins the cluster. And this happens before<br>
running the orphan_tables checks. Furthermore, for correct behaviour we<br>
would have to track on a per-table basis whether a node has received the<br>
latest copy of the data. Consider A is stopped first, then B and C. If we<br>
restart A and B, they cannot load table X distributed on all three nodes,<br>
but they can load table Y that is not replicated to C. If we then stop B<br>
followed by A, then regarding table X, B still has a fresher copy than A,<br>
but regarding table Y the copy of A is the latest.<br>
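The per-table part just means running the same selection against each table's own copy holders. A Python sketch mirroring the example above (node and table names are made up; `pick_loader` is the hypothetical selection described in this mail, not a Mnesia function):

```python
def pick_loader(holders, saw_down):
    """Pick a loader: a maximal node of the "seen going down" order,
    smallest name as tie-breaker among incomparable candidates."""
    holders = set(holders)
    stale = set()
    for node in holders:
        stale |= saw_down.get(node, set()) & holders
    return min((holders - stale) or holders)

# Table x lives on all three nodes; table y is not replicated to c@host.
tables = {"x": ["a@host", "b@host", "c@host"],
          "y": ["a@host", "b@host"]}

# Stop order: a@host first, then b@host, then c@host.
saw_down = {"b@host": {"a@host"},
            "c@host": {"a@host", "b@host"}}

up_nodes = {"a@host", "b@host"}  # only a@host and b@host restarted

for table, holders in tables.items():
    loader = pick_loader(holders, saw_down)
    status = "load from" if loader in up_nodes else "wait for"
    print(f"{table}: {status} {loader}")
# x must wait for c@host; y can be loaded from b@host right away.
```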
<br>
Implementing this logic is not a trivial fix for the problem. It might<br>
even introduce new logged events, or affect the inconsistency detection.<br>
So I would like to hear your opinion about the problem or any other solution proposed before attempting to write any code. (I already have a test case for reproducing the issue, if you are interested in it.)<br>
<br>
Regards,<br>
Daniel<br>
_______________________________________________<br>
erlang-bugs mailing list<br>
<a href="mailto:erlang-bugs@erlang.org" target="_blank">erlang-bugs@erlang.org</a><br>
<a href="http://erlang.org/mailman/listinfo/erlang-bugs" target="_blank">http://erlang.org/mailman/listinfo/erlang-bugs</a><br>
</blockquote></div><br></div>