<div dir="ltr">Ouch, looks like a missing test case in mnesia.<div><br></div><div>I don't really know how to solve it either.</div><div><br></div><div>I can say that mnesia prefers consistency over durability, but in this case it sounds like it fails</div>

<div>on both counts. And most of our customers prefer fast loading over consistency</div><div>between tables, i.e. they don't like to wait on stopped nodes.</div><div><br></div><div>If you have some ideas, a patch would be nice. But that code is pretty vulnerable to changes, and has been patched many times over the years.</div>

<div><br></div><div>The consistency problem might be the hardest to solve; maybe the conclusion is that if you want</div><div>a relational database, you should use one and not mnesia.</div><div><br></div><div>BR</div><div>

Dan</div></div><div class="gmail_extra"><br><br><div class="gmail_quote">On Thu, Oct 17, 2013 at 10:36 AM, Szoboszlay Dániel <span dir="ltr"><<a href="mailto:dszoboszlay@gmail.com" target="_blank">dszoboszlay@gmail.com</a>></span> wrote:<br>

<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">Hi,<br>

<br>

I have found a strange, very easy to reproduce problem with Mnesia table<br>

loading:<br>

- Take 3 nodes and a disc_copies table replicated on them.<br>

- Stop the first node (important to stop the _alphabetically_ first node)<br>

- Let the remaining two nodes write to the table (transaction/dirty<br>

context doesn't matter)<br>

- Kill the remaining two nodes at the same time (e.g. "pkill beam")<br>

- Restart all 3 nodes<br>

<br>

At this point I would expect the changes made after the first node's stop<br>

to be present in the database (durability). However, Mnesia decides to<br>

load the table from the alphabetically first node, which happens to have<br>

an obviously outdated copy, and replicate it on the rest of the cluster.<br>

<br>

The problem is in mnesia_controller:orphan_tables/5:<br>

<br>

1423 %% We're last up and the other nodes have not<br>

1424 %% loaded the table. Lets load it if we are<br>

1425 %% the smallest node.<br>

1426 case lists:min(DiscCopyHolders) of<br>

1427    Min when Min == node() -><br>

<br>

This algorithm simply doesn't rule out DiscCopyHolders that we know<br>

cannot have the latest copy of the table as someone has seen them going<br>

down.<br>

<br>

This problem occurred to me on R16B, but according to the git history,<br>

these lines haven't changed since at least R13B03.<br>

<br>

I was thinking about writing a patch too, but it turns out to be a tricky<br>

one. Seeing a mnesia_down message defines a partial ordering between the<br>

nodes. So it would make sense to look for the greatest elements of this<br>

set and load the table from one of them. If there's only one such element<br>

(e.g. a node that saw all other nodes with disc_copies going down) the<br>

choice is trivial (in fact, this scenario already works well in Mnesia).<br>

But if we have multiple equal nodes, we must make a decision (e.g. picking<br>

the smallest node).<br>
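<br>
A minimal sketch of the selection rule described above, not Mnesia code. It assumes the mnesia_down observations are still available as a list of {Observer, Down} pairs, which is not the case today (see the next paragraph). The module name orphan_pick, the function pick_loader/2 and the SawDown format are all hypothetical, and per-table tracking is left out.<br>
<br>

```erlang
-module(orphan_pick).
-export([pick_loader/2]).

%% Hypothetical helper, not part of Mnesia.
%% Holders: non-empty list of nodes holding a disc_copies replica.
%% SawDown: [{Observer, Down}] pairs, meaning Observer recorded a
%%          mnesia_down event for Down while Observer was still running.
pick_loader(Holders, SawDown) ->
    %% A holder is stale if some other holder saw it go down.
    Stale = [D || {O, D} <- SawDown,
                  O =/= D,
                  lists:member(O, Holders),
                  lists:member(D, Holders)],
    case [H || H <- Holders, not lists:member(H, Stale)] of
        [] ->
            %% Contradictory observations: fall back to the current rule.
            lists:min(Holders);
        Maximal ->
            %% One or more holders that nobody saw going down:
            %% break ties by picking the smallest node name.
            lists:min(Maximal)
    end.
```

For the scenario in the steps above, with nodes a, b and c where b and c both saw a going down, pick_loader([a, b, c], [{b, a}, {c, a}]) selects b, whereas plain lists:min/1 would select a.<br>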

<br>

The problem is that the mnesia_down messages are currently discarded by<br>

mnesia_recovery once a node rejoins the cluster. And this happens before<br>

running the orphan_tables checks. Furthermore, for correct behaviour we<br>

would have to track on a per-table basis whether a node has received the<br>

latest copy of the data. Consider A is stopped first, then B and C. If we<br>

restart A and B, they cannot load table X distributed on all three nodes,<br>

but they can load table Y that is not replicated to C. If we stop B then<br>

A, then regarding table X, B still has a fresher copy than A, but regarding<br>

table Y the copy of A is the latest.<br>

<br>

Implementing this logic is not a trivial fix for the problem. It might<br>

even introduce new logged events, or affect the inconsistency detection.<br>

So I would like to hear your opinion about the problem or any other solution proposed before attempting to write any code. (I already have a test case for reproducing the issue, if you are interested in it.)<br>

<br>

Regards,<br>

Daniel<br>

_______________________________________________<br>

erlang-bugs mailing list<br>

<a href="mailto:erlang-bugs@erlang.org" target="_blank">erlang-bugs@erlang.org</a><br>

<a href="http://erlang.org/mailman/listinfo/erlang-bugs" target="_blank">http://erlang.org/mailman/listinfo/erlang-bugs</a><br>

</blockquote></div><br></div>