[erlang-questions] mnesia recovery

Igor Ribeiro Sucupira igorrs@REDACTED
Fri Jul 16 05:24:38 CEST 2010


On Thu, Jul 15, 2010 at 10:41 AM, Evans, Matthew <mevans@REDACTED> wrote:
> Hi,
>
> This is a rather convoluted question.
>
> We have a distributed system with disc copies/disc only copies of mnesia tables on nodes A and B.  Other nodes in the system C
> through M have RAM only copies of those tables.

I'm assuming all nodes have exactly the same tables and the same data
(including the schema). Is that the case? If it's not, could you
describe the pool in more detail?

> Ordinarily if node A fails and recovers shortly later we are fine since mnesia is smart enough to re-sync data back to node A from
> node B.
>
> We hit a situation yesterday where node A failed, some time later the whole distributed system was restarted but node B never
> recovered.

What does that mean? Is node B corrupted? Or is it just refusing to
start because the other nodes are down and B is not the most
up-to-date node? I don't see any other case for "never recovered" and
I'm assuming you have the former (corruption), since you said the
other nodes were restarted and that B has "good" data.
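
A rough way to tell the two cases apart is to start Mnesia on B and see
which tables never finish loading. The sketch below is only an
illustration: the module name mnesia_diag and the function
pending_tables/1 are made up for this mail, and the interpretation is a
heuristic, not a guarantee. Tables that stay pending on a node that
otherwise starts cleanly usually mean Mnesia is waiting for a node it
thinks has newer data; an error from mnesia:start/0 or errors in the
log point more towards damaged files.

```erlang
-module(mnesia_diag).
-export([pending_tables/1]).

%% Hypothetical helper: start Mnesia on the local node and return the
%% local tables that did not load within TimeoutMs milliseconds.
%% Returns [] if everything loaded, a list of table names otherwise,
%% or {error, Reason} if Mnesia could not start or the wait failed.
pending_tables(TimeoutMs) ->
    case mnesia:start() of
        ok ->
            Tabs = mnesia:system_info(local_tables),
            case mnesia:wait_for_tables(Tabs, TimeoutMs) of
                ok -> [];
                {timeout, Pending} -> Pending;
                {error, Reason} -> {error, Reason}
            end;
        {error, Reason} ->
            {error, {start_failed, Reason}}
    end.
```

For example, mnesia_diag:pending_tables(30000) run in a shell on node B
would give Mnesia 30 seconds before reporting what is still missing.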

> The logic is such that startup is effectively blocked since we know the "good" data is on node B.
>
> How to handle this in the field? If, for reasons beyond our control node B can not be recovered easily, I am wondering is there a
> way to get the data from node B to node A (I am assuming we can access the partition on node B)?

Assuming B is the most up-to-date node and has some corrupted tables,
you can copy the working files of those tables from some other node to
node B (yeah... they may be outdated, but there's not much to do in
this case) and then start node B. Everything should work fine.

If that's not your problem, maybe this function could help you, anyway:
http://erlang.org/doc/man/mnesia.html#force_load_table-1

I've used force_load_table/1 in situations where Mnesia was refusing
to load the table on some node because it believed its copy was not
current (but I knew it was).
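
Something along these lines, as a minimal sketch: wait for the tables
first and force-load only the ones that time out. The module name
mnesia_force and the function load_or_force/2 are invented for this
example; mnesia:wait_for_tables/2 and mnesia:force_load_table/1 are the
real calls. Keep in mind that forcing a load makes the local copy the
one in use, so anything written only on the other nodes in the meantime
may be lost.

```erlang
-module(mnesia_force).
-export([load_or_force/2]).

%% Hypothetical helper: wait up to TimeoutMs milliseconds for Tabs to
%% load, then force-load whatever is still pending. Mnesia must
%% already be started on this node.
%% Returns a list of {Tab, Result}, where Result is 'loaded' for
%% tables that loaded normally, or the return value of
%% mnesia:force_load_table/1 ('yes' on success) for the rest.
load_or_force(Tabs, TimeoutMs) ->
    case mnesia:wait_for_tables(Tabs, TimeoutMs) of
        ok ->
            [{T, loaded} || T <- Tabs];
        {timeout, Pending} ->
            Loaded = Tabs -- Pending,
            [{T, loaded} || T <- Loaded] ++
                [{T, mnesia:force_load_table(T)} || T <- Pending];
        {error, Reason} ->
            {error, Reason}
    end.
```

This could be wrapped in an escript so that it can be run in the field
without an Erlang shell.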

Good luck.
Igor.

> Would it be possible to:
>
> 1) Stop mnesia on all nodes
> 2) Copy the contents of the mnesia directory from node B to node A (minus the schema definitions)
> 3) Empty the mnesia directory from node B
> 4) Restart everything
>
> In this case I am hoping that mnesia would see node A as good and node B as having no data and would copy data to the new
> node B.
>
> Basically this situation needs to be resolved on the field by engineers with little or no Erlang skills. Certainly escripts could be
> written to help.
>
> Thanks
>
> Matt

