mnesia recovery

Thu Jul 15 15:41:01 CEST 2010

Hi,

This is a rather convoluted question.

We have a distributed system with disc copies/disc only copies of mnesia tables on nodes A and B.  The other nodes in the system, C through M, have RAM-only copies of those tables.

Ordinarily, if node A fails and recovers shortly afterwards, we are fine, since mnesia is smart enough to re-sync data back to node A from node B.

We hit a situation yesterday where node A failed and, some time later, the whole distributed system was restarted, but node B never recovered.

The logic is such that startup is effectively blocked since we know the "good" data is on node B.

How should we handle this in the field? If, for reasons beyond our control, node B cannot be recovered easily, is there a way to get the data from node B to node A (I am assuming we can still access the partition on node B)?

Would it be possible to:

1) Stop mnesia on all nodes
2) Copy the contents of the mnesia directory from node B to node A (minus the schema definitions)
3) Empty the mnesia directory on node B
4) Restart everything

In this case I am hoping that mnesia would see node A as good and node B as having no data and would copy data to the new node B.

Basically, this situation needs to be resolved in the field by engineers with little or no Erlang skills. Certainly escripts could be written to help.
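
To make steps 2 and 3 concrete, here is a rough sketch of the kind of helper I have in mind. It is written in Python purely for illustration; an escript would do the same job. It assumes mnesia is already stopped on all nodes, and it assumes the schema lives in a file called schema.DAT inside the mnesia directory. The function names and the directory layout are hypothetical. Whether mnesia will actually accept the copied files on node A is exactly what I am asking about.

```python
import shutil
from pathlib import Path

# Assumption: the schema definitions live in a file named schema.DAT.
SCHEMA_FILES = {"schema.DAT"}


def copy_table_files(src_dir, dst_dir, skip=SCHEMA_FILES):
    """Step 2: copy every regular file from src_dir to dst_dir,
    except the files named in skip (the schema). Returns the copied names."""
    src, dst = Path(src_dir), Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    copied = []
    for entry in sorted(src.iterdir()):
        if entry.is_file() and entry.name not in skip:
            shutil.copy2(entry, dst / entry.name)
            copied.append(entry.name)
    return copied


def empty_directory(dir_path):
    """Step 3: remove everything inside dir_path but keep the directory."""
    for entry in Path(dir_path).iterdir():
        if entry.is_dir() and not entry.is_symlink():
            shutil.rmtree(entry)
        else:
            entry.unlink()
```

An engineer would run copy_table_files with node B's mnesia directory (as mounted from its partition) as the source and node A's mnesia directory as the destination, and then empty_directory on node B's directory before restarting everything. Taking a backup of both directories first would obviously be wise.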

Thanks

Matt