[erlang-questions] Detecting 'inconsistent_database' and other Mnesia events

Scott Lystig Fritchie fritchie@REDACTED
Thu Jan 11 18:22:39 CET 2007


>>>>> "uw" == Ulf Wiger <ulf@REDACTED> writes:

uw> We approached it differently. There are other things to worry
uw> about besides inconsistencies. Sometimes, the database may not
uw> come up at all.

Yup, I agree.  However ... the customer I'm working for would not be
excessively forgiving if I didn't try to avoid a race condition like
the one in Mnesia event subscription.  {shrug}

Many thanks to Serge for sleuthing the term used (eventually) by
mnesia_event:report_error/2.  Definitely absurd, but avoids the race
condition.

I've encountered the situation that Ulf describes.  Fortunately, not
in the field.  :-)

uw> It's possible that a
uw> generic decision support framework could be made out of what we've
uw> done, but you shouldn't hold your breath waiting for us to do that
uw> (you know - if it works, don't touch it, and all that.)

Our solution involves less code and more human intervention:

   * Using -kernel dist_auto_connect once, so a node that drops out
     is not reconnected automatically.

   * Only one mechanism for reconnecting nodes: a human runs the
     init.d restart script to restart the entire application on a
     node.

   * On app startup, mnesia:wait_for_tables/2 is used with a 10
     second timeout to wait for a known small table (with a higher
     'load_order' priority).  If it can't be loaded in that amount of
     time, an alarm is set and an alert message is logged to identify
     the missing nodes.  That table is replicated similarly to all
     other tables, so (I hope) failing to load that table means
     possible trouble with any other table.

   * Bundled a utility to make it easy for Ops technicians to run
     mnesia:set_master_nodes/1 if it's necessary.
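
A minimal sketch of the startup check and the ops helper described in
the list above, not the actual code.  The module name, the table name
'canary', the alarm id, and both function names are hypothetical;
only the mnesia, error_logger, and alarm_handler calls are real, and
alarm_handler assumes SASL is running.

```erlang
-module(startup_check).
-export([wait_for_canary/0, force_master/1]).

%% Hypothetical name of the small table created with a higher
%% load_order than the rest, so Mnesia tries to load it first.
-define(CANARY_TABLE, canary).
-define(TIMEOUT_MS, 10000).

%% Wait up to 10 seconds for the canary table.  On timeout, log the
%% nodes that are in the schema but not currently running, and raise
%% an alarm.
wait_for_canary() ->
    case mnesia:wait_for_tables([?CANARY_TABLE], ?TIMEOUT_MS) of
        ok ->
            ok;
        {timeout, BadTabs} ->
            Known   = mnesia:system_info(db_nodes),
            Running = mnesia:system_info(running_db_nodes),
            Missing = Known -- Running,
            error_logger:error_msg(
              "Mnesia tables ~p not loaded after ~p ms; "
              "missing nodes: ~p~n",
              [BadTabs, ?TIMEOUT_MS, Missing]),
            %% Hypothetical alarm id; requires SASL's alarm_handler.
            alarm_handler:set_alarm({mnesia_tables_unavailable, Missing}),
            {error, {timeout, BadTabs, Missing}};
        {error, Reason} ->
            error_logger:error_msg(
              "mnesia:wait_for_tables failed: ~p~n", [Reason]),
            {error, Reason}
    end.

%% Thin wrapper an Ops technician could call from a script or remote
%% shell.  Nodes is the list of nodes whose table copies should be
%% treated as authoritative the next time tables are loaded.
force_master(Nodes) when is_list(Nodes) ->
    mnesia:set_master_nodes(Nodes).
```

In this sketch wait_for_canary/0 would be called from the
application's start code after mnesia has been started, and
force_master/1 would be run by hand before restarting a node whose
tables are suspect.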

Until someone wants to pay me overtime to reinvent Ericsson's decision
support wheel, the above has had a positive-enough reaction with
customers and their network ops staffs.

-Scott


