[erlang-questions] why might mnesia:start() hang?

Wed Oct 17 23:55:10 CEST 2007

On Wed, Oct 17, 2007 at 03:30:00PM -0500, Rick Pettit wrote:
> On Wed, Oct 17, 2007 at 02:58:55PM -0500, Rick Pettit wrote:
> > On Wed, Oct 17, 2007 at 11:32:14AM +0200, Hakan Mattsson wrote:
> > > 
> > > There may be several causes for this to happen:
> > > 
> > > - It may be the case that some other application has
> > >   encountered a deadlock in its startup. This may for
> > >   example occur if that application is invoking functions
> > >   in the 'application' API during its startup. It may also
> > >   occur if a process dies during the application
> > >   startup. Then its supervisor will not restart the
> > >   process until it has started all its children. 
> > > 
> > > - It could also be that it is Mnesia that refuses to
> > >   start. This may happen if the system first crashes
> > >   during the critical phase in transaction commit and one
> > >   of the other nodes does not come up again. Then Mnesia
> > >   will by default wait indefinitely for the other node to
> > >   be available before it finishes its own startup. See
> > >   the documentation about the Mnesia parameter
> > >   max_wait_for_decision for more info. If you set the
> > >   Mnesia debug level to at least 'verbose' (before you
> > >   start Mnesia) you will get a printout when this happens.

[snip]

> > At this point it would be sufficient to get mnesia running at all on the
> > primary node. I've tried erasing all notion of some_table from the schema,
> > but without luck:
> > 
> > (foo_rel@REDACTED)2> mnesia:delete_table(some_table).                 
> > {aborted,{no_exists,some_table}}
> > (foo_rel@REDACTED)1> mnesia:del_table_copy(schema,'foo_rel@REDACTED'). 
> > {aborted,{no_exists,some_table}}
> 
> Here's some additional information that might help (e.g. I see some_table as
> a member of the local_tables list--that's probably not good if mnesia doesn't
> think the table exists anymore :-)

[snip]

Ok, I finally bit the bullet and restarted the primary node:

  foo_rel@REDACTED

with max_wait_for_decision set to ===> 10000    (i.e. 10 seconds)
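
For reference, a minimal sketch of how such a setting is usually supplied.
The file name sys.config and the choice to enable the 'verbose' debug level
at the same time are assumptions made for illustration; the parameter names
max_wait_for_decision and debug are the ones mentioned earlier in this
thread.

```erlang
%% sys.config (hypothetical file; adapt to however the release is configured)
%% Equivalent command-line form:
%%   erl -mnesia max_wait_for_decision 10000 -mnesia debug verbose
[
 {mnesia, [
   %% Give up waiting for other nodes to report the outcome of
   %% unclear transactions after 10000 ms (the default is infinity).
   {max_wait_for_decision, 10000},
   %% Print a message when Mnesia is waiting for a decision at startup.
   {debug, verbose}
 ]}
].
```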

I waited for the node to start up, watching the logs--this time after 10
seconds mnesia did jump into action and force a bunch of transactions
to complete, at which point the node was *almost* completely up (one
application, a distributed application, was not started for some reason).

Though mnesia started, certain requisite tables were still not loaded.
I have a distributed application which should have started on the primary
and force loaded all requisite tables (after first timing out in
mnesia:wait_for_tables/2 on those same tables)--but for some reason the
distributed application controller did not even try to start it.

When I finally shelled into the node and explicitly ran the force
load procedure (which the distributed application had always run
automatically in these instances in the past), the remaining applications came
to life (including the distributed application that formerly refused to
start). At this point the node was finally back in service, ready to
handle production work.

Are there any distributed application experts that might know why the
distributed application controller refused to start the distributed
application on the primary?

In previous test runs in which no schema modifications were hosed at
shutdown, the distributed application always did the "right thing", 
bouncing from primary to secondary node and back again as one or the
other node was halted--sometimes finding all the tables in good shape,
other times timing out in wait_for_tables/2 and triggering a 
force_load_table/1 on each--but this is the first time the application
refused to start at all (forcing me to do its work by hand).
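
For readers unfamiliar with the setup, a distributed application of this
kind is normally declared through kernel configuration parameters. The
fragment below is only a sketch: the application name my_dist_app, the host
names server_a and server_b, and all timeout values are hypothetical and are
not taken from the system discussed here.

```erlang
%% sys.config fragment (hypothetical names and values throughout)
[
 {kernel, [
   %% my_dist_app prefers the first node in the list; if that node goes
   %% down, the application is restarted on the next node after 5000 ms,
   %% and it is taken over again when the preferred node returns.
   {distributed, [{my_dist_app, 5000,
                   ['foo_rel@server_a', 'foo_rel@server_b']}]},
   %% At boot, wait up to 30000 ms for the other node before continuing.
   {sync_nodes_optional, ['foo_rel@server_a', 'foo_rel@server_b']},
   {sync_nodes_timeout, 30000}
 ]}
].
```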

In any event thank you once again for your help--I have managed to
get the primary node back in service without having to restart the
secondary.

Now I just need to get to the bottom of this distributed application
problem so that the system recovers from this on its own next time and
I can sleep in... :-)

-Rick

P.S. To be clear the system I'm working on here has previously recovered
     from the following failure scenario on its own (among many others):

     1) start servers A and B (with nodes running on each which replicate
        disc_copy tables, and a distributed application which bounces
        between A and B, preferring to run on A as primary)

     2) halt server B

     3) halt server A

     4) restart server A

        NOTE: nodes on server A start but the default mnesia table
              load algorithm is unable to load the tables, since it
              believes replicas on B might be more recent, but B is
              unreachable

        NOTE: this is where the distributed application enters the 
              picture--it starts, calls wait_for_tables/2 on all
              requisite tables, and on timeout calls force_load_table/1 
              for each table which could not be loaded by the default
              algorithm.

        NOTE: up until recently this always worked, but then I managed
              to hit a scenario in which a schema transaction was
              affected by the halt, such that mnesia itself would not
              start without -mnesia max_wait_for_decision 10000--with
              this addition mnesia managed to start but the distributed
              application did not, so the tables which required force
              loading were not force loaded and other dependent
              applications eventually gave up in their own wait_for_tables
              leaving the node in a "bad state" :-)
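
The wait-then-force-load step described in the notes above can be sketched
roughly as follows. This is not the actual code from the system in
question; the module name table_loader and the function ensure_loaded/2 are
made up for illustration. Only mnesia:wait_for_tables/2 and
mnesia:force_load_table/1 are real Mnesia calls.

```erlang
-module(table_loader).
-export([ensure_loaded/2]).

%% Wait up to Timeout milliseconds for Tabs to become accessible.
%% Any table that is still not loaded when the wait times out is
%% force loaded from the local replica.
ensure_loaded(Tabs, Timeout) ->
    case mnesia:wait_for_tables(Tabs, Timeout) of
        ok ->
            ok;
        {timeout, BadTabs} ->
            force_load(BadTabs);
        {error, Reason} ->
            {error, Reason}
    end.

%% Force load each table in turn, stopping at the first failure.
force_load([]) ->
    ok;
force_load([Tab | Rest]) ->
    case mnesia:force_load_table(Tab) of
        yes ->
            force_load(Rest);
        Error ->
            {error, {force_load_failed, Tab, Error}}
    end.
```

From a shell on the affected node this could be run as, for example,
table_loader:ensure_loaded([some_table], 10000). Keep in mind that force
loading accepts the local replica as is, so updates that exist only on the
unreachable node may be lost.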

P.P.S. Thanks for putting up with my ramblings as I try to work through
       this on too much coffee and too little sleep :-)