[erlang-questions] why might mnesia:start() hang?

Rick Pettit rpettit@REDACTED
Wed Oct 17 16:49:01 CEST 2007


On Wed, Oct 17, 2007 at 11:32:14AM +0200, Hakan Mattsson wrote:
> 
> There may be several causes for this to happen:
> 
> - It may be the case that some other application has
>   encountered a deadlock in its startup. This may for
>   example occur if that application is invoking functions
>   in the 'application' API during its startup. It may also
>   occur if a process dies during the application
>   startup. Then its supervisor will not restart the
>   process until it has started all its children. 

I will check but don't believe this is my case.

> - It could also be that it is Mnesia that refuses to
>   start. This may happen if the system first crashes
>   during the critical phase in transaction commit and one
>   of the other nodes does not come up again. Then Mnesia
>   will by default wait indefinitely for the other node to
>   be available before it finishes its own startup.

This is what I suspect happened, just don't know the details (yet :-).

>   See
>   the documentation about the Mnesia parameter
>   max_wait_for_decision for more info.

I stumbled on this last night but have not yet gotten a chance to 
try it--thanks.
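
For reference, a minimal sketch of how max_wait_for_decision might be set
in a release's sys.config. The 60000 ms value is only an illustrative
assumption, not a recommendation. The default is the atom infinity, which
matches the indefinite wait described above. The same parameter can also be
passed on the command line as "-mnesia max_wait_for_decision 60000".

```erlang
%% sys.config fragment (sketch only; the timeout value is an assumption)
[
 {mnesia, [
   %% Default is infinity: Mnesia waits until the other nodes report the
   %% outcome of an unclear transaction. With an integer (milliseconds),
   %% Mnesia forces a decision after the timeout and continues startup.
   {max_wait_for_decision, 60000}
 ]}
].
```

Note that the Mnesia documentation warns that a forced decision can leave a
transaction committed on some nodes and aborted on others, so a finite
timeout trades consistency for the ability to start.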

> If you set the
>   Mnesia debug level to at least 'verbose' (before you
>   start Mnesia) you will get a printout when this happens.

I'll try that, too.
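
A small sketch of one way to raise the debug level before Mnesia starts,
again as a sys.config fragment. It assumes the node is booted with this
config file. The equivalent command-line form is "-mnesia debug verbose".

```erlang
%% sys.config fragment (sketch only)
[
 {mnesia, [
   %% Documented levels are none | verbose | debug | trace.
   %% verbose is the lowest level above the default, none.
   {debug, verbose}
 ]}
].
```

With this in place, mnesia:info() should report "debug level = verbose"
instead of "debug level = none".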

Thanks again for all your help,

-Rick

> On Tue, 16 Oct 2007, Rick Pettit wrote:
> 
> RP> Date: Tue, 16 Oct 2007 23:05:36 -0500
> RP> From: Rick Pettit <rpettit@REDACTED>
> RP> To: erlang-questions@REDACTED
> RP> Subject: [erlang-questions] why might mnesia:start() hang?
> RP> 
> RP> I seem to have encountered a situation in which I am
> RP> unable to start mnesia.  Attempts to start mnesia (via
> RP> mnesia:start/0) hang the erlang shell.
> RP> 
> RP> In the scenario below there are 2 physical servers,
> RP> each running an instance of foo_rel and
> RP> bar_rel. The second physical server,
> RP> someother.somedomain, has been halted prior to starting
> RP> the nodes on somebox.somedomain.
> RP> 
> RP> The foo_rel instances contain disc_copies tables--bar_rel
> RP> instances contain ram_copies only.
> RP> 
> RP> (foo_rel@REDACTED)1> application:which_applications().
> RP> [{sasl,"SASL  CXC 138 11","2.1.5.1"},
> RP>  {stdlib,"ERTS  CXC 138 10","1.14.5"},
> RP>  {kernel,"ERTS  CXC 138 10","2.11.5"}]
> RP> 
> RP> 
> RP>    NOTE: there are other applications in this release which *should* be running
> RP>          but are not, almost certainly due to the fact that mnesia is refusing
> RP>          to start
> RP> 
> RP> 
> RP> (foo_rel@REDACTED)2> mnesia:info().
> RP> ===> System info in version "4.3.5", debug level = none <===
> RP> opt_disc. Directory "/u1/otp/db/foo_rel" is used.
> RP> use fallback at restart = false
> RP> running db nodes   = ['foo_rel@REDACTED']
> RP> stopped db nodes   = ['foo_rel@REDACTED','bar_rel@REDACTED','bar_rel@REDACTED'] 
> RP> ok
> RP> 
> RP> 
> RP> (foo_rel@REDACTED)3> mnesia:stop().
> RP> stopped
> RP> 
> RP> 
> RP> (foo_rel@REDACTED)4> mnesia:start().
> RP> ...shell hangs forever...
> RP> 
> RP> 
> RP> Shell back into the node, try again:
> RP> 
> RP> 
> RP> (foo_rel@REDACTED)1> mnesia:info().
> RP> ===> System info in version "4.3.5", debug level = none <===
> RP> opt_disc. Directory "/u1/otp/db/foo_rel" is used.
> RP> use fallback at restart = false
> RP> running db nodes   = ['foo_rel@REDACTED']
> RP> stopped db nodes   = ['foo_rel@REDACTED','bar_rel@REDACTED','bar_rel@REDACTED'] 
> RP> ok
> RP> 
> RP> 
> RP> (foo_rel@REDACTED)2> application:which_applications().
> RP> [{sasl,"SASL  CXC 138 11","2.1.5.1"},
> RP>  {stdlib,"ERTS  CXC 138 10","1.14.5"},
> RP>  {kernel,"ERTS  CXC 138 10","2.11.5"}]
> RP> 
> RP> 
> RP> (foo_rel@REDACTED)3> mnesia:start().
> RP> ...hangs forever...
> RP> 
> RP> 
> RP> ======
> RP> 
> RP> I cannot seem to figure out:
> RP> 
> RP>   1) why mnesia refuses to start
> RP> 
> RP>   2) why mnesia:start() hangs forever at the shell
> RP>     (vs. returning an error, etc.)
> RP> 
> RP> Any applications requiring mnesia tables do a
> RP> mnesia:wait_for_tables/2 on them.
> RP> 
> RP> A special process performs a mnesia:force_load_table/1,
> RP> if necessary (e.g. when wait_for_tables/2 times out).
> RP> 
> RP> Unfortunately, this code doesn't get a chance to run if
> RP> mnesia itself refuses to start (in many previous test
> RP> runs the releases started--in some cases the default
> RP> table load algorithm worked just fine, and in other
> RP> failure scenarios the force_load_table was
> RP> necessary--but the system always managed to start until
> RP> now).
> RP> 
> RP> Surely I must just be short on coffee (or sleep) or
> RP> both. Any help would be greatly appreciated.
> RP> 
> RP> -Rick
> RP> _______________________________________________
> RP> erlang-questions mailing list
> RP> erlang-questions@REDACTED
> RP> http://www.erlang.org/mailman/listinfo/erlang-questions
