[erlang-questions] why might mnesia:start() hang?

Wed Oct 17 21:58:55 CEST 2007

On Wed, Oct 17, 2007 at 11:32:14AM +0200, Hakan Mattsson wrote:
> 
> There may be several causes for this to happen:
> 
> - It may be the case that some other application has
>   encountered a deadlock in its startup. This may for
>   example occur if that application is invoking functions
>   in the 'application' API during its startup. It may also
>   occur if a process dies during the application
>   startup. Then its supervisor will not restart the
>   process until it has started all its children. 
> 
> - It could also be that it is Mnesia that refuses to
>   start. This may happen if the system first crashes
>   during the critical phase in transaction commit and one
>   of the other nodes does not come up again. Then Mnesia
>   will by default wait indefinitely for the other node to
>   be available before it finishes its own startup. See
>   the documentation about the Mnesia parameter
>   max_wait_for_decision for more info. If you set the
>   Mnesia debug level to at least 'verbose' (before you
>   start Mnesia) you will get a printout when this happens.

I have restarted the sick node with -mnesia debug true, and now I see (among
other things) in syslog output:

Oct 17 14:39:18 somebox.somedomain foo1: [ID 702911 local6.debug] Mnesia('foo_rel@REDACTED'): Restarting transaction {tid,33070,<0.95.0>}: in 17487ms {cyclic,'bar_rel@REDACTED',{schema,'______WHOLETABLE_____'},write,write,{tid,1,<4257.71.0>}}
Oct 17 14:39:31 somebox.somedomain foo1: [ID 702911 local6.debug] Mnesia('foo_rel@REDACTED'): Transaction {tid,33072,<0.118.0>} calling #Fun<mnesia_schema.33.14280620> with [] failed: 
Oct 17 14:39:31 somebox.somedomain foo1: [ID 702911 local6.debug]  {aborted,{no_exists,some_table}}
Oct 17 14:39:31 somebox.somedomain foo1: [ID 702911 local6.debug] Mnesia('foo_rel@REDACTED'): Transaction {tid,33071,<0.117.0>} calling #Fun<mnesia_schema.15.92354373> with [] failed: 
Oct 17 14:39:31 somebox.somedomain foo1: [ID 702911 local6.debug]  {aborted,{no_exists,some_table}}

I suspect that some_table (which is, er, was, a disc_copy table) was deleted on
the primary node while the secondary was down (it is still down).

At this point it would be sufficient to get mnesia running at all on the
primary node. I've tried erasing all notion of some_table from the schema,
but without luck:

(foo_rel@REDACTED)2> mnesia:delete_table(some_table).                 
{aborted,{no_exists,some_table}}
(foo_rel@REDACTED)1> mnesia:del_table_copy(schema,'foo_rel@REDACTED'). 
{aborted,{no_exists,some_table}}

Clearly the deletion of some_table has really upset mnesia, because I keep
seeing the following repeated over and over:

Oct 17 14:56:51 somebox.somedomain foo1: [ID 702911 local6.debug] Mnesia('foo_rel@REDACTED'): Restarting transaction {tid,33070,<0.95.0>}: in 31404ms {cyclic,'bar_rel@REDACTED',{schema,'______WHOLETABLE_____'},write,write,{tid,1,<4257.71.0>}}
Oct 17 14:57:23 somebox.somedomain foo1: [ID 702911 local6.debug] Mnesia('foo_rel@REDACTED'): Restarting transaction {tid,33070,<0.95.0>}: in 31412ms {cyclic,'bar_rel@REDACTED',{schema,'______WHOLETABLE_____'},write,write,{tid,1,<4257.71.0>}}
Oct 17 14:57:54 somebox.somedomain foo1: [ID 702911 local6.debug] Mnesia('foo_rel@REDACTED'): Restarting transaction {tid,33070,<0.95.0>}: in 32007ms {cyclic,'bar_rel@REDACTED',{schema,'______WHOLETABLE_____'},write,write,{tid,1,<4257.71.0>}}
Oct 17 14:58:26 somebox.somedomain foo1: [ID 702911 local6.debug] Mnesia('foo_rel@REDACTED'): Restarting transaction {tid,33070,<0.95.0>}: in 32441ms {cyclic,'bar_rel@REDACTED',{schema,'______WHOLETABLE_____'},write,write,{tid,1,<4257.71.0>}}
Oct 17 14:58:59 somebox.somedomain foo1: [ID 702911 local6.debug] Mnesia('foo_rel@REDACTED'): Restarting transaction {tid,33070,<0.95.0>}: in 32886ms {cyclic,'bar_rel@REDACTED',{schema,'______WHOLETABLE_____'},write,write,{tid,1,<4257.71.0>}}
Oct 17 14:59:07 somebox.somedomain foo1: [ID 702911 local6.debug] Mnesia('foo_rel@REDACTED'): Transaction log dump initiated by time_threshold: already_dumped
Oct 17 14:59:32 somebox.somedomain foo1: [ID 702911 local6.debug] Mnesia('foo_rel@REDACTED'): Restarting transaction {tid,33070,<0.95.0>}: in 33081ms {cyclic,'bar_rel@REDACTED',{schema,'______WHOLETABLE_____'},write,write,{tid,1,<4257.71.0>}}
Oct 17 15:00:05 somebox.somedomain foo1: [ID 702911 local6.debug] Mnesia('foo_rel@REDACTED'): Restarting transaction {tid,33070,<0.95.0>}: in 33459ms {cyclic,'bar_rel@REDACTED',{schema,'______WHOLETABLE_____'},write,write,{tid,1,<4257.71.0>}}
Oct 17 15:00:38 somebox.somedomain foo1: [ID 702911 local6.debug] Mnesia('foo_rel@REDACTED'): Restarting transaction {tid,33070,<0.95.0>}: in 33699ms {cyclic,'bar_rel@REDACTED',{schema,'______WHOLETABLE_____'},write,write,{tid,1,<4257.71.0>}}
Oct 17 15:01:12 somebox.somedomain foo1: [ID 702911 local6.debug] Mnesia('foo_rel@REDACTED'): Restarting transaction {tid,33070,<0.95.0>}: in 33966ms {cyclic,'bar_rel@REDACTED',{schema,'______WHOLETABLE_____'},write,write,{tid,1,<4257.71.0>}}
Oct 17 15:01:46 somebox.somedomain foo1: [ID 702911 local6.debug] Mnesia('foo_rel@REDACTED'): Restarting transaction {tid,33070,<0.95.0>}: in 34583ms {cyclic,'bar_rel@REDACTED',{schema,'______WHOLETABLE_____'},write,write,{tid,1,<4257.71.0>}}
Oct 17 15:02:07 somebox.somedomain foo1: [ID 702911 local6.debug] Mnesia('foo_rel@REDACTED'): Transaction log dump initiated by time_threshold: already_dumped
Oct 17 15:02:20 somebox.somedomain foo1: [ID 702911 local6.debug] Mnesia('foo_rel@REDACTED'): Restarting transaction {tid,33070,<0.95.0>}: in 34827ms {cyclic,'bar_rel@REDACTED',{schema,'______WHOLETABLE_____'},write,write,{tid,1,<4257.71.0>}}

I have not yet tried to restart the node w/ -mnesia_wait_for_decision Timeout--
I was hoping to gain a better understanding of the current situation first.

I think at this point it would be sufficient to get mnesia to forget all 
about some_table. :-)

-Rick

> On Tue, 16 Oct 2007, Rick Pettit wrote:
> 
> RP> Date: Tue, 16 Oct 2007 23:05:36 -0500
> RP> From: Rick Pettit <rpettit@REDACTED>
> RP> To: erlang-questions@REDACTED
> RP> Subject: [erlang-questions] why might mnesia:start() hang?
> RP> 
> RP> I seem to have encountered a situation in which I am
> RP> unable to start mnesia.  Attempts to start mnesia (via
> RP> mnesia:start/0) hang the erlang shell.
> RP> 
> RP> In the scenario below there are 2 physical servers,
> RP> each running an instance of the foo_rel and
> RP> bar_rel. The second physical server,
> RP> someother.somedomain, has been halted prior to starting
> RP> the nodes somebox.somedomain.
> RP> 
> RP> The foo_rel instances contain disc_copy tables--bar_rel
> RP> instances contain ram_copies only.
> RP> 
> RP> (foo_rel@REDACTED)1> application:which_applications().
> RP> [{sasl,"SASL  CXC 138 11","2.1.5.1"},
> RP>  {stdlib,"ERTS  CXC 138 10","1.14.5"},
> RP>  {kernel,"ERTS  CXC 138 10","2.11.5"}]
> RP> 
> RP> 
> RP>    NOTE: there are other applications in this release which *should* be running
> RP>          but are not, almost certainly due to the fact that mnesia is refusing
> RP>          to start
> RP> 
> RP> 
> RP> (foo_rel@REDACTED)2> mnesia:info().
> RP> ===> System info in version "4.3.5", debug level = none <===
> RP> opt_disc. Directory "/u1/otp/db/foo_rel" is used.
> RP> use fallback at restart = false
> RP> running db nodes   = ['foo_rel@REDACTED']
> RP> stopped db nodes   = ['foo_rel@REDACTED','bar_rel@REDACTED','bar_rel@REDACTED'] 
> RP> ok
> RP> 
> RP> 
> RP> (foo_rel@REDACTED)3> mnesia:stop().
> RP> stopped
> RP> 
> RP> 
> RP> (foo_rel@REDACTED)4> mnesia:start().
> RP> ...shell hangs forever...
> RP> 
> RP> 
> RP> Shell back into the node, try again:
> RP> 
> RP> 
> RP> (foo_rel@REDACTED)1> mnesia:info().
> RP> ===> System info in version "4.3.5", debug level = none <===
> RP> opt_disc. Directory "/u1/otp/db/foo_rel" is used.
> RP> use fallback at restart = false
> RP> running db nodes   = ['foo_rel@REDACTED']
> RP> stopped db nodes   = ['foo_rel@REDACTED','bar_rel@REDACTED','bar_rel@REDACTED'] 
> RP> ok
> RP> 
> RP> 
> RP> (foo_rel@REDACTED)2> application:which_applications().
> RP> [{sasl,"SASL  CXC 138 11","2.1.5.1"},
> RP>  {stdlib,"ERTS  CXC 138 10","1.14.5"},
> RP>  {kernel,"ERTS  CXC 138 10","2.11.5"}]
> RP> 
> RP> 
> RP> (foo_rel@REDACTED)3> mnesia:start().
> RP> ...hangs forever...
> RP> 
> RP> 
> RP> ======
> RP> 
> RP> I cannot seem to figure out:
> RP> 
> RP>   1) why mnesia refuses to start
> RP> 
> RP>   2) why mnesia:start() hangs forever at the shell
> RP>     (vs. return an error, etc)
> RP> 
> RP> Any applications requiring mnesia tables do a
> RP> mnesia:wait_for_tables/2 on them.
> RP> 
> RP> A special process performs a mnesia:force_load_table/1,
> RP> if necessary (e.g. when wait_for_tables/2 times out).
> RP> 
> RP> Unfortunately, this code doesn't get a chance to run if
> RP> mnesia itself refuses to start (in many previous test
> RP> runs the releases started--in some cases the default
> RP> table load algorithm worked just fine, and in other
> RP> failure scenarios the force_table_load was
> RP> necessary--but the system always manged to start until
> RP> now).
> RP> 
> RP> Surely I must just be short on coffee (or sleep) or
> RP> both. Any help would be greatly appreciated.
> RP> 
> RP> -Rick
> RP> _______________________________________________
> RP> erlang-questions mailing list
> RP> erlang-questions@REDACTED
> RP> http://www.erlang.org/mailman/listinfo/erlang-questions