[erlang-questions] why might mnesia:start() hang?

Wed Oct 17 22:30:00 CEST 2007

On Wed, Oct 17, 2007 at 02:58:55PM -0500, Rick Pettit wrote:
> On Wed, Oct 17, 2007 at 11:32:14AM +0200, Hakan Mattsson wrote:
> > 
> > There may be several causes for this to happen:
> > 
> > - It may be the case that some other application has
> >   encountered a deadlock in its startup. This may for
> >   example occur if that application is invoking functions
> >   in the 'application' API during its startup. It may also
> >   occur if a process dies during the application
> >   startup. Then its supervisor will not restart the
> >   process until it has started all its children. 
> > 
> > - It could also be that it is Mnesia that refuses to
> >   start. This may happen if the system first crashes
> >   during the critical phase in transaction commit and one
> >   of the other nodes does not come up again. Then Mnesia
> >   will by default wait indefinitely for the other node to
> >   be available before it finishes its own startup. See
> >   the documentation about the Mnesia parameter
> >   max_wait_for_decision for more info. If you set the
> >   Mnesia debug level to at least 'verbose' (before you
> >   start Mnesia) you will get a printout when this happens.
> 
> I have restarted the sick node with -mnesia debug true, and now I see (among
> other things) in syslog output:
> 
> Oct 17 14:39:18 somebox.somedomain foo1: [ID 702911 local6.debug] Mnesia('foo_rel@REDACTED'): Restarting transaction {tid,33070,<0.95.0>}: in 17487ms {cyclic,'bar_rel@REDACTED',{schema,'______WHOLETABLE_____'},write,write,{tid,1,<4257.71.0>}}
> Oct 17 14:39:31 somebox.somedomain foo1: [ID 702911 local6.debug] Mnesia('foo_rel@REDACTED'): Transaction {tid,33072,<0.118.0>} calling #Fun<mnesia_schema.33.14280620> with [] failed: 
> Oct 17 14:39:31 somebox.somedomain foo1: [ID 702911 local6.debug]  {aborted,{no_exists,some_table}}
> Oct 17 14:39:31 somebox.somedomain foo1: [ID 702911 local6.debug] Mnesia('foo_rel@REDACTED'): Transaction {tid,33071,<0.117.0>} calling #Fun<mnesia_schema.15.92354373> with [] failed: 
> Oct 17 14:39:31 somebox.somedomain foo1: [ID 702911 local6.debug]  {aborted,{no_exists,some_table}}
> 
> 
> I suspect that some_table (which is, er, was, a disc_copy table) was deleted on
> the primary node while the secondary was down (it is still down).
> 
> At this point it would be sufficient to get mnesia running at all on the
> primary node. I've tried erasing all notion of some_table from the schema,
> but without luck:
> 
> (foo_rel@REDACTED)2> mnesia:delete_table(some_table).                 
> {aborted,{no_exists,some_table}}
> (foo_rel@REDACTED)1> mnesia:del_table_copy(schema,'foo_rel@REDACTED'). 
> {aborted,{no_exists,some_table}}

Here's some additional information that might help (e.g. I see some_table as
a member of the local_tables list--that's probably not good if mnesia doesn't
think the table exists anymore :-)

(foo_rel@REDACTED)5> mnesia:table_info(schema,all).
[{access_mode,read_write},
 {active_replicas,['foo_rel@REDACTED']},
 {arity,3},
 {attributes,[table,cstruct]},
 {checkpoints,[]},
 {commit_work,[{subscribers,[<0.54.0>],[]}]},
 {cookie,{{1187,632480,715351},'foo_rel@REDACTED'}},
 {cstruct,{cstruct,schema,
                   set,
                   ['bar_rel@REDACTED','bar_rel@REDACTED'],
                   ['foo_rel@REDACTED','foo_rel@REDACTED'],
                   [],
                   0,
                   read_write,
                   [],
                   [],
                   false,
                   schema,
                   [table,cstruct],
                   [],
                   [],
                   {{1187,632480,...},'foo_rel@REDACTED'},
                   {{34,...},{...}}}},
 {disc_copies,['foo_rel@REDACTED','foo_rel@REDACTED']},
 {disc_only_copies,[]},
 {frag_properties,[]},
 {index,[]},
 {load_by_force,false},
 {load_node,'foo_rel@REDACTED'},
 {load_order,0},
 {load_reason,initial},
 {local_content,false},
 {local_tables,[schema,
                some_table,   <==== This is the table mnesia says doesn't exist
                another_table,
                yet_another_table|...]},
 {master_nodes,[]},
 {memory,3943},
 {ram_copies,['bar_rel@REDACTED','bar_rel@REDACTED']},
 {record_name,schema},
 {record_validation,{schema,3,set}},
 {type,set},
 {size,30},
 {snmp,[]},
 {storage_type,disc_copies},
 {subscribers,...},
 {...}|...]

Every time I try to delete the table, mnesia says it doesn't exist:

(foo_rel@REDACTED)10> mnesia:delete_table(some_table).
{aborted,{no_exists,some_table}}

And I get the following in the log (each time I run the above command):

Oct 17 15:30:50 somebox.somedomain foo1: [ID 702911 local6.debug] Mnesia('foo_rel@REDACTED'): Transaction {tid,33075,<0.146.0>} calling #Fun<mnesia_schema.11.13572368> with [] failed: 
Oct 17 15:30:50 somebox.somedomain foo1: [ID 702911 local6.debug]  {aborted,{no_exists,some_table}}

Yet I am able to retrieve info on the table:

(foo_rel@REDACTED)11> mnesia:table_info(some_table,all).
[{access_mode,read_write},
 {active_replicas,[]},
 {arity,4},
 {attributes,[job,key,val]},
 {checkpoints,[]},
 {commit_work,[]},
 {cookie,{{1187,638478,963126},'foo_rel@REDACTED'}},
 {cstruct,{cstruct,some_table,
                   bag,
                   [],
                   ['foo_rel@REDACTED','foo_rel@REDACTED'],
                   [],
                   1,
                   read_write,
                   [],
                   [],
                   false,
                   some_table,
                   [app_job,key,val],
                   [],
                   [],
                   {{1187,638478,...},'foo_rel@REDACTED'},
                   {{3,...},{...}}}},
 {disc_copies,['foo_rel@REDACTED','foo_rel@REDACTED']},
 {disc_only_copies,[]},
 {frag_properties,[]},
 {index,[]},
 {load_by_force,false},
 {load_node,unknown},
 {load_order,1},
 {load_reason,unknown},
 {local_content,false},
 {master_nodes,[]},
 {memory,undefined},
 {ram_copies,[]},
 {record_name,some_table},
 {record_validation,{some_table,4,bag}},
 {type,bag},
 {size,undefined},
 {snmp,[]},
 {storage_type,disc_copies},
 {subscribers,[]},
 {user_properties,...},
 {...}|...]

===

I see that load_node is unknown--probably due to the fact that the other disk
replica is not active (since someotherbox.somedomain is still halted, and for
purposes of this test I cannot bring it back until I restore service to the
primary node).

Again, I have not yet tried to restart with:

  -mnesia max_wait_for_decision Timeout

I figured I would gather as much info as possible, then save what may be the
best (and/or last :-) option left.

Thanks again for all your help,

-Rick

> Clearly the deletion of some_table has really upset mnesia, because I keep
> seeing the following repeated over and over:
> 
> Oct 17 14:56:51 somebox.somedomain foo1: [ID 702911 local6.debug] Mnesia('foo_rel@REDACTED'): Restarting transaction {tid,33070,<0.95.0>}: in 31404ms {cyclic,'bar_rel@REDACTED',{schema,'______WHOLETABLE_____'},write,write,{tid,1,<4257.71.0>}}
> Oct 17 14:57:23 somebox.somedomain foo1: [ID 702911 local6.debug] Mnesia('foo_rel@REDACTED'): Restarting transaction {tid,33070,<0.95.0>}: in 31412ms {cyclic,'bar_rel@REDACTED',{schema,'______WHOLETABLE_____'},write,write,{tid,1,<4257.71.0>}}
> Oct 17 14:57:54 somebox.somedomain foo1: [ID 702911 local6.debug] Mnesia('foo_rel@REDACTED'): Restarting transaction {tid,33070,<0.95.0>}: in 32007ms {cyclic,'bar_rel@REDACTED',{schema,'______WHOLETABLE_____'},write,write,{tid,1,<4257.71.0>}}
> Oct 17 14:58:26 somebox.somedomain foo1: [ID 702911 local6.debug] Mnesia('foo_rel@REDACTED'): Restarting transaction {tid,33070,<0.95.0>}: in 32441ms {cyclic,'bar_rel@REDACTED',{schema,'______WHOLETABLE_____'},write,write,{tid,1,<4257.71.0>}}
> Oct 17 14:58:59 somebox.somedomain foo1: [ID 702911 local6.debug] Mnesia('foo_rel@REDACTED'): Restarting transaction {tid,33070,<0.95.0>}: in 32886ms {cyclic,'bar_rel@REDACTED',{schema,'______WHOLETABLE_____'},write,write,{tid,1,<4257.71.0>}}
> Oct 17 14:59:07 somebox.somedomain foo1: [ID 702911 local6.debug] Mnesia('foo_rel@REDACTED'): Transaction log dump initiated by time_threshold: already_dumped
> Oct 17 15:00:05 somebox.somedomain foo1: [ID 702911 local6.debug] Mnesia('foo_rel@REDACTED'): Restarting transaction {tid,33070,<0.95.0>}: in 33459ms {cyclic,'bar_rel@REDACTED',{schema,'______WHOLETABLE_____'},write,write,{tid,1,<4257.71.0>}}

> > RP> Date: Tue, 16 Oct 2007 23:05:36 -0500
> > RP> From: Rick Pettit <rpettit@REDACTED>
> > RP> To: erlang-questions@REDACTED
> > RP> Subject: [erlang-questions] why might mnesia:start() hang?
> > RP> 
> > RP> I seem to have encountered a situation in which I am
> > RP> unable to start mnesia.  Attempts to start mnesia (via
> > RP> mnesia:start/0) hang the erlang shell.
> > RP> 
> > RP> In the scenario below there are 2 physical servers,
> > RP> each running an instance of the foo_rel and
> > RP> bar_rel. The second physical server,
> > RP> someother.somedomain, has been halted prior to starting
> > RP> the nodes somebox.somedomain.
> > RP> 
> > RP> The foo_rel instances contain disc_copy tables--bar_rel
> > RP> instances contain ram_copies only.
> > RP> 
> > RP> (foo_rel@REDACTED)1> application:which_applications().
> > RP> [{sasl,"SASL  CXC 138 11","2.1.5.1"},
> > RP>  {stdlib,"ERTS  CXC 138 10","1.14.5"},
> > RP>  {kernel,"ERTS  CXC 138 10","2.11.5"}]
> > RP> 
> > RP> 
> > RP>    NOTE: there are other applications in this release which *should* be running
> > RP>          but are not, almost certainly due to the fact that mnesia is refusing
> > RP>          to start
> > RP> 
> > RP> 
> > RP> (foo_rel@REDACTED)2> mnesia:info().
> > RP> ===> System info in version "4.3.5", debug level = none <===
> > RP> opt_disc. Directory "/u1/otp/db/foo_rel" is used.
> > RP> use fallback at restart = false
> > RP> running db nodes   = ['foo_rel@REDACTED']
> > RP> stopped db nodes   = ['foo_rel@REDACTED','bar_rel@REDACTED','bar_rel@REDACTED'] 
> > RP> ok
> > RP> 
> > RP> 
> > RP> (foo_rel@REDACTED)3> mnesia:stop().
> > RP> stopped
> > RP> 
> > RP> 
> > RP> (foo_rel@REDACTED)4> mnesia:start().
> > RP> ...shell hangs forever...
> > RP> 
> > RP> 
> > RP> Shell back into the node, try again:
> > RP> 
> > RP> 
> > RP> (foo_rel@REDACTED)1> mnesia:info().
> > RP> ===> System info in version "4.3.5", debug level = none <===
> > RP> opt_disc. Directory "/u1/otp/db/foo_rel" is used.
> > RP> use fallback at restart = false
> > RP> running db nodes   = ['foo_rel@REDACTED']
> > RP> stopped db nodes   = ['foo_rel@REDACTED','bar_rel@REDACTED','bar_rel@REDACTED'] 
> > RP> ok
> > RP> 
> > RP> 
> > RP> (foo_rel@REDACTED)2> application:which_applications().
> > RP> [{sasl,"SASL  CXC 138 11","2.1.5.1"},
> > RP>  {stdlib,"ERTS  CXC 138 10","1.14.5"},
> > RP>  {kernel,"ERTS  CXC 138 10","2.11.5"}]
> > RP> 
> > RP> 
> > RP> (foo_rel@REDACTED)3> mnesia:start().
> > RP> ...hangs forever...
> > RP> 
> > RP> 
> > RP> ======
> > RP> 
> > RP> I cannot seem to figure out:
> > RP> 
> > RP>   1) why mnesia refuses to start
> > RP> 
> > RP>   2) why mnesia:start() hangs forever at the shell
> > RP>     (vs. return an error, etc)
> > RP> 
> > RP> Any applications requiring mnesia tables do a
> > RP> mnesia:wait_for_tables/2 on them.
> > RP> 
> > RP> A special process performs a mnesia:force_load_table/1,
> > RP> if necessary (e.g. when wait_for_tables/2 times out).
> > RP> 
> > RP> Unfortunately, this code doesn't get a chance to run if
> > RP> mnesia itself refuses to start (in many previous test
> > RP> runs the releases started--in some cases the default
> > RP> table load algorithm worked just fine, and in other
> > RP> failure scenarios the force_table_load was
> > RP> necessary--but the system always manged to start until
> > RP> now).
> > RP> 
> > RP> Surely I must just be short on coffee (or sleep) or
> > RP> both. Any help would be greatly appreciated.
> > RP> 
> > RP> -Rick
> > RP> _______________________________________________
> > RP> erlang-questions mailing list
> > RP> erlang-questions@REDACTED
> > RP> http://www.erlang.org/mailman/listinfo/erlang-questions