[erlang-questions] why might mnesia:start() hang?
Rick Pettit
rpettit@REDACTED
Wed Oct 17 22:30:00 CEST 2007
On Wed, Oct 17, 2007 at 02:58:55PM -0500, Rick Pettit wrote:
> On Wed, Oct 17, 2007 at 11:32:14AM +0200, Hakan Mattsson wrote:
> >
> > There may be several causes for this to happen:
> >
> > - It may be the case that some other application has
> > encountered a deadlock in its startup. This may for
> > example occur if that application is invoking functions
> > in the 'application' API during its startup. It may also
> > occur if a process dies during the application
> > startup. Then its supervisor will not restart the
> > process until it has started all its children.
> >
> > - It could also be that it is Mnesia that refuses to
> > start. This may happen if the system first crashes
> > during the critical phase in transaction commit and one
> > of the other nodes does not come up again. Then Mnesia
> > will by default wait indefinitely for the other node to
> > be available before it finishes its own startup. See
> > the documentation about the Mnesia parameter
> > max_wait_for_decision for more info. If you set the
> > Mnesia debug level to at least 'verbose' (before you
> > start Mnesia) you will get a printout when this happens.
>
> I have restarted the sick node with -mnesia debug true, and now I see (among
> other things) in syslog output:
>
> Oct 17 14:39:18 somebox.somedomain foo1: [ID 702911 local6.debug] Mnesia('foo_rel@REDACTED'): Restarting transaction {tid,33070,<0.95.0>}: in 17487ms {cyclic,'bar_rel@REDACTED',{schema,'______WHOLETABLE_____'},write,write,{tid,1,<4257.71.0>}}
> Oct 17 14:39:31 somebox.somedomain foo1: [ID 702911 local6.debug] Mnesia('foo_rel@REDACTED'): Transaction {tid,33072,<0.118.0>} calling #Fun<mnesia_schema.33.14280620> with [] failed:
> Oct 17 14:39:31 somebox.somedomain foo1: [ID 702911 local6.debug] {aborted,{no_exists,some_table}}
> Oct 17 14:39:31 somebox.somedomain foo1: [ID 702911 local6.debug] Mnesia('foo_rel@REDACTED'): Transaction {tid,33071,<0.117.0>} calling #Fun<mnesia_schema.15.92354373> with [] failed:
> Oct 17 14:39:31 somebox.somedomain foo1: [ID 702911 local6.debug] {aborted,{no_exists,some_table}}
>
>
> I suspect that some_table (which is, er, was, a disc_copy table) was deleted on
> the primary node while the secondary was down (it is still down).
>
> At this point it would be sufficient to get mnesia running at all on the
> primary node. I've tried erasing all notion of some_table from the schema,
> but without luck:
>
> (foo_rel@REDACTED)2> mnesia:delete_table(some_table).
> {aborted,{no_exists,some_table}}
> (foo_rel@REDACTED)1> mnesia:del_table_copy(schema,'foo_rel@REDACTED').
> {aborted,{no_exists,some_table}}
Here's some additional information that might help (e.g. I see some_table as
a member of the local_tables list--that's probably not good if mnesia doesn't
think the table exists anymore :-)
(foo_rel@REDACTED)5> mnesia:table_info(schema,all).
[{access_mode,read_write},
{active_replicas,['foo_rel@REDACTED']},
{arity,3},
{attributes,[table,cstruct]},
{checkpoints,[]},
{commit_work,[{subscribers,[<0.54.0>],[]}]},
{cookie,{{1187,632480,715351},'foo_rel@REDACTED'}},
{cstruct,{cstruct,schema,
set,
['bar_rel@REDACTED','bar_rel@REDACTED'],
['foo_rel@REDACTED','foo_rel@REDACTED'],
[],
0,
read_write,
[],
[],
false,
schema,
[table,cstruct],
[],
[],
{{1187,632480,...},'foo_rel@REDACTED'},
{{34,...},{...}}}},
{disc_copies,['foo_rel@REDACTED','foo_rel@REDACTED']},
{disc_only_copies,[]},
{frag_properties,[]},
{index,[]},
{load_by_force,false},
{load_node,'foo_rel@REDACTED'},
{load_order,0},
{load_reason,initial},
{local_content,false},
{local_tables,[schema,
some_table, <==== This is the table mnesia says doesn't exist
another_table,
yet_another_table|...]},
{master_nodes,[]},
{memory,3943},
{ram_copies,['bar_rel@REDACTED','bar_rel@REDACTED']},
{record_name,schema},
{record_validation,{schema,3,set}},
{type,set},
{size,30},
{snmp,[]},
{storage_type,disc_copies},
{subscribers,...},
{...}|...]
Every time I try to delete the table, mnesia says it doesn't exist:
(foo_rel@REDACTED)10> mnesia:delete_table(some_table).
{aborted,{no_exists,some_table}}
And I get the following in the log (each time I run the above command):
Oct 17 15:30:50 somebox.somedomain foo1: [ID 702911 local6.debug] Mnesia('foo_rel@REDACTED'): Transaction {tid,33075,<0.146.0>} calling #Fun<mnesia_schema.11.13572368> with [] failed:
Oct 17 15:30:50 somebox.somedomain foo1: [ID 702911 local6.debug] {aborted,{no_exists,some_table}}
Yet I am able to retrieve info on the table:
(foo_rel@REDACTED)11> mnesia:table_info(some_table,all).
[{access_mode,read_write},
{active_replicas,[]},
{arity,4},
{attributes,[job,key,val]},
{checkpoints,[]},
{commit_work,[]},
{cookie,{{1187,638478,963126},'foo_rel@REDACTED'}},
{cstruct,{cstruct,some_table,
bag,
[],
['foo_rel@REDACTED','foo_rel@REDACTED'],
[],
1,
read_write,
[],
[],
false,
some_table,
[app_job,key,val],
[],
[],
{{1187,638478,...},'foo_rel@REDACTED'},
{{3,...},{...}}}},
{disc_copies,['foo_rel@REDACTED','foo_rel@REDACTED']},
{disc_only_copies,[]},
{frag_properties,[]},
{index,[]},
{load_by_force,false},
{load_node,unknown},
{load_order,1},
{load_reason,unknown},
{local_content,false},
{master_nodes,[]},
{memory,undefined},
{ram_copies,[]},
{record_name,some_table},
{record_validation,{some_table,4,bag}},
{type,bag},
{size,undefined},
{snmp,[]},
{storage_type,disc_copies},
{subscribers,[]},
{user_properties,...},
{...}|...]
===
I see that load_node is unknown--probably due to the fact that the other disk
replica is not active (since someotherbox.somedomain is still halted, and for
purposes of this test I cannot bring it back until I restore service to the
primary node).
Again, I have not yet tried to restart with:
-mnesia max_wait_for_decision Timeout
I figured I would gather as much info as possible, then save what may be the
best (and/or last :-) option left.
Thanks again for all your help,
-Rick
> Clearly the deletion of some_table has really upset mnesia, because I keep
> seeing the following repeated over and over:
>
> Oct 17 14:56:51 somebox.somedomain foo1: [ID 702911 local6.debug] Mnesia('foo_rel@REDACTED'): Restarting transaction {tid,33070,<0.95.0>}: in 31404ms {cyclic,'bar_rel@REDACTED',{schema,'______WHOLETABLE_____'},write,write,{tid,1,<4257.71.0>}}
> Oct 17 14:57:23 somebox.somedomain foo1: [ID 702911 local6.debug] Mnesia('foo_rel@REDACTED'): Restarting transaction {tid,33070,<0.95.0>}: in 31412ms {cyclic,'bar_rel@REDACTED',{schema,'______WHOLETABLE_____'},write,write,{tid,1,<4257.71.0>}}
> Oct 17 14:57:54 somebox.somedomain foo1: [ID 702911 local6.debug] Mnesia('foo_rel@REDACTED'): Restarting transaction {tid,33070,<0.95.0>}: in 32007ms {cyclic,'bar_rel@REDACTED',{schema,'______WHOLETABLE_____'},write,write,{tid,1,<4257.71.0>}}
> Oct 17 14:58:26 somebox.somedomain foo1: [ID 702911 local6.debug] Mnesia('foo_rel@REDACTED'): Restarting transaction {tid,33070,<0.95.0>}: in 32441ms {cyclic,'bar_rel@REDACTED',{schema,'______WHOLETABLE_____'},write,write,{tid,1,<4257.71.0>}}
> Oct 17 14:58:59 somebox.somedomain foo1: [ID 702911 local6.debug] Mnesia('foo_rel@REDACTED'): Restarting transaction {tid,33070,<0.95.0>}: in 32886ms {cyclic,'bar_rel@REDACTED',{schema,'______WHOLETABLE_____'},write,write,{tid,1,<4257.71.0>}}
> Oct 17 14:59:07 somebox.somedomain foo1: [ID 702911 local6.debug] Mnesia('foo_rel@REDACTED'): Transaction log dump initiated by time_threshold: already_dumped
> Oct 17 15:00:05 somebox.somedomain foo1: [ID 702911 local6.debug] Mnesia('foo_rel@REDACTED'): Restarting transaction {tid,33070,<0.95.0>}: in 33459ms {cyclic,'bar_rel@REDACTED',{schema,'______WHOLETABLE_____'},write,write,{tid,1,<4257.71.0>}}
> > RP> Date: Tue, 16 Oct 2007 23:05:36 -0500
> > RP> From: Rick Pettit <rpettit@REDACTED>
> > RP> To: erlang-questions@REDACTED
> > RP> Subject: [erlang-questions] why might mnesia:start() hang?
> > RP>
> > RP> I seem to have encountered a situation in which I am
> > RP> unable to start mnesia. Attempts to start mnesia (via
> > RP> mnesia:start/0) hang the erlang shell.
> > RP>
> > RP> In the scenario below there are 2 physical servers,
> > RP> each running an instance of the foo_rel and
> > RP> bar_rel. The second physical server,
> > RP> someother.somedomain, has been halted prior to starting
> > RP> the nodes somebox.somedomain.
> > RP>
> > RP> The foo_rel instances contain disc_copy tables--bar_rel
> > RP> instances contain ram_copies only.
> > RP>
> > RP> (foo_rel@REDACTED)1> application:which_applications().
> > RP> [{sasl,"SASL CXC 138 11","2.1.5.1"},
> > RP> {stdlib,"ERTS CXC 138 10","1.14.5"},
> > RP> {kernel,"ERTS CXC 138 10","2.11.5"}]
> > RP>
> > RP>
> > RP> NOTE: there are other applications in this release which *should* be running
> > RP> but are not, almost certainly due to the fact that mnesia is refusing
> > RP> to start
> > RP>
> > RP>
> > RP> (foo_rel@REDACTED)2> mnesia:info().
> > RP> ===> System info in version "4.3.5", debug level = none <===
> > RP> opt_disc. Directory "/u1/otp/db/foo_rel" is used.
> > RP> use fallback at restart = false
> > RP> running db nodes = ['foo_rel@REDACTED']
> > RP> stopped db nodes = ['foo_rel@REDACTED','bar_rel@REDACTED','bar_rel@REDACTED']
> > RP> ok
> > RP>
> > RP>
> > RP> (foo_rel@REDACTED)3> mnesia:stop().
> > RP> stopped
> > RP>
> > RP>
> > RP> (foo_rel@REDACTED)4> mnesia:start().
> > RP> ...shell hangs forever...
> > RP>
> > RP>
> > RP> Shell back into the node, try again:
> > RP>
> > RP>
> > RP> (foo_rel@REDACTED)1> mnesia:info().
> > RP> ===> System info in version "4.3.5", debug level = none <===
> > RP> opt_disc. Directory "/u1/otp/db/foo_rel" is used.
> > RP> use fallback at restart = false
> > RP> running db nodes = ['foo_rel@REDACTED']
> > RP> stopped db nodes = ['foo_rel@REDACTED','bar_rel@REDACTED','bar_rel@REDACTED']
> > RP> ok
> > RP>
> > RP>
> > RP> (foo_rel@REDACTED)2> application:which_applications().
> > RP> [{sasl,"SASL CXC 138 11","2.1.5.1"},
> > RP> {stdlib,"ERTS CXC 138 10","1.14.5"},
> > RP> {kernel,"ERTS CXC 138 10","2.11.5"}]
> > RP>
> > RP>
> > RP> (foo_rel@REDACTED)3> mnesia:start().
> > RP> ...hangs forever...
> > RP>
> > RP>
> > RP> ======
> > RP>
> > RP> I cannot seem to figure out:
> > RP>
> > RP> 1) why mnesia refuses to start
> > RP>
> > RP> 2) why mnesia:start() hangs forever at the shell
> > RP> (vs. return an error, etc)
> > RP>
> > RP> Any applications requiring mnesia tables do a
> > RP> mnesia:wait_for_tables/2 on them.
> > RP>
> > RP> A special process performs a mnesia:force_load_table/1,
> > RP> if necessary (e.g. when wait_for_tables/2 times out).
> > RP>
> > RP> Unfortunately, this code doesn't get a chance to run if
> > RP> mnesia itself refuses to start (in many previous test
> > RP> runs the releases started--in some cases the default
> > RP> table load algorithm worked just fine, and in other
> > RP> failure scenarios the force_table_load was
> > RP> necessary--but the system always manged to start until
> > RP> now).
> > RP>
> > RP> Surely I must just be short on coffee (or sleep) or
> > RP> both. Any help would be greatly appreciated.
> > RP>
> > RP> -Rick
> > RP> _______________________________________________
> > RP> erlang-questions mailing list
> > RP> erlang-questions@REDACTED
> > RP> http://www.erlang.org/mailman/listinfo/erlang-questions
More information about the erlang-questions
mailing list