[erlang-questions] Problems with Mnesia and failure situations
Teemu Antti-Poika
anttipoi@REDACTED
Wed Apr 22 13:13:03 CEST 2009
Hello world,
we're in the process of adopting Erlang as a part of our project and
we've run into a problem with Mnesia's High Availability features. I'd
very much appreciate any insights you may have on where we are going wrong.
Our setup is a two-node cluster and the intention is to have, among
other things, replicated mnesia tables. Both nodes read/write from the
tables. The assumption is that if one node goes down the other one
continues alone until the failover problem is fixed.
Our failover tests failed horribly: when one node was taken out, the
other died as well. I've managed to isolate the problem to a one-file
test below:
== 8< === Test code begins
-module(jb_testing).
-include_lib("stdlib/include/qlc.hrl").
-compile(export_all).

-record(message, {id,
                  state}).

%% run once
setup(Nodes) ->
    create_table(message, [{type, set}, {ram_copies, Nodes},
                           {attributes, record_info(fields, message)}]),
    ok = mnesia:wait_for_tables([message], 30000).

create_table(Table, TableDefinition) ->
    case mnesia:create_table(Table, TableDefinition) of
        {atomic, ok} ->
            error_logger:info_msg("Created table ~p~n", [Table]),
            ok;
        {aborted, {already_exists, Table}} ->
            ok;
        {aborted, Reason} ->
            exit(Reason)
    end.

load_mnesia() ->
    F = fun() ->
                C = qlc:cursor(qlc:q([M || M <- mnesia:table(message),
                                           M#message.state =:= new ])),
                case qlc:next_answers(C, 1) of
                    [] ->
                        none;
                    [_M] ->
                        none
                end,
                ok = qlc:delete_cursor(C)
        end,
    mnesia:transaction(F),
    % timer:sleep(100),
    load_mnesia().
== 8< === Test code ends
Here's how I run the code to demonstrate the problem (node names are examples):
- On server1: mnesia:create_schema([egw@REDACTED, egw@REDACTED]).
- On both nodes: mnesia:start().
- On server1: jb_testing:setup([egw@REDACTED,
egw@REDACTED]).
- On server2: jb_testing:load_mnesia(). This starts a busy loop that
generates load for mnesia from the current process, i.e. it locks up
your shell.
- On server1: halt().
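The schema creation and mnesia start-up steps above could also be driven from a single shell. What follows is only a sketch: the module name jb_repro and the function prepare/1 are hypothetical and not part of the test file, and it assumes that all nodes in Nodes are already connected and that mnesia is stopped on each of them. jb_testing:setup(Nodes) would still have to be called afterwards.

```erlang
-module(jb_repro).
-export([prepare/1]).

%% Hypothetical helper, not part of the original test module.
%% Creates the schema on all Nodes and then starts mnesia on each of them.
prepare(Nodes) ->
    ok = mnesia:create_schema(Nodes),
    %% rpc:multicall/4 returns {Replies, BadNodes}; require that no node failed.
    {Replies, []} = rpc:multicall(Nodes, mnesia, start, []),
    true = lists:all(fun(Reply) -> Reply =:= ok end, Replies),
    ok.
```

The strict pattern matches make the helper crash early if any node fails to start, so a broken setup is not mistaken for the failover problem itself.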
After a short while server2 reports mnesia as crashed. Some sample
logging from server2, with mnesia debugging enabled:
Erlang R13B (erts-5.7.1) [source] [64-bit] [smp:2:2] [rq:2]
[async-threads:0] [hipe] [kernel-poll:false]
Eshell V5.7.1 (abort with ^G)
(egw@REDACTED)1> mnesia:start().
Mnesia('egw@REDACTED'): mnesia_monitor starting: <0.69.0>
Mnesia('egw@REDACTED'): Version: "4.4.9"
[...lines of non-interesting start-up logging removed for brevity...]
Mnesia('egw@REDACTED'): Mnesia started, 0 seconds
ok
(egw@REDACTED)2> Mnesia('egw@REDACTED'):
Transaction log dump skipped (optional): schema_prepare
Mnesia('egw@REDACTED'): Logging mnesia_up 'egw@REDACTED'
(egw@REDACTED)3> Mnesia('egw@REDACTED'): write
performed by {tid,4,<3261.108.0>} on record:
{schema,message,
[{name,message},
{type,set},
{ram_copies,['egw@REDACTED',
'egw@REDACTED']},
{disc_copies,[]},
{disc_only_copies,[]},
{load_order,0},
{access_mode,read_write},
{index,[]},
{snmp,[]},
{local_content,false},
{record_name,message},
{attributes,[id,state]},
{user_properties,[]},
{frag_properties,[]},
{cookie,{{1240,390114,462850},'egw@REDACTED'}},
{version,{{2,0},[]}}]}
Mnesia('egw@REDACTED'): Transaction log dump skipped
(optional): schema_prepare
Mnesia('egw@REDACTED'): write performed by
{tid,4,<3261.108.0>} on record:
{schema,message,
[{name,message},
{type,set},
{ram_copies,['egw@REDACTED',
'egw@REDACTED']},
{disc_copies,[]},
{disc_only_copies,[]},
{load_order,0},
{access_mode,read_write},
{index,[]},
{snmp,[]},
{local_content,false},
{record_name,message},
{attributes,[id,state]},
{user_properties,[]},
{frag_properties,[]},
{cookie,{{1240,390114,462850},'egw@REDACTED'}},
{version,{{2,0},[]}}]}
Mnesia('egw@REDACTED'): Getting table message (ram_copies)
from disc: {dumper,
create_table}
(egw@REDACTED)3>
(egw@REDACTED)3> jb_testing:ping_mnesia().
Mnesia('egw@REDACTED'): Logging mnesia_down
'egw@REDACTED'
Mnesia('egw@REDACTED'): Got mnesia_down from
'egw@REDACTED', reconfiguring...
Mnesia('egw@REDACTED'): mnesia_monitor got FATAL ERROR
from: <0.73.0>
=ERROR REPORT==== 22-Apr-2009::11:49:17 ===
Mnesia('egw@REDACTED'): ** ERROR ** (core dumped to file:
"/home/jetbet/MnesiaCore.egw@REDACTED")
** FATAL ** mnesia_tm crashed: {badarg,
[{mnesia_tm,send_to_pids,2},
{mnesia_tm,reconfigure_coordinators,2},
{mnesia_tm,doit_loop,1},
{mnesia_sp,init_proc,4},
{proc_lib,init_p_do_apply,3}]}
state: [<0.68.0>]
Mnesia('egw@REDACTED'): mnesia_controller terminated: shutdown
=ERROR REPORT==== 22-Apr-2009::11:49:27 ===
** Generic server mnesia_monitor terminating
** Last message in was {'EXIT',<0.68.0>,killed}
** When Server state == {state,<0.68.0>,[],
['egw@REDACTED'],
true,[],undefined,[]}
** Reason for termination ==
** killed
=ERROR REPORT==== 22-Apr-2009::11:49:27 ===
** Generic server mnesia_recover terminating
** Last message in was {'EXIT',<0.68.0>,killed}
** When Server state == {state,<0.68.0>,undefined,undefined,undefined,0,true,
[]}
** Reason for termination ==
** killed
** exception exit: killed
(egw@REDACTED)4>
=INFO REPORT==== 22-Apr-2009::11:49:27 ===
application: mnesia
exited: killed
type: temporary
The core dump that was produced is not recognized by crashdump_viewer
("...is not an Erlang crash dump").
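One possible way to look inside the MnesiaCore file is sketched below. It rests on an assumption that is not confirmed anywhere in this post: that mnesia writes its core file as a single Erlang term in the external term format (term_to_binary), typically a list of {Key, Value} pairs, rather than as an erl_crash.dump style file. The module name jb_core_peek and both functions are hypothetical.

```erlang
-module(jb_core_peek).
-export([read/1, keys/1]).

%% ASSUMPTION: the file holds one term encoded with term_to_binary/1.
%% Only use this on files you produced yourself; binary_to_term/1
%% should not be fed untrusted data.
read(File) ->
    {ok, Bin} = file:read_file(File),
    binary_to_term(Bin).

%% List the top-level keys if the term is a list of {Key, Value} pairs.
%% Elements that are not 2-tuples are skipped.
keys(File) ->
    case read(File) of
        Core when is_list(Core) -> [K || {K, _V} <- Core];
        _Other -> []
    end.
```

If the assumption does not hold, binary_to_term/1 fails with badarg, which at least confirms that the file is in some other format.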
Some notes:
- Erlang version info: Erlang R13B (erts-5.7.1) [source] [64-bit]
[smp:2:2] [rq:2] [async-threads:0] [hipe] [kernel-poll:false]
- Same problem occurs with R13A
- Our OS is CentOS (Red Hat) Linux, SMP-enabled 64-bit 2.6.18 kernel on Intel.
- if we reduce the load for mnesia by introducing sleep (100 ms) in
the test loop, the problem goes away or is at least less likely to
appear.
- we're using RAM tables in the example. Changing tables to
disc_copies makes no difference.
- changing the data lookup to the form
    F = fun() ->
            '$end_of_table' = mnesia:first(message)
        end,
  works: halt() on the other node does not bring the busy one down.
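For comparison runs, the last two variations in the notes (a pause between transactions, and a lookup through mnesia:first/1 instead of a qlc cursor) can be combined into one bounded loop. This is a sketch only: the module name jb_first_probe, the functions probe/0 and loop/2, and the iteration count are hypothetical, and unlike the snippet above it does not require the table to be empty.

```erlang
-module(jb_first_probe).
-export([probe/0, loop/2]).

%% One transaction that only asks for the first key of the message table.
probe() ->
    mnesia:transaction(fun() -> mnesia:first(message) end).

%% Run N probes in a row, sleeping PauseMs milliseconds after each one.
%% PauseMs = 0 approximates the busy loop, PauseMs = 100 the throttled one.
loop(0, _PauseMs) ->
    ok;
loop(N, PauseMs) when N > 0 ->
    {atomic, _Key} = probe(),
    timer:sleep(PauseMs),
    loop(N - 1, PauseMs).
```

Because the loop is bounded it returns control to the shell, so the same node can be reused for several runs with different pause values.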
Any ideas?
Thank you in advance,
Teemu Antti-Poika