[erlang-questions] Problems with Mnesia and failure situations
Dan Gudmundsson
dgud@REDACTED
Wed Apr 22 14:03:02 CEST 2009
That is a bug. It happens when qlc is used inside a mnesia transaction and another node goes down.
Patch:
ct diff -diff_format -pre src/mnesia_tm.erl
2197c2197
< send_to_pids([Tid#tid.pid | get_elements(friends,Store)], Msg).
---
> send_to_pids([Tid#tid.pid | [Pid || Pid <- get_elements(friends,Store), is_pid(Pid)]], Msg).
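
The stack trace in the report below shows the badarg raised in mnesia_tm:send_to_pids/2, and the patch filters the friends elements with is_pid/1. The effect of that filter can be sketched in isolation. This is a minimal sketch, not the OTP source: the module name, the simplified send_to_pids/2, and the integer 42 standing in for a non-pid entry are all assumptions made for illustration.

```erlang
-module(friends_filter_sketch).
-export([send_to_pids/2, only_pids/1, demo/0]).

%% Simplified stand-in for a send loop (assumed shape, not the real
%% mnesia_tm code): sends Msg to every element of the list.
send_to_pids([Pid | Pids], Msg) ->
    Pid ! Msg,
    send_to_pids(Pids, Msg);
send_to_pids([], _Msg) ->
    ok.

%% The same comprehension the patch applies to the friends elements:
%% keep only the entries that really are pids.
only_pids(Elements) ->
    [Pid || Pid <- Elements, is_pid(Pid)].

%% Hypothetical friends list with one valid pid and one non-pid entry.
demo() ->
    Friends = [self(), 42],
    %% Unfiltered: sending to the integer raises badarg.
    Unfiltered = (catch send_to_pids(Friends, hello)),
    %% Filtered: the non-pid entry is skipped and the send loop completes.
    Filtered = send_to_pids(only_pids(Friends), hello),
    {Unfiltered, Filtered}.
```

Calling friends_filter_sketch:demo() returns a tuple whose first element is an 'EXIT' term carrying badarg and whose second element is ok.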
/Dan
Teemu Antti-Poika wrote:
> Hello world,
>
> we're in the process of adopting Erlang as a part of our project and
> we've run into a problem with Mnesia's High Availability features. I'd
> very much appreciate any insights you may have on where we go wrong.
>
> Our setup is a two-node cluster and the intention is to have, among
> other things, replicated mnesia tables. Both nodes read/write from the
> tables. The assumption is that if one node goes down the other one
> continues alone until the failover problem is fixed.
>
> Our failover tests failed horribly: when one node was taken out, the
> other died as well. I've managed to isolate the problem to a one-file
> test below:
>
>
> == 8< === Test code begins
> -module(jb_testing).
> -include_lib("stdlib/include/qlc.hrl").
>
> -compile(export_all).
>
> -record(message, {id,
>                   state}).
>
> %% run once
> setup(Nodes) ->
>     create_table(message, [{type, set}, {ram_copies, Nodes},
>                            {attributes, record_info(fields, message)}]),
>     ok = mnesia:wait_for_tables([message], 30000).
>
> create_table(Table, TableDefinition) ->
>     case mnesia:create_table(Table, TableDefinition) of
>         {atomic, ok} ->
>             error_logger:info_msg("Created table ~p~n", [Table]),
>             ok;
>         {aborted, {already_exists, Table}} ->
>             ok;
>         {aborted, Reason} ->
>             exit(Reason)
>     end.
>
> load_mnesia() ->
>     F = fun() ->
>                 C = qlc:cursor(qlc:q([M || M <- mnesia:table(message),
>                                            M#message.state =:= new])),
>                 case qlc:next_answers(C, 1) of
>                     [] ->
>                         none;
>                     [_M] ->
>                         none
>                 end,
>                 ok = qlc:delete_cursor(C)
>         end,
>     mnesia:transaction(F),
>     % timer:sleep(100),
>     load_mnesia().
>
> == 8< === Test code ends
>
> Here's how I run the code to demonstrate the problem (node names are examples):
> - On server1: mnesia:create_schema([egw@REDACTED, egw@REDACTED])
> - On both nodes: mnesia:start().
> - On server1: jb_testing:setup([egw@REDACTED, egw@REDACTED]).
> - On server2: jb_testing:load_mnesia(). This starts busy-looping and
> creating load for mnesia from current process, i.e. locks up your
> shell.
> - On server1: halt().
>
> After a short while server2 reports mnesia as crashed. Some sample
> logging from server2, with mnesia debugging enabled:
>
> Erlang R13B (erts-5.7.1) [source] [64-bit] [smp:2:2] [rq:2]
> [async-threads:0] [hipe] [kernel-poll:false]
>
> Eshell V5.7.1 (abort with ^G)
> (egw@REDACTED)1> mnesia:start().
> Mnesia('egw@REDACTED'): mnesia_monitor starting: <0.69.0>
> Mnesia('egw@REDACTED'): Version: "4.4.9"
> [...lines of non-interesting start-up logging removed for brevity...]
> Mnesia('egw@REDACTED'): Mnesia started, 0 seconds
> ok
> (egw@REDACTED)2> Mnesia('egw@REDACTED'):
> Transaction log dump skipped (optional): schema_prepare
> Mnesia('egw@REDACTED'): Logging mnesia_up 'egw@REDACTED'
>
> (egw@REDACTED)3> Mnesia('egw@REDACTED'): write
> performed by {tid,4,<3261.108.0>} on record:
> {schema,message,
> [{name,message},
> {type,set},
> {ram_copies,['egw@REDACTED',
> 'egw@REDACTED']},
> {disc_copies,[]},
> {disc_only_copies,[]},
> {load_order,0},
> {access_mode,read_write},
> {index,[]},
> {snmp,[]},
> {local_content,false},
> {record_name,message},
> {attributes,[id,state]},
> {user_properties,[]},
> {frag_properties,[]},
> {cookie,{{1240,390114,462850},'egw@REDACTED'}},
> {version,{{2,0},[]}}]}
> Mnesia('egw@REDACTED'): Transaction log dump skipped
> (optional): schema_prepare
> Mnesia('egw@REDACTED'): write performed by
> {tid,4,<3261.108.0>} on record:
> {schema,message,
> [{name,message},
> {type,set},
> {ram_copies,['egw@REDACTED',
> 'egw@REDACTED']},
> {disc_copies,[]},
> {disc_only_copies,[]},
> {load_order,0},
> {access_mode,read_write},
> {index,[]},
> {snmp,[]},
> {local_content,false},
> {record_name,message},
> {attributes,[id,state]},
> {user_properties,[]},
> {frag_properties,[]},
> {cookie,{{1240,390114,462850},'egw@REDACTED'}},
> {version,{{2,0},[]}}]}
> Mnesia('egw@REDACTED'): Getting table message (ram_copies)
> from disc: {dumper,create_table}
>
> (egw@REDACTED)3>
> (egw@REDACTED)3> jb_testing:ping_mnesia().
> Mnesia('egw@REDACTED'): Logging mnesia_down
> 'egw@REDACTED'
> Mnesia('egw@REDACTED'): Got mnesia_down from
> 'egw@REDACTED', reconfiguring...
> Mnesia('egw@REDACTED'): mnesia_monitor got FATAL ERROR
> from: <0.73.0>
>
> =ERROR REPORT==== 22-Apr-2009::11:49:17 ===
> Mnesia('egw@REDACTED'): ** ERROR ** (core dumped to file:
> "/home/jetbet/MnesiaCore.egw@REDACTED")
> ** FATAL ** mnesia_tm crashed: {badarg,
> [{mnesia_tm,send_to_pids,2},
> {mnesia_tm,reconfigure_coordinators,2},
> {mnesia_tm,doit_loop,1},
> {mnesia_sp,init_proc,4},
> {proc_lib,init_p_do_apply,3}]}
> state: [<0.68.0>]
> Mnesia('egw@REDACTED'): mnesia_controller terminated: shutdown
>
> =ERROR REPORT==== 22-Apr-2009::11:49:27 ===
> ** Generic server mnesia_monitor terminating
> ** Last message in was {'EXIT',<0.68.0>,killed}
> ** When Server state == {state,<0.68.0>,[],
> ['egw@REDACTED'],
> true,[],undefined,[]}
> ** Reason for termination ==
> ** killed
>
> =ERROR REPORT==== 22-Apr-2009::11:49:27 ===
> ** Generic server mnesia_recover terminating
> ** Last message in was {'EXIT',<0.68.0>,killed}
> ** When Server state == {state,<0.68.0>,undefined,undefined,undefined,0,true,
> []}
> ** Reason for termination ==
> ** killed
> ** exception exit: killed
> (egw@REDACTED)4>
> =INFO REPORT==== 22-Apr-2009::11:49:27 ===
> application: mnesia
> exited: killed
> type: temporary
>
>
> The core dump produced is not recognized by crashdump_viewer ("...is not
> an Erlang crash dump").
>
> Some notes:
> - Erlang version info: Erlang R13B (erts-5.7.1) [source] [64-bit]
> [smp:2:2] [rq:2] [async-threads:0] [hipe] [kernel-poll:false]
> - Same problem occurs with R13A
> - Our OS is CentOS (Red Hat) Linux, SMP-enabled 64-bit 2.6.18 kernel on Intel.
>
> - If we reduce the load on mnesia by introducing a sleep (100 ms) in
> the test loop, the problem goes away or is at least less likely to
> appear.
> - We're using RAM tables in the example. Changing the tables to
> disc_copies makes no difference.
> - Changing the data lookup to the form
>     F = fun() ->
>                 '$end_of_table' = mnesia:first(message)
>         end,
> works: halt() on the other node does not bring the busy one down.
>
> Any ideas?
>
> Thank you already in advance,
> Teemu Antti-Poika
> _______________________________________________
> erlang-questions mailing list
> erlang-questions@REDACTED
> http://www.erlang.org/mailman/listinfo/erlang-questions
>
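
The reporter's last note says that a lookup that does not go through qlc (mnesia:first/1) survives the other node's halt(). For readers who cannot apply the patch right away, the same "fetch at most one message in state new" query can be written without a qlc cursor, for example with mnesia:select/4. This is only a sketch under assumptions: the module and function names are hypothetical, and the thread does not establish whether this form avoids the crash on an unpatched node.

```erlang
-module(select_sketch).
-export([first_new_message/0]).

%% Same record as in the reporter's test module.
-record(message, {id, state}).

%% Returns {atomic, {ok, Msg}} for one message in state 'new',
%% or {atomic, none} when no such message exists.
first_new_message() ->
    %% Match every message record whose state field is 'new'.
    MatchSpec = [{#message{state = new, _ = '_'}, [], ['$_']}],
    F = fun() ->
                %% Ask for at most one object per chunk, with a read lock.
                take_one(mnesia:select(message, MatchSpec, 1, read))
        end,
    mnesia:transaction(F).

%% A chunk may be empty while more data remains, so keep following
%% the continuation until an object or the end of the table is reached.
take_one({[Msg | _], _Cont}) ->
    {ok, Msg};
take_one({[], Cont}) ->
    take_one(mnesia:select(Cont));
take_one('$end_of_table') ->
    none.
```

In the test loop this call would take the place of the mnesia:transaction(F) call that wraps the qlc cursor.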