[erlang-questions] Problems with Mnesia and failure situations
Dan Gudmundsson
dgud@REDACTED
Wed Apr 22 14:34:56 CEST 2009
Note to self: compile the code before sending patches all over the world.
A tested version will be included in the next patch release :-)
/Dan
faenor:mnesia> ct diff -diff_format -pre src/mnesia_tm.erl
2197c2197
< send_to_pids([Tid#tid.pid | get_elements(friends,Store)], Msg).
---
> send_to_pids([Tid#tid.pid | [Pid || Pid <- get_elements(friends,Store),
is_pid(Pid)]], Msg).
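
For readers without the mnesia source at hand, here is a minimal, self-contained sketch of the idea behind the patch. It is not the mnesia_tm code: the module name and the helpers send_all/2 and send_pids_only/2 are hypothetical, and the integer 42 merely stands in for whatever non-pid entry may end up in the list. It only shows that sending to a term that is not a valid destination raises badarg, and that filtering with is_pid/1 avoids it.

```erlang
-module(send_filter_sketch).
-export([send_all/2, send_pids_only/2, demo/0]).

%% Hypothetical stand-in for a helper that sends Msg to every entry.
%% The send operator raises badarg if an entry is not a valid destination.
send_all(Dests, Msg) ->
    lists:foreach(fun(Dest) -> Dest ! Msg end, Dests).

%% Same helper, but entries that are not pids are dropped first,
%% mirroring the list comprehension in the patch.
send_pids_only(Dests, Msg) ->
    send_all([Pid || Pid <- Dests, is_pid(Pid)], Msg).

%% 42 is an arbitrary non-pid term chosen for illustration only.
demo() ->
    Mixed = [self(), 42],
    Unfiltered = try send_all(Mixed, hello)
                 catch error:badarg -> badarg
                 end,
    Filtered = send_pids_only(Mixed, hello),
    {Unfiltered, Filtered}.
```

Calling send_filter_sketch:demo() exercises both variants on the same mixed list, so the difference between the unfiltered and the filtered send is visible in the returned tuple.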
Dan Gudmundsson wrote:
> That is a bug; it happens when using qlc inside a Mnesia transaction and
> another node goes down.
>
> Patch:
>
> ct diff -diff_format -pre src/mnesia_tm.erl
> 2197c2197
> < send_to_pids([Tid#tid.pid | get_elements(friends,Store)], Msg).
> ---
> > send_to_pids([Tid#tid.pid | [Pid || Pid <-
> get_elements(friends,Store), is_pid(Pid)], Msg).
>
> /Dan
>
> Teemu Antti-Poika wrote:
>> Hello world,
>>
>> we're in the process of adopting Erlang as a part of our project and
>> we've run into a problem with Mnesia's High Availability features. I'd
>> very much appreciate any insights you may have on where we go wrong.
>>
>> Our setup is a two-node cluster and the intention is to have, among
>> other things, replicated mnesia tables. Both nodes read/write from the
>> tables. The assumption is that if one node goes down the other one
>> continues alone until the failover problem is fixed.
>>
>> Our failover tests failed horribly: when one node was taken out, the
>> other died as well. I've managed to isolate the problem to a one-file
>> test below:
>>
>>
>> == 8< === Test code begins
>> -module(jb_testing).
>> -include_lib("stdlib/include/qlc.hrl").
>>
>> -compile(export_all).
>>
>> -record(message, {id,
>>                   state}).
>>
>> %% run once
>> setup(Nodes) ->
>>     create_table(message, [{type, set}, {ram_copies, Nodes},
>>                            {attributes, record_info(fields, message)}]),
>>     ok = mnesia:wait_for_tables([message], 30000).
>>
>> create_table(Table, TableDefinition) ->
>>     case mnesia:create_table(Table, TableDefinition) of
>>         {atomic, ok} ->
>>             error_logger:info_msg("Created table ~p~n", [Table]),
>>             ok;
>>         {aborted, {already_exists, Table}} ->
>>             ok;
>>         {aborted, Reason} ->
>>             exit(Reason)
>>     end.
>>
>> load_mnesia() ->
>>     F = fun() ->
>>                 C = qlc:cursor(qlc:q([M || M <- mnesia:table(message),
>>                                            M#message.state =:= new])),
>>                 case qlc:next_answers(C, 1) of
>>                     [] ->
>>                         none;
>>                     [_M] ->
>>                         none
>>                 end,
>>                 ok = qlc:delete_cursor(C)
>>         end,
>>     mnesia:transaction(F),
>>     % timer:sleep(100),
>>     load_mnesia().
>>
>> == 8< === Test code ends
>>
>> Here's how I run the code to demonstrate the problem (node names are
>> examples):
>> - On server1: mnesia:create_schema([egw@REDACTED,
>> egw@REDACTED])
>> - On both nodes: mnesia:start().
>> - On server1: jb_testing:setup([egw@REDACTED,
>> egw@REDACTED]).
>> - On server2: jb_testing:load_mnesia(). This starts busy-looping and
>> creating load for mnesia from current process, i.e. locks up your
>> shell.
>> - On server1: halt().
>>
>> After a short while server2 reports mnesia as crashed. Some sample
>> logging from server2, with mnesia debugging enabled:
>>
>> Erlang R13B (erts-5.7.1) [source] [64-bit] [smp:2:2] [rq:2]
>> [async-threads:0] [hipe] [kernel-poll:false]
>>
>> Eshell V5.7.1 (abort with ^G)
>> (egw@REDACTED)1> mnesia:start().
>> Mnesia('egw@REDACTED'): mnesia_monitor starting: <0.69.0>
>> Mnesia('egw@REDACTED'): Version: "4.4.9"
>> [...lines of non-interesting start-up logging removed for brevity...]
>> Mnesia('egw@REDACTED'): Mnesia started, 0 seconds
>> ok
>> (egw@REDACTED)2> Mnesia('egw@REDACTED'):
>> Transaction log dump skipped (optional): schema_prepare
>> Mnesia('egw@REDACTED'): Logging mnesia_up
>> 'egw@REDACTED'
>>
>> (egw@REDACTED)3> Mnesia('egw@REDACTED'): write
>> performed by {tid,4,<3261.108.0>} on record:
>> {schema,message,
>> [{name,message},
>> {type,set},
>> {ram_copies,['egw@REDACTED',
>> 'egw@REDACTED']},
>> {disc_copies,[]},
>> {disc_only_copies,[]},
>> {load_order,0},
>> {access_mode,read_write},
>> {index,[]},
>> {snmp,[]},
>> {local_content,false},
>> {record_name,message},
>> {attributes,[id,state]},
>> {user_properties,[]},
>> {frag_properties,[]},
>>
>> {cookie,{{1240,390114,462850},'egw@REDACTED'}},
>> {version,{{2,0},[]}}]}
>> Mnesia('egw@REDACTED'): Transaction log dump skipped
>> (optional): schema_prepare
>> Mnesia('egw@REDACTED'): write performed by
>> {tid,4,<3261.108.0>} on record:
>> {schema,message,
>> [{name,message},
>> {type,set},
>> {ram_copies,['egw@REDACTED',
>> 'egw@REDACTED']},
>> {disc_copies,[]},
>> {disc_only_copies,[]},
>> {load_order,0},
>> {access_mode,read_write},
>> {index,[]},
>> {snmp,[]},
>> {local_content,false},
>> {record_name,message},
>> {attributes,[id,state]},
>> {user_properties,[]},
>> {frag_properties,[]},
>>
>> {cookie,{{1240,390114,462850},'egw@REDACTED'}},
>> {version,{{2,0},[]}}]}
>> Mnesia('egw@REDACTED'): Getting table message (ram_copies)
>> from disc: {dumper,
>>
>> create_table}
>>
>> (egw@REDACTED)3>
>> (egw@REDACTED)3> jb_testing:ping_mnesia().
>> Mnesia('egw@REDACTED'): Logging mnesia_down
>> 'egw@REDACTED'
>> Mnesia('egw@REDACTED'): Got mnesia_down from
>> 'egw@REDACTED', reconfiguring...
>> Mnesia('egw@REDACTED'): mnesia_monitor got FATAL ERROR
>> from: <0.73.0>
>>
>> =ERROR REPORT==== 22-Apr-2009::11:49:17 ===
>> Mnesia('egw@REDACTED'): ** ERROR ** (core dumped to file:
>> "/home/jetbet/MnesiaCore.egw@REDACTED")
>> ** FATAL ** mnesia_tm crashed: {badarg,
>> [{mnesia_tm,send_to_pids,2},
>>
>> {mnesia_tm,reconfigure_coordinators,2},
>> {mnesia_tm,doit_loop,1},
>> {mnesia_sp,init_proc,4},
>> {proc_lib,init_p_do_apply,3}]}
>> state: [<0.68.0>]
>> Mnesia('egw@REDACTED'): mnesia_controller terminated:
>> shutdown
>>
>> =ERROR REPORT==== 22-Apr-2009::11:49:27 ===
>> ** Generic server mnesia_monitor terminating
>> ** Last message in was {'EXIT',<0.68.0>,killed}
>> ** When Server state == {state,<0.68.0>,[],
>> ['egw@REDACTED'],
>> true,[],undefined,[]}
>> ** Reason for termination ==
>> ** killed
>>
>> =ERROR REPORT==== 22-Apr-2009::11:49:27 ===
>> ** Generic server mnesia_recover terminating
>> ** Last message in was {'EXIT',<0.68.0>,killed}
>> ** When Server state ==
>> {state,<0.68.0>,undefined,undefined,undefined,0,true,
>> []}
>> ** Reason for termination ==
>> ** killed
>> ** exception exit: killed
>> (egw@REDACTED)4>
>> =INFO REPORT==== 22-Apr-2009::11:49:27 ===
>> application: mnesia
>> exited: killed
>> type: temporary
>>
>>
>> The produced core dump is not recognized by crashdump_viewer ("...is not
>> an Erlang crash dump").
>>
>> Some notes:
>> - Erlang version info: Erlang R13B (erts-5.7.1) [source] [64-bit]
>> [smp:2:2] [rq:2] [async-threads:0] [hipe] [kernel-poll:false]
>> - Same problem occurs with R13A
>> - Our OS is CentOS (Red Hat) Linux, SMP-enabled 64-bit 2.6.18 kernel on
>> Intel.
>>
>> - if we reduce the load for mnesia by introducing sleep (100 ms) in
>> the test loop, the problem goes away or is at least less likely to
>> appear.
>> - we're using RAM tables in the example. Changing tables to
>> disc_copies makes no difference.
>> - changing the data lookup to the form
>>     F = fun() ->
>>             '$end_of_table' = mnesia:first(message)
>>         end,
>> works: halt() on another node does not bring the busy one down.
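
If the crash is indeed tied to qlc use inside the transaction, as the reply at the top of this thread suggests, the lookup from the test module could be expressed without qlc. The following is only a sketch under that assumption, not a verified workaround for the bug: the module name and the functions first_new/0, first/1 and demo/0 are hypothetical, while the record is the one from the test code. It uses mnesia:select/4 with a limit of one object and mnesia:select/1 to continue.

```erlang
-module(first_new_sketch).
-export([first_new/0, demo/0]).

-record(message, {id, state}).

%% Look up one message whose state is new, without a qlc cursor.
first_new() ->
    MatchSpec = [{#message{state = new, _ = '_'}, [], ['$_']}],
    mnesia:transaction(
      fun() ->
              first(mnesia:select(message, MatchSpec, 1, read))
      end).

%% Walk the chunked result until a match is found or the table ends.
first('$end_of_table') -> none;
first({[Msg | _], _Cont}) -> {ok, Msg};
first({[], Cont}) -> first(mnesia:select(Cont)).

%% Single-node demonstration with a RAM-only table (illustration only).
demo() ->
    ok = mnesia:start(),
    {atomic, ok} = mnesia:create_table(
                     message,
                     [{ram_copies, [node()]},
                      {attributes, record_info(fields, message)}]),
    {atomic, none} = first_new(),
    {atomic, ok} = mnesia:transaction(
                     fun() -> mnesia:write(#message{id = 1, state = new}) end),
    first_new().
```

demo/0 runs on a single node against an empty RAM table, so it says nothing about the two-node failure case; it only checks that the select-based lookup finds a record once one is written.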
>>
>> Any ideas?
>>
>> Thank you in advance,
>> Teemu Antti-Poika
>> _______________________________________________
>> erlang-questions mailing list
>> erlang-questions@REDACTED
>> http://www.erlang.org/mailman/listinfo/erlang-questions
>>
>