[erlang-questions] Problems with Mnesia and failure situations

Dan Gudmundsson
Wed Apr 22 14:03:02 CEST 2009


That is a bug. It happens when qlc is used inside a mnesia transaction and another node goes down.

Patch:

ct diff -diff_format -pre src/mnesia_tm.erl
2197c2197
<     send_to_pids([Tid#tid.pid | get_elements(friends,Store)], Msg).
---
>     send_to_pids([Tid#tid.pid | [Pid || Pid <- get_elements(friends,Store), is_pid(Pid)]], Msg).
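
For readers who want to see the effect of the is_pid/1 filter in isolation, here is a minimal standalone sketch. It does not touch mnesia at all: the module name friends_filter_sketch, the helper only_pids/1, and the local send_to_pids/2 are hypothetical stand-ins, and the integer 42 is just an assumed example of a non-pid term ending up in the list. It only illustrates that sending to a non-pid raises badarg, and that filtering first avoids it.

```erlang
-module(friends_filter_sketch).
-export([only_pids/1, send_to_pids/2, demo/0]).

%% Keep only the elements that are process identifiers,
%% mirroring the list comprehension used in the patch.
only_pids(Elements) ->
    [Pid || Pid <- Elements, is_pid(Pid)].

%% Hypothetical stand-in for a "send to every pid" helper;
%% this is NOT the real mnesia_tm:send_to_pids/2.
send_to_pids(Pids, Msg) ->
    lists:foreach(fun(Pid) -> Pid ! Msg end, Pids).

%% Compare sending to a mixed list with and without the filter.
demo() ->
    %% Assumption for illustration: one non-pid term (42) in the list.
    Mixed = [self(), 42],
    %% Without the filter, the send to 42 fails with badarg.
    Unfiltered = try send_to_pids(Mixed, hello)
                 catch error:Reason -> {error, Reason}
                 end,
    %% With the filter, only self() is messaged and the call succeeds.
    Filtered = send_to_pids(only_pids(Mixed), hello),
    {Unfiltered, Filtered}.
```

Calling friends_filter_sketch:demo() from the shell shows the two outcomes side by side; the messages sent to self() stay in the caller's mailbox and can be cleared with flush().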

/Dan

Teemu Antti-Poika wrote:
> Hello world,
> 
> we're in the process of adopting Erlang as a part of our project and
> we've run into a problem with Mnesia's High Availability features. I'd
> very much appreciate any insights you may have on where we go wrong.
> 
> Our setup is a two-node cluster and the intention is to have, among
> other things, replicated mnesia tables. Both nodes read/write from the
> tables. The assumption is that if one node goes down the other one
> continues alone until the failover problem is fixed.
> 
> Our failover tests failed horribly: when one node was taken out, the
> other died as well. I've managed to isolate the problem to a one-file
> test below:
> 
> 
> == 8< === Test code begins
> -module(jb_testing).
> -include_lib("stdlib/include/qlc.hrl").
> 
> -compile(export_all).
> 
> -record(message, {id,
>                   state}).
> 
> %% run once
> setup(Nodes) ->
>     create_table(message, [{type, set}, {ram_copies, Nodes},
>                               {attributes, record_info(fields, message)}]),
>     ok = mnesia:wait_for_tables([message], 30000).
> 
> create_table(Table, TableDefinition) ->
>     case mnesia:create_table(Table, TableDefinition) of
>         {atomic, ok} ->
>             error_logger:info_msg("Created table ~p~n", [Table]),
>             ok;
>         {aborted, {already_exists, Table}} ->
>             ok;
>         {aborted, Reason} ->
>             exit(Reason)
>     end.
> 
> load_mnesia() ->
>     F = fun() ->
>                 C = qlc:cursor(qlc:q([M || M <- mnesia:table(message),
>                                            M#message.state =:= new])),
>                 case qlc:next_answers(C, 1) of
>                     [] ->
>                         none;
>                     [_M] ->
>                         none
>                 end,
>                 ok = qlc:delete_cursor(C)
> 
>         end,
>     mnesia:transaction(F),
> %    timer:sleep(100),
>     load_mnesia().
> 
> == 8< === Test code ends
> 
> Here's how I run the code to demonstrate the problem (node names are examples):
> - On server1: create_schema([,
> ])
> - On both nodes: mnesia:start().
> - On server1: jb_testing:setup([,
> ]).
> - On server2: jb_testing:load_mnesia(). This starts busy-looping and
> creating load for mnesia from current process, i.e. locks up your
> shell.
> - On server1: halt().
> 
> After a short while server2 reports mnesia as crashed. Some sample
> logging from server2, with mnesia debugging enabled:
> 
> Erlang R13B (erts-5.7.1) [source] [64-bit] [smp:2:2] [rq:2]
> [async-threads:0] [hipe] [kernel-poll:false]
> 
> Eshell V5.7.1  (abort with ^G)
> ()1> mnesia:start().
> Mnesia(''): mnesia_monitor starting: <0.69.0>
> Mnesia(''): Version: "4.4.9"
> [...lines of non-interesting start-up logging removed for brevity...]
> Mnesia(''): Mnesia started, 0 seconds
> ok
> ()2> Mnesia(''):
> Transaction log dump skipped (optional): schema_prepare
> Mnesia(''): Logging mnesia_up ''
> 
> ()3> Mnesia(''): write
> performed by {tid,4,<3261.108.0>} on record:
>         {schema,message,
>                 [{name,message},
>                  {type,set},
>                  {ram_copies,['',
>                               '']},
>                  {disc_copies,[]},
>                  {disc_only_copies,[]},
>                  {load_order,0},
>                  {access_mode,read_write},
>                  {index,[]},
>                  {snmp,[]},
>                  {local_content,false},
>                  {record_name,message},
>                  {attributes,[id,state]},
>                  {user_properties,[]},
>                  {frag_properties,[]},
>                  {cookie,{{1240,390114,462850},''}},
>                  {version,{{2,0},[]}}]}
> Mnesia(''): Transaction log dump skipped
> (optional): schema_prepare
> Mnesia(''): write performed by
> {tid,4,<3261.108.0>} on record:
>         {schema,message,
>                 [{name,message},
>                  {type,set},
>                  {ram_copies,['',
>                               '']},
>                  {disc_copies,[]},
>                  {disc_only_copies,[]},
>                  {load_order,0},
>                  {access_mode,read_write},
>                  {index,[]},
>                  {snmp,[]},
>                  {local_content,false},
>                  {record_name,message},
>                  {attributes,[id,state]},
>                  {user_properties,[]},
>                  {frag_properties,[]},
>                  {cookie,{{1240,390114,462850},''}},
>                  {version,{{2,0},[]}}]}
> Mnesia(''): Getting table message (ram_copies)
> from disc: {dumper,
> 
>             create_table}
> 
> ()3>
> ()3> jb_testing:ping_mnesia().
> Mnesia(''): Logging mnesia_down
> ''
> Mnesia(''): Got mnesia_down from
> '', reconfiguring...
> Mnesia(''): mnesia_monitor got FATAL ERROR
> from: <0.73.0>
> 
> =ERROR REPORT==== 22-Apr-2009::11:49:17 ===
> Mnesia(''): ** ERROR ** (core dumped to file:
> "/home/jetbet/")
>  ** FATAL ** mnesia_tm crashed: {badarg,
>                                     [{mnesia_tm,send_to_pids,2},
>                                      {mnesia_tm,reconfigure_coordinators,2},
>                                      {mnesia_tm,doit_loop,1},
>                                      {mnesia_sp,init_proc,4},
>                                      {proc_lib,init_p_do_apply,3}]}
> state: [<0.68.0>]
> Mnesia(''): mnesia_controller terminated: shutdown
> 
> =ERROR REPORT==== 22-Apr-2009::11:49:27 ===
> ** Generic server mnesia_monitor terminating
> ** Last message in was {'EXIT',<0.68.0>,killed}
> ** When Server state == {state,<0.68.0>,[],
>                                [''],
>                                true,[],undefined,[]}
> ** Reason for termination ==
> ** killed
> 
> =ERROR REPORT==== 22-Apr-2009::11:49:27 ===
> ** Generic server mnesia_recover terminating
> ** Last message in was {'EXIT',<0.68.0>,killed}
> ** When Server state == {state,<0.68.0>,undefined,undefined,undefined,0,true,
>                                []}
> ** Reason for termination ==
> ** killed
> ** exception exit: killed
> ()4>
> =INFO REPORT==== 22-Apr-2009::11:49:27 ===
>     application: mnesia
>     exited: killed
>     type: temporary
> 
> 
> The core dump produced is not recognized by crashdump_viewer ("...is not
> an Erlang crash dump").
> 
> Some notes:
> - Erlang version info: Erlang R13B (erts-5.7.1) [source] [64-bit]
> [smp:2:2] [rq:2] [async-threads:0] [hipe] [kernel-poll:false]
> - Same problem occurs with R13A
> - Our OS is CentOS (Red Hat) Linux, SMP-enabled 64-bit 2.6.18 kernel on Intel.
> 
> - if we reduce the load on mnesia by introducing a sleep (100 ms) in
> the test loop, the problem goes away or is at least less likely to
> appear.
> - we're using RAM tables in the example. Changing tables to
> disc_copies makes no difference.
> - changing the data lookup to the form
>     F = fun() ->
>                 '$end_of_table' = mnesia:first(message)
>         end,
>   works: halt() on another node does not bring the busy one down.
> 
> Any ideas?
> 
> Thank you already in advance,
> Teemu Antti-Poika
> _______________________________________________
> erlang-questions mailing list
> 
> http://www.erlang.org/mailman/listinfo/erlang-questions
> 