[erlang-questions] Problems with Mnesia and failure situations

Teemu Antti-Poika anttipoi@REDACTED
Wed Apr 22 16:47:30 CEST 2009


[Oops, forgot to CC the list - sending a copy]

Thanks, Dan!

This solves the issue. This database has an excellent support response time :)

I take it that mnesia timing out on node A when I unplug or shut down
node B is a necessary evil? While the process that owns the
mnesia-bound data is blocked waiting on the remote node, it cannot react
to a request to dump tables to disk. Eventually the supervisor shuts it
down forcefully, so the RAM-resident data is lost, I think?

Teemu

On Wed, Apr 22, 2009 at 3:34 PM, Dan Gudmundsson <dgud@REDACTED> wrote:
>
> Note to self: compile the code before sending patches all over the world.
> A tested version will be included in the next patch release :-)
>
> /Dan
>
> faenor:mnesia> ct diff -diff_format -pre src/mnesia_tm.erl
> 2197c2197
> <     send_to_pids([Tid#tid.pid | get_elements(friends,Store)], Msg).
> ---
> >     send_to_pids([Tid#tid.pid | [Pid || Pid <- get_elements(friends,Store), is_pid(Pid)]], Msg).
>
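
The corrected patch above appears to filter the "friends" list down to
real pids before sending, presumably because a non-pid entry makes `!`
fail with badarg (which would match the crash report further down). A
minimal standalone sketch of that idea follows. It is not the mnesia_tm
code: the module name and the only_pids/1 helper are made up for
illustration, and send_to_pids/2 here is a simplified stand-in.

```erlang
%% Illustrative sketch only; not the mnesia_tm implementation.
-module(friends_filter_sketch).
-export([only_pids/1, send_to_pids/2]).

%% Keep only the elements that are pids, in the same way the patched
%% line does with its list comprehension and is_pid/1 guard.
only_pids(Elements) ->
    [Pid || Pid <- Elements, is_pid(Pid)].

%% Send Msg to every pid in the list. Sending to a term that is not a
%% pid, a port, a registered name or a {Name, Node} tuple raises badarg,
%% so callers are expected to filter first.
send_to_pids(Pids, Msg) ->
    lists:foreach(fun(Pid) -> Pid ! Msg end, Pids),
    ok.
```

With this, friends_filter_sketch:send_to_pids(friends_filter_sketch:only_pids(List), Msg)
skips stray entries instead of crashing the sender.
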
> Dan Gudmundsson wrote:
>>
>> That is a bug; it happens when qlc is used inside a mnesia transaction
>> and another node goes down.
>>
>> Patch:
>>
>> ct diff -diff_format -pre src/mnesia_tm.erl
>> 2197c2197
>> <     send_to_pids([Tid#tid.pid | get_elements(friends,Store)], Msg).
>> ---
>>  >     send_to_pids([Tid#tid.pid | [Pid || Pid <-
>> get_elements(friends,Store), is_pid(Pid)], Msg).
>>
>> /Dan
>>
>> Teemu Antti-Poika wrote:
>>>
>>> Hello world,
>>>
>>> we're in the process of adopting Erlang as a part of our project and
>>> we've run into a problem with Mnesia's High Availability features. I'd
>>> very much appreciate any insights you may have on where we go wrong.
>>>
>>> Our setup is a two-node cluster and the intention is to have, among
>>> other things, replicated mnesia tables. Both nodes read/write from the
>>> tables. The assumption is that if one node goes down the other one
>>> continues alone until the failover problem is fixed.
>>>
>>> Our failover tests failed horribly: when one node was taken out, the
>>> other died as well. I've managed to isolate the problem to a one-file
>>> test below:
>>>
>>>
>>> == 8< === Test code begins
>>> -module(jb_testing).
>>> -include_lib("stdlib/include/qlc.hrl").
>>>
>>> -compile(export_all).
>>>
>>> -record(message, {id,
>>>                  state}).
>>>
>>> %% run once
>>> setup(Nodes) ->
>>>    create_table(message, [{type, set}, {ram_copies, Nodes},
>>>                           {attributes, record_info(fields, message)}]),
>>>    ok = mnesia:wait_for_tables([message], 30000).
>>>
>>> create_table(Table, TableDefinition) ->
>>>    case mnesia:create_table(Table, TableDefinition) of
>>>        {atomic, ok} ->
>>>            error_logger:info_msg("Created table ~p~n", [Table]),
>>>            ok;
>>>        {aborted, {already_exists, Table}} ->
>>>            ok;
>>>        {aborted, Reason} ->
>>>            exit(Reason)
>>>    end.
>>>
>>> load_mnesia() ->
>>>    F = fun() ->
>>>                C = qlc:cursor(qlc:q([M || M <- mnesia:table(message),
>>>                                      M#message.state =:= new])),
>>>                case qlc:next_answers(C, 1) of
>>>                    [] ->
>>>                        none;
>>>                    [_M] ->
>>>                        none
>>>                end,
>>>                ok = qlc:delete_cursor(C)
>>>
>>>        end,
>>>    mnesia:transaction(F),
>>> %    timer:sleep(100),
>>>    load_mnesia().
>>>
>>> == 8< === Test code ends
>>>
>>> Here's how I run the code to demonstrate the problem (node names are
>>> examples):
>>> - On server1: mnesia:create_schema([egw@REDACTED, egw@REDACTED]).
>>> - On both nodes: mnesia:start().
>>> - On server1: jb_testing:setup([egw@REDACTED, egw@REDACTED]).
>>> - On server2: jb_testing:load_mnesia(). This starts busy-looping and
>>> creating load for mnesia from the current process, i.e. it locks up
>>> your shell.
>>> - On server1: halt().
>>>
>>> After a short while server2 reports mnesia as crashed. Some sample
>>> logging from server2, with mnesia debugging enabled:
>>>
>>> Erlang R13B (erts-5.7.1) [source] [64-bit] [smp:2:2] [rq:2]
>>> [async-threads:0] [hipe] [kernel-poll:false]
>>>
>>> Eshell V5.7.1  (abort with ^G)
>>> (egw@REDACTED)1> mnesia:start().
>>> Mnesia('egw@REDACTED'): mnesia_monitor starting: <0.69.0>
>>> Mnesia('egw@REDACTED'): Version: "4.4.9"
>>> [...lines of non-interesting start-up logging removed for brevity...]
>>> Mnesia('egw@REDACTED'): Mnesia started, 0 seconds
>>> ok
>>> (egw@REDACTED)2> Mnesia('egw@REDACTED'):
>>> Transaction log dump skipped (optional): schema_prepare
>>> Mnesia('egw@REDACTED'): Logging mnesia_up
>>> 'egw@REDACTED'
>>>
>>> (egw@REDACTED)3> Mnesia('egw@REDACTED'): write
>>> performed by {tid,4,<3261.108.0>} on record:
>>>        {schema,message,
>>>                [{name,message},
>>>                 {type,set},
>>>                 {ram_copies,['egw@REDACTED',
>>>                              'egw@REDACTED']},
>>>                 {disc_copies,[]},
>>>                 {disc_only_copies,[]},
>>>                 {load_order,0},
>>>                 {access_mode,read_write},
>>>                 {index,[]},
>>>                 {snmp,[]},
>>>                 {local_content,false},
>>>                 {record_name,message},
>>>                 {attributes,[id,state]},
>>>                 {user_properties,[]},
>>>                 {frag_properties,[]},
>>>
>>> {cookie,{{1240,390114,462850},'egw@REDACTED'}},
>>>                 {version,{{2,0},[]}}]}
>>> Mnesia('egw@REDACTED'): Transaction log dump skipped
>>> (optional): schema_prepare
>>> Mnesia('egw@REDACTED'): write performed by
>>> {tid,4,<3261.108.0>} on record:
>>>        {schema,message,
>>>                [{name,message},
>>>                 {type,set},
>>>                 {ram_copies,['egw@REDACTED',
>>>                              'egw@REDACTED']},
>>>                 {disc_copies,[]},
>>>                 {disc_only_copies,[]},
>>>                 {load_order,0},
>>>                 {access_mode,read_write},
>>>                 {index,[]},
>>>                 {snmp,[]},
>>>                 {local_content,false},
>>>                 {record_name,message},
>>>                 {attributes,[id,state]},
>>>                 {user_properties,[]},
>>>                 {frag_properties,[]},
>>>
>>> {cookie,{{1240,390114,462850},'egw@REDACTED'}},
>>>                 {version,{{2,0},[]}}]}
>>> Mnesia('egw@REDACTED'): Getting table message (ram_copies)
>>> from disc: {dumper,
>>>
>>>            create_table}
>>>
>>> (egw@REDACTED)3>
>>> (egw@REDACTED)3> jb_testing:ping_mnesia().
>>> Mnesia('egw@REDACTED'): Logging mnesia_down
>>> 'egw@REDACTED'
>>> Mnesia('egw@REDACTED'): Got mnesia_down from
>>> 'egw@REDACTED', reconfiguring...
>>> Mnesia('egw@REDACTED'): mnesia_monitor got FATAL ERROR
>>> from: <0.73.0>
>>>
>>> =ERROR REPORT==== 22-Apr-2009::11:49:17 ===
>>> Mnesia('egw@REDACTED'): ** ERROR ** (core dumped to file:
>>> "/home/jetbet/MnesiaCore.egw@REDACTED")
>>>  ** FATAL ** mnesia_tm crashed: {badarg,
>>>                                    [{mnesia_tm,send_to_pids,2},
>>>
>>> {mnesia_tm,reconfigure_coordinators,2},
>>>                                     {mnesia_tm,doit_loop,1},
>>>                                     {mnesia_sp,init_proc,4},
>>>                                     {proc_lib,init_p_do_apply,3}]}
>>> state: [<0.68.0>]
>>> Mnesia('egw@REDACTED'): mnesia_controller terminated:
>>> shutdown
>>>
>>> =ERROR REPORT==== 22-Apr-2009::11:49:27 ===
>>> ** Generic server mnesia_monitor terminating
>>> ** Last message in was {'EXIT',<0.68.0>,killed}
>>> ** When Server state == {state,<0.68.0>,[],
>>>                               ['egw@REDACTED'],
>>>                               true,[],undefined,[]}
>>> ** Reason for termination ==
>>> ** killed
>>>
>>> =ERROR REPORT==== 22-Apr-2009::11:49:27 ===
>>> ** Generic server mnesia_recover terminating
>>> ** Last message in was {'EXIT',<0.68.0>,killed}
>>> ** When Server state ==
>>> {state,<0.68.0>,undefined,undefined,undefined,0,true,
>>>                               []}
>>> ** Reason for termination ==
>>> ** killed
>>> ** exception exit: killed
>>> (egw@REDACTED)4>
>>> =INFO REPORT==== 22-Apr-2009::11:49:27 ===
>>>    application: mnesia
>>>    exited: killed
>>>    type: temporary
>>>
>>>
>>> The core dump it produced is not recognized by crashdump_viewer
>>> ("...is not an Erlang crash dump").
>>>
>>> Some notes:
>>> - Erlang version info: Erlang R13B (erts-5.7.1) [source] [64-bit]
>>> [smp:2:2] [rq:2] [async-threads:0] [hipe] [kernel-poll:false]
>>> - Same problem occurs with R13A
>>> - Our OS is CentOS (Red Hat) Linux, SMP-enabled 64-bit 2.6.18 kernel on
>>> Intel.
>>>
>>> - if we reduce the load for mnesia by introducing sleep (100 ms) in
>>> the test loop, the problem goes away or is at least less likely to
>>> appear.
>>> - we're using RAM tables in the example. Changing tables to
>>> disc_copies makes no difference.
>>> - changing the data lookup to the form
>>>    F = fun() ->
>>>                '$end_of_table' = mnesia:first(message)
>>>        end,
>>>  works: halt() on another node does not bring the busy one down.
>>>
>>> Any ideas?
>>>
>>> Thank you already in advance,
>>> Teemu Antti-Poika
>>> _______________________________________________
>>> erlang-questions mailing list
>>> erlang-questions@REDACTED
>>> http://www.erlang.org/mailman/listinfo/erlang-questions
>>>
>>
>


