[erlang-questions] Problems with Mnesia and failure situations

Teemu Antti-Poika anttipoi@REDACTED
Wed Apr 22 13:13:03 CEST 2009

Hello world,

We're in the process of adopting Erlang as part of our project and
have run into a problem with Mnesia's high-availability features. I'd
very much appreciate any insights into where we are going wrong.

Our setup is a two-node cluster, and the intention is to have, among
other things, replicated Mnesia tables. Both nodes read from and write
to the tables. The assumption is that if one node goes down, the other
continues alone until the failure has been fixed.
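
One way to watch what the surviving node sees when its peer goes away
is to subscribe to Mnesia's system events. The following is only a
sketch: the module name down_watch and its functions are hypothetical
and not part of our test code, and it assumes Mnesia is already
running on the node where it is started.

```erlang
-module(down_watch).
-export([start/0]).

%% Spawn a process that subscribes to Mnesia system events and logs
%% when Mnesia goes down or comes up on some node.
%% Assumption: Mnesia is running locally, otherwise the match on
%% {ok, _} fails and the spawned process exits.
start() ->
    spawn(fun() ->
                  {ok, _Node} = mnesia:subscribe(system),
                  loop()
          end).

loop() ->
    receive
        {mnesia_system_event, {mnesia_down, Node}} ->
            error_logger:info_msg("Mnesia down on ~p~n", [Node]),
            loop();
        {mnesia_system_event, {mnesia_up, Node}} ->
            error_logger:info_msg("Mnesia up on ~p~n", [Node]),
            loop();
        {mnesia_system_event, _Other} ->
            %% Ignore all other system events.
            loop()
    end.
```

Calling down_watch:start() on the node that is expected to survive
gives a log line for each mnesia_down and mnesia_up event it receives.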

Our failover tests failed badly: when one node was taken out, the
other died as well. I've managed to reduce the problem to the one-file
test below:

== 8< === Test code begins


-record(message, {id,

%% run once
setup(Nodes) ->
    create_table(message, [{type, set}, {ram_copies, Nodes},
                           {attributes, record_info(fields, message)}]),
    ok = mnesia:wait_for_tables([message], 30000).

create_table(Table, TableDefinition) ->
    case mnesia:create_table(Table, TableDefinition) of
        {atomic, ok} ->
            error_logger:info_msg("Created table ~p~n", [Table]),
        {aborted, {already_exists, Table}} ->
        {aborted, Reason} ->

load_mnesia() ->
    F = fun() ->
                C = qlc:cursor(qlc:q([M || M <- mnesia:table(message),
                                           M#message.state =:= new])),
                case qlc:next_answers(C, 1) of
                    [] ->
                    [_M] ->
                ok = qlc:delete_cursor(C)

%    timer:sleep(100),

== 8< === Test code ends

Here's how I run the code to demonstrate the problem (node names are examples):
- On server1: create_schema([egw@REDACTED,
- On both nodes: mnesia:start().
- On server1: jb_testing:setup([egw@REDACTED,
- On server2: jb_testing:load_mnesia(). This starts busy-looping and
creating load for mnesia from current process, i.e. locks up your
- On server1: halt().
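
The first two steps could be scripted roughly as follows. This is only
a sketch: the module repro_prep and the function prepare/1 are
hypothetical helpers, not part of jb_testing, and the nodes are assumed
to be connected already, with Mnesia stopped on all of them.

```erlang
-module(repro_prep).
-export([prepare/1]).

%% Hypothetical helper: create the schema on all given nodes, then
%% start Mnesia on each of them.
%% Assumption: Mnesia is stopped on every node when this is called.
prepare(Nodes) ->
    case mnesia:create_schema(Nodes) of
        ok ->
            ok;
        {error, {_Node, {already_exists, _}}} ->
            %% A schema from an earlier run is already in place.
            ok;
        {error, Reason} ->
            erlang:error({create_schema_failed, Reason})
    end,
    %% rpc:multicall/4 returns {Replies, BadNodes}.
    {Replies, BadNodes} = rpc:multicall(Nodes, mnesia, start, []),
    {Replies, BadNodes}.
```

On server1 one would call repro_prep:prepare/1 with the list of both
node names, then jb_testing:setup/1 with the same list, and continue
with the remaining steps by hand.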

After a short while server2 reports mnesia as crashed. Some sample
logging from server2, with mnesia debugging enabled:

Erlang R13B (erts-5.7.1) [source] [64-bit] [smp:2:2] [rq:2]
[async-threads:0] [hipe] [kernel-poll:false]

Eshell V5.7.1  (abort with ^G)
(egw@REDACTED)1> mnesia:start().
Mnesia('egw@REDACTED'): mnesia_monitor starting: <0.69.0>
Mnesia('egw@REDACTED'): Version: "4.4.9"
[...lines of non-interesting start-up logging removed for brevity...]
Mnesia('egw@REDACTED'): Mnesia started, 0 seconds
(egw@REDACTED)2> Mnesia('egw@REDACTED'):
Transaction log dump skipped (optional): schema_prepare
Mnesia('egw@REDACTED'): Logging mnesia_up 'egw@REDACTED'

(egw@REDACTED)3> Mnesia('egw@REDACTED'): write
performed by {tid,4,<3261.108.0>} on record:
Mnesia('egw@REDACTED'): Transaction log dump skipped
(optional): schema_prepare
Mnesia('egw@REDACTED'): write performed by
{tid,4,<3261.108.0>} on record:
Mnesia('egw@REDACTED'): Getting table message (ram_copies)
from disc: {dumper,


(egw@REDACTED)3> jb_testing:ping_mnesia().
Mnesia('egw@REDACTED'): Logging mnesia_down
Mnesia('egw@REDACTED'): Got mnesia_down from
'egw@REDACTED', reconfiguring...
Mnesia('egw@REDACTED'): mnesia_monitor got FATAL ERROR
from: <0.73.0>

=ERROR REPORT==== 22-Apr-2009::11:49:17 ===
Mnesia('egw@REDACTED'): ** ERROR ** (core dumped to file:
 ** FATAL ** mnesia_tm crashed: {badarg,
state: [<0.68.0>]
Mnesia('egw@REDACTED'): mnesia_controller terminated: shutdown

=ERROR REPORT==== 22-Apr-2009::11:49:27 ===
** Generic server mnesia_monitor terminating
** Last message in was {'EXIT',<0.68.0>,killed}
** When Server state == {state,<0.68.0>,[],
** Reason for termination ==
** killed

=ERROR REPORT==== 22-Apr-2009::11:49:27 ===
** Generic server mnesia_recover terminating
** Last message in was {'EXIT',<0.68.0>,killed}
** When Server state == {state,<0.68.0>,undefined,undefined,undefined,0,true,
** Reason for termination ==
** killed
** exception exit: killed
=INFO REPORT==== 22-Apr-2009::11:49:27 ===
    application: mnesia
    exited: killed
    type: temporary

The core dump that was produced is not recognized by crashdump_viewer
("...is not an Erlang crash dump").

Some notes:
- Erlang version info: Erlang R13B (erts-5.7.1) [source] [64-bit]
[smp:2:2] [rq:2] [async-threads:0] [hipe] [kernel-poll:false]
- Same problem occurs with R13A
- Our OS is CentOS (Red Hat) Linux, SMP-enabled 64-bit 2.6.18 kernel on Intel.

- if we reduce the load on mnesia by introducing a sleep (100 ms) in
the test loop, the problem goes away or is at least less likely to
occur.
- we're using RAM tables in the example. Changing tables to
disc_copies makes no difference.
- changing the data lookup to the form
    F = fun() ->
                '$end_of_table' = mnesia:first(message)
  works: halt() on the other node does not bring the busy one down.
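
For anyone who wants to try the mnesia:first/1 variant together with
the pause from the first note, it could be packaged roughly like this.
It is a sketch only: the module name lookup_probe and its functions are
hypothetical, it assumes a table named message already exists, and it
reports an aborted transaction instead of crashing the caller.

```erlang
-module(lookup_probe).
-export([first_key/0, poll/2]).

%% Read the first key of the message table inside a transaction.
first_key() ->
    case mnesia:transaction(fun() -> mnesia:first(message) end) of
        {atomic, '$end_of_table'} -> empty;
        {atomic, Key}             -> {ok, Key};
        {aborted, Reason}         -> {error, Reason}
    end.

%% Call first_key/0 N times, sleeping PauseMs milliseconds after each
%% call. PauseMs = 0 approximates the busy loop; 100 matches the sleep
%% mentioned in the notes. Returns the results, oldest first.
poll(N, PauseMs) when is_integer(N), N >= 0 ->
    poll(N, PauseMs, []).

poll(0, _PauseMs, Acc) ->
    lists:reverse(Acc);
poll(N, PauseMs, Acc) ->
    Result = first_key(),
    timer:sleep(PauseMs),
    poll(N - 1, PauseMs, [Result | Acc]).
```

With an empty table, lookup_probe:first_key() returns the atom empty;
any {error, Reason} entries in the list returned by poll/2 show which
calls were aborted.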

Any ideas?

Thank you in advance,
Teemu Antti-Poika
