[Ericsson AB]

6 Mnesia System Information

6.1 Database Configuration Data

The following two functions can be used to retrieve system information. They are described in detail in the reference manual.

6.2 Core Dumps

If Mnesia malfunctions, system information is dumped to a file named MnesiaCore.Node.When. The type of system information contained in this file can also be generated with the function mnesia_lib:coredump(). If a Mnesia system behaves strangely, it is recommended that a Mnesia core dump file be included in the bug report.

6.3 Dumping Tables

Tables of type ram_copies are by definition stored in memory only. It is possible, however, to dump these tables to disc, either at regular intervals, or before the system is shutdown. The function mnesia:dump_tables(TabList) dumps all replicas of a set of RAM tables to disc. The tables can be accessed while being dumped to disc. To dump the tables to disc all replicas must have the storage type ram_copies.

The table content is placed in a .DCD file on the disc. When the Mnesia system is started, the RAM table will initially be loaded with data from its .DCD file.

6.4 Checkpoints

A checkpoint is a transaction consistent state that spans over one or more tables. When a checkpoint is activated, the system will remember the current content of the set of tables. The checkpoint retains a transaction consistent state of the tables, allowing the tables to be read and updated while the checkpoint is active. A checkpoint is typically used to back up tables to external media, but they are also used internally in Mnesia for other purposes. Each checkpoint is independent and a table may be involved in several checkpoints simultaneously.

Each table retains its old contents in a checkpoint retainer and for performance critical applications, it may be important to realize the processing overhead associated with checkpoints. In a worst case scenario, the checkpoint retainer will consume even more memory than the table itself. Each update will also be slightly slower on those nodes where checkpoint retainers are attached to the tables.

For each table it is possible to choose if there should be one checkpoint retainer attached to all replicas of the table, or if it is enough to have only one checkpoint retainer attached to a single replica. With a single checkpoint retainer per table, the checkpoint will consume less memory, but it will be vulnerable to node crashes. With several redundant checkpoint retainers the checkpoint will survive as long as there is at least one active checkpoint retainer attached to each table.

Checkpoints may be explicitly deactivated with the function mnesia:deactivate_checkpoint(Name), where Name is the name of an active checkpoint. This function returns ok if successful, or {error, Reason} in the case of an error. All tables in a checkpoint must be attached to at least one checkpoint retainer. The checkpoint is automatically de-activated by Mnesia, when any table lacks a checkpoint retainer. This may happen when a node goes down or when a replica is deleted. Use the min and max arguments described below, to control the degree of checkpoint retainer redundancy.

Checkpoints are activated with the function mnesia:activate_checkpoint(Args), where Args is a list of the following tuples:

The mnesia:activate_checkpoint(Args) returns one of the following values:

Name is the name of the checkpoint, and Nodes are the nodes where the checkpoint is known.

A list of active checkpoints can be obtained with the following functions:

6.5 Files

This section describes the internal files which are created and maintained by the Mnesia system, in particular, the workings of the Mnesia log is described.

6.5.1 Start-Up Files

In Chapter 3 we detailed the following pre-requisites for starting Mnesia (refer Chapter 3: Starting Mnesia:

The following example shows how these tasks are performed:

  1. % erl -sname klacke -mnesia dir '"/ldisc/scratch/klacke"'        
    
  2. Erlang (BEAM) emulator version 4.9
     
    Eshell V4.9  (abort with ^G)
    (klacke@gin)1> mnesia:create_schema([node()]).
    ok
    (klacke@gin)2> 
    ^Z
    Suspended        
    
    We can inspect the Mnesia directory to see what files have been created. Enter the following command:
    % ls -l /ldisc/scratch/klacke
    -rw-rw-r--   1 klacke   staff       247 Aug 12 15:06 FALLBACK.BUP        
    
    The response shows that the file FALLBACK.BUP has been created. This is called a backup file, and it contains an initial schema. If we had specified more than one node in the mnesia:create_schema/1 function, identical backup files would have been created on all nodes.
  3. Continue by starting Mnesia:
    (klacke@gin)3>mnesia:start( ).
    ok        
    
    We can now see the following listing in the Mnesia directory:
    -rw-rw-r--   1 klacke   staff         86 May 26 19:03 LATEST.LOG
    -rw-rw-r--   1 klacke   staff      34507 May 26 19:03 schema.DAT        
    
    The schema in the backup file FALLBACK.BUP has been used to generate the file schema.DAT. Since we have no other disc resident tables than the schema, no other data files were created. The file FALLBACK.BUP was removed after the successful "restoration". We also see a number of files that are for internal use by Mnesia.
  4. Enter the following command to create a table:
    (klacke@gin)4> mnesia:create_table(foo,[{disc_copies, [node()]}]).
    {atomic,ok}        
    
    We can now see the following listing in the Mnesia directory:
    % ls -l /ldisc/scratch/klacke
    -rw-rw-r-- 1 klacke staff    86 May 26 19:07 LATEST.LOG
    -rw-rw-r-- 1 klacke staff    94 May 26 19:07 foo.DCD
    -rw-rw-r-- 1 klacke staff  6679 May 26 19:07 schema.DAT        
    
    Where a file foo.DCD has been created. This file will eventually store all data that is written into the foo table.

6.5.2 The Log File

When starting Mnesia, a .LOG file called LATEST.LOG was created and placed in the database directory. This file is used by Mnesia to log disc based transactions. This includes all transactions that write at least one record in a table which is of storage type disc_copies, or disc_only_copies. It also includes all operations which manipulate the schema itself, such as creating new tables. The format of the log can vary with different implementations of Mnesia. The Mnesia log is currently implemented with the standard library module disc_log.

The log file will grow continuously and must be dumped at regular intervals. "Dumping the log file" means that Mnesia will perform all the operations listed in the log and place the records in the corresponding .DAT, .DCD and .DCL data files. For example, if the operation "write record {foo, 4, elvis, 6}" is listed in the log, Mnesia inserts the operation into the file foo.DCL, later when Mnesia thinks the .DCL has become to large the data is moved to the .DCD file. The dumping operation can be time consuming if the log is very large. However, it is important to realize that the Mnesia system continues to operate during log dumps.

By default Mnesia either dumps the log whenever 100 records have been written in the log or when 3 minutes have passed. This is controlled by the two application parameters -mnesia dump_log_write_threshold WriteOperations and -mnesia dump_log_time_threshold MilliSecs.

Before the log is dumped, the file LATEST.LOG is renamed to PREVIOUS.LOG, and a new LATEST.LOG file is created. Once the log has been successfully dumped, the file PREVIOUS.LOG is deleted.

The log is also dumped at start-up and whenever a schema operation is performed.

6.5.3 The Data Files

The directory listing also contains one .DAT file. This contain the schema itself, contained in the schema.DAT file. The DAT files are indexed files, and it is efficient to insert and search for records in these files with a specific key. The .DAT files are used for the schema and for disc_only_copies tables. The Mnesia data files are currently implemented with the standard library module dets, and all operations which can be performed on dets files can also be performed on the Mnesia data files. For example, dets contains a function dets:traverse/2 which can be used to view the contents of a Mnesia DAT file. However, this can only be done when Mnesia is not running. So, to view a our schema file, we can:

{ok, N} = dets:open_file(schema, [{file, "./schema.DAT"},{repair,false}, 
{keypos, 2}]),
F = fun(X) -> io:format("~p~n", [X]), continue end,
dets:traverse(N, F),
dets:close(N).      
Note

Refer to the Reference Manual, std_lib for information about dets.

Warning

The DAT files must always be opened with the {repair, false} option. This ensures that these files are not automatically repaired. Without this option, the database may become inconsistent, because Mnesia may believe that the files were properly closed. Refer to the reference manual for information about the configuration parameter auto_repair.

Warning

It is recommended that Data files are not tampered with while Mnesia is running. While not prohibited, the behavior of Mnesia is unpredictable.

The disc_copies tables are stored on disk with .DCL and .DCD files, which are standard disk_log files.

6.6 Loading of Tables at Start-up

At start-up Mnesia loads tables in order to make them accessible for its applications. Sometimes Mnesia decides to load all tables that reside locally, and sometimes the tables may not be accessible until Mnesia brings a copy of the table from another node.

To understand the behavior of Mnesia at start-up it is essential to understand how Mnesia reacts when it loses contact with Mnesia on another node. At this stage, Mnesia cannot distinguish between a communication failure and a "normal" node down.
When this happens, Mnesia will assume that the other node is no longer running. Whereas, in reality, the communication between the nodes has merely failed.

To overcome this situation, simply try to restart the ongoing transactions that are accessing tables on the failing node, and write a mnesia_down entry to a log file.

At start-up, it must be noted that all tables residing on nodes without a mnesia_down entry, may have fresher replicas. Their replicas may have been updated after the termination of Mnesia on the current node. In order to catch up with the latest updates, transfer a copy of the table from one of these other "fresh" nodes. If you are unlucky, other nodes may be down and you must wait for the table to be loaded on one of these nodes before receiving a fresh copy of the table.

Before an application makes its first access to a table, mnesia:wait_for_tables(TabList, Timeout) ought to be executed to ensure that the table is accessible from the local node. If the function times out the application may choose to force a load of the local replica with mnesia:force_load_table(Tab) and deliberately lose all updates that may have been performed on the other nodes while the local node was down. If Mnesia already has loaded the table on another node or intends to do so, we will copy the table from that node in order to avoid unnecessary inconsistency.

Warning

Keep in mind that it is only one table that is loaded by mnesia:force_load_table(Tab) and since committed transactions may have caused updates in several tables, the tables may now become inconsistent due to the forced load.

The allowed AccessMode of a table may be defined to either be read_only or read_write. And it may be toggled with the function mnesia:change_table_access_mode(Tab, AccessMode) in runtime. read_only tables and local_content tables will always be loaded locally, since there are no need for copying the table from other nodes. Other tables will primary be loaded remotely from active replicas on other nodes if the table already has been loaded there, or if the running Mnesia already has decided to load the table there.

At start up, Mnesia will assume that its local replica is the most recent version and load the table from disc if either situation is detected:

This is normally a wise decision, but it may turn out to be disastrous if the nodes have been disconnected due to a communication failure, since Mnesia's normal table load mechanism does not cope with communication failures.

When Mnesia is loading many tables the default load order. However, it is possible to affect the load order by explicitly changing the load_order property for the tables, with the function mnesia:change_table_load_order(Tab, LoadOrder). The LoadOrder is by default 0 for all tables, but it can be set to any integer. The table with the highest load_order will be loaded first. Changing load order is especially useful for applications that need to ensure early availability of fundamental tables. Large peripheral tables should have a low load order value, perhaps set below 0.

6.7 Recovery from Communication Failure

There are several occasions when Mnesia may detect that the network has been partitioned due to a communication failure.

One is when Mnesia already is up and running and the Erlang nodes gain contact again. Then Mnesia will try to contact Mnesia on the other node to see if it also thinks that the network has been partitioned for a while. If Mnesia on both nodes has logged mnesia_down entries from each other, Mnesia generates a system event, called {inconsistent_database, running_partitioned_network, Node} which is sent to Mnesia's event handler and other possible subscribers. The default event handler reports an error to the error logger.

Another occasion when Mnesia may detect that the network has been partitioned due to a communication failure, is at start-up. If Mnesia detects that both the local node and another node received mnesia_down from each other it generates a {inconsistent_database, starting_partitioned_network, Node} system event and acts as described above.

If the application detects that there has been a communication failure which may have caused an inconsistent database, it may use the function mnesia:set_master_nodes(Tab, Nodes) to pinpoint from which nodes each table may be loaded.

At start-up Mnesia's normal table load algorithm will be bypassed and the table will be loaded from one of the master nodes defined for the table, regardless of potential mnesia_down entries in the log. The Nodes may only contain nodes where the table has a replica and if it is empty, the master node recovery mechanism for the particular table will be reset and the normal load mechanism will be used when next restarting.

The function mnesia:set_master_nodes(Nodes) sets master nodes for all tables. For each table it will determine its replica nodes and invoke mnesia:set_master_nodes(Tab, TabNodes) with those replica nodes that are included in the Nodes list (i.e. TabNodes is the intersection of Nodes and the replica nodes of the table). If the intersection is empty the master node recovery mechanism for the particular table will be reset and the normal load mechanism will be used at next restart.

The functions mnesia:system_info(master_node_tables) and mnesia:table_info(Tab, master_nodes) may be used to obtain information about the potential master nodes.

The function mnesia:force_load_table(Tab) may be used to force load the table regardless of which table load mechanism is activated.

6.8 Recovery of Transactions

A Mnesia table may reside on one or more nodes. When a table is updated, Mnesia will ensure that the updates will be replicated to all nodes where the table resides. If a replica happens to be inaccessible for some reason (e.g. due to a temporary node down), Mnesia will then perform the replication later.

On the node where the application is started, there will be a transaction coordinator process. If the transaction is distributed, there will also be a transaction participant process on all the other nodes where commit work needs to be performed.

Internally Mnesia uses several commit protocols. The selected protocol depends on which table that has been updated in the transaction. If all the involved tables are symmetrically replicated, (i.e. they all have the same ram_nodes, disc_nodes and disc_only_nodes currently accessible from the coordinator node), a lightweight transaction commit protocol is used.

The number of messages that the transaction coordinator and its participants needs to exchange is few, since Mnesia's table load mechanism takes care of the transaction recovery if the commit protocol gets interrupted. Since all involved tables are replicated symmetrically the transaction will automatically be recovered by loading the involved tables from the same node at start-up of a failing node. We do not really care if the transaction was aborted or committed as long as we can ensure the ACID properties. The lightweight commit protocol is non-blocking, i.e. the surviving participants and their coordinator will finish the transaction, regardless of some node crashes in the middle of the commit protocol or not.

If a node goes down in the middle of a dirty operation the table load mechanism will ensure that the update will be performed on all replicas or none. Both asynchronous dirty updates and synchronous dirty updates use the same recovery principle as lightweight transactions.

If a transaction involves updates of asymmetrically replicated tables or updates of the schema table, a heavyweight commit protocol will be used. The heavyweight commit protocol is able to finish the transaction regardless of how the tables are replicated. The typical usage of a heavyweight transaction is when we want to move a replica from one node to another. Then we must ensure that the replica either is entirely moved or left as it was. We must never end up in a situation with replicas on both nodes or no node at all. Even if a node crashes in the middle of the commit protocol, the transaction must be guaranteed to be atomic. The heavyweight commit protocol involves more messages between the transaction coordinator and its participants than a lightweight protocol and it will perform recovery work at start-up in order to finish the abort or commit work.

The heavyweight commit protocol is also non-blocking, which allows the surviving participants and their coordinator to finish the transaction regardless (even if a node crashes in the middle of the commit protocol). When a node fails at start-up, Mnesia will determine the outcome of the transaction and recover it. Lightweight protocols, heavyweight protocols and dirty updates, are dependent on other nodes to be up and running in order to make the correct heavyweight transaction recovery decision.

If Mnesia has not started on some of the nodes that are involved in the transaction AND neither the local node or any of the already running nodes know the outcome of the transaction, Mnesia will by default wait for one. In the worst case scenario all other involved nodes must start before Mnesia can make the correct decision about the transaction and finish its start-up.

This means that Mnesia (on one node)may hang if a double fault occurs, i.e. when two nodes crash simultaneously and one attempts to start when the other refuses to start e.g. due to a hardware error.

It is possible to specify the maximum time that Mnesia will wait for other nodes to respond with a transaction recovery decision. The configuration parameter max_wait_for_decision defaults to infinity (which may cause the indefinite hanging as mentioned above) but if it is set to a definite time period (eg.three minutes), Mnesia will then enforce a transaction recovery decision if needed, in order to allow Mnesia to continue with its start-up procedure.

The downside of an enforced transaction recovery decision, is that the decision may be incorrect, due to insufficient information regarding the other nodes' recovery decisions. This may result in an inconsistent database where Mnesia has committed the transaction on some nodes but aborted it on others.

In fortunate cases the inconsistency will only appear in tables belonging to a specific application, but if a schema transaction has been inconsistently recovered due to the enforced transaction recovery decision, the effects of the inconsistency can be fatal. However, if the higher priority is availability rather than consistency, then it may be worth the risk.

If Mnesia encounters a inconsistent transaction decision a {inconsistent_database, bad_decision, Node} system event will be generated in order to give the application a chance to install a fallback or other appropriate measures to resolve the inconsistency. The default behavior of the Mnesia event handler is the same as if the database became inconsistent as a result of partitioned network (see above).

6.9 Backup, Fallback, and Disaster Recovery

The following functions are used to backup data, to install a backup as fallback, and for disaster recovery.

These functions are explained in the following sub-sections. Also refer to the the section Checkpoints in this chapter, which describes the two functions used to activate and de-activate checkpoints.

6.9.1 Backup

Backup operation are performed with the following functions:

By default, the actual access to the backup media is performed via the mnesia_backup module for both read and write. Currently mnesia_backup is implemented with the standard library module disc_log, but it is possible to write your own module with the same interface as mnesia_backup and configure Mnesia so the alternate module performs the actual accesses to the backup media. This means that the user may put the backup on medias that Mnesia does not know about, possibly on hosts where Erlang is not running. Use the configuration parameter -mnesia backup_module <module> for this purpose.

The source for a backup is an activated checkpoint. The backup function most commonly used is mnesia:backup_checkpoint(Name, Opaque,[Mod]). This function returns either ok, or {error,Reason}. It has the following arguments:

The function mnesia:backup(Opaque[, Mod]) activates a new checkpoint which covers all Mnesia tables with maximum degree of redundancy and performs a backup. Maximum redundancy means that each table replica has a checkpoint retainer. Tables with the local_contents property are backed up as they look on the current node.

It is possible to iterate over a backup, either for the purpose of transforming it into a new backup, or just reading it. The function mnesia:traverse_backup(Source, [SourceMod,]Target, [TargeMod,] Fun, Acc) which normally returns {ok, LastAcc}, is used for both of these purposes.

Before the traversal starts, the source backup media is opened with SourceMod:open_read(Source), and the target backup media is opened with TargetMod:open_write(Target). The arguments are:

It is also possible to perform a read-only traversal of the source backup without updating a target backup. If TargetMod==read_only, then no target backup is accessed at all.

By setting SourceMod and TargetMod to different modules it is possible to copy a backup from one kind of backup media to another.

Valid BackupItems are the following tuples:

The backup data is divided into two sections. The first section contains information related to the schema. All schema related items are tuples where the first field equals the atom schema. The second section is the record section. It is not possible to mix schema records with other records and all schema records must be located first in the backup.

The schema itself is a table and will possibly be included in the backup. All nodes where the schema table resides are regarded as a db_node.

The following example illustrates how mnesia:traverse_backup can be used to rename a db_node in a backup file:

change_node_name(Mod, From, To, Source, Target) ->
    Switch =
        fun(Node) when Node == From -> To;
           (Node) when Node == To -> throw({error, already_exists});
           (Node) -> Node
        end,
    Convert =
        fun({schema, db_nodes, Nodes}, Acc) ->
                {[{schema, db_nodes, lists:map(Switch,Nodes)}], Acc};
           ({schema, version, Version}, Acc) ->
                {[{schema, version, Version}], Acc};
           ({schema, cookie, Cookie}, Acc) ->
                {[{schema, cookie, Cookie}], Acc};
           ({schema, Tab, CreateList}, Acc) ->
                Keys = [ram_copies, disc_copies, disc_only_copies],
                OptSwitch =
                    fun({Key, Val}) ->
                            case lists:member(Key, Keys) of
                                true -> {Key, lists:map(Switch, Val)};
                                false-> {Key, Val}
                            end
                    end,
                {[{schema, Tab, lists:map(OptSwitch, CreateList)}], Acc};
           (Other, Acc) ->
                {[Other], Acc}
        end,
    mnesia:traverse_backup(Source, Mod, Target, Mod, Convert, switched).

view(Source, Mod) ->
    View = fun(Item, Acc) ->
                   io:format("~p.~n",[Item]),
                   {[Item], Acc + 1}
           end,
    mnesia:traverse_backup(Source, Mod, dummy, read_only, View, 0).

6.9.2 Restore

Tables can be restored on-line from a backup without restarting Mnesia. A restore is performed with the function mnesia:restore(Opaque,Args), where Args can contain the following tuples:

The argument Opaque is forwarded to the backup module. It returns {atomic, TabList} if successful, or the tuple {aborted, Reason} in the case of an error. TabList is a list of the restored tables. Tables which are restored are write locked for the duration of the restore operation. However, regardless of any lock conflict caused by this, applications can continue to do their work during the restore operation.

The restoration is performed as a single transaction. If the database is very large, it may not be possible to restore it online. In such a case the old database must be restored by installing a fallback, and then restart.

6.9.3 Fallbacks

The function mnesia:install_fallback(Opaque, [Mod]) is used to install a backup as fallback. It uses the backup module Mod, or the default backup module, to access the backup media. This function returns ok if successful, or {error, Reason} in the case of an error.

Installing a fallback is a distributed operation that is only performed on all db_nodes. The fallback is used to restore the database the next time the system is started. If a Mnesia node with a fallback installed detects that Mnesia on another node has died for some reason, it will unconditionally terminate itself.

A fallback is typically used when a system upgrade is performed. A system typically involves the installation of new software versions, and Mnesia tables are often transformed into new layouts. If the system crashes during an upgrade, it is highly probable re-installation of the old applications will be required and restoration of the database to its previous state. This can be done if a backup is performed and installed as a fallback before the system upgrade begins.

If the system upgrade fails, Mnesia must be restarted on all db_nodes in order to restore the old database. The fallback will be automatically de-installed after a successful start-up. The function mnesia:uninstall_fallback() may also be used to de-install the fallback after a successful system upgrade. Again, this is a distributed operation that is either performed on all db_nodes, or none. Both the installation and de-installation of fallbacks require Erlang to be up and running on all db_nodes, but it does not matter if Mnesia is running or not.

6.9.4 Disaster Recovery

The system may become inconsistent as a result of a power failure. The UNIX fsck feature can possibly repair the file system, but there is no guarantee that the file contents will be consistent.

If Mnesia detects that a file has not been properly closed, possibly as a result of a power failure, it will attempt to repair the bad file in a similar manner. Data may be lost, but Mnesia can be restarted even if the data is inconsistent. The configuration parameter -mnesia auto_repair <bool> can be used to control the behavior of Mnesia at start-up. If <bool> has the value true, Mnesia will attempt to repair the file; if <bool> has the value false, Mnesia will not restart if it detects a suspect file. This configuration parameter affects the repair behavior of log files, DAT files, and the default backup media.

The configuration parameter -mnesia dump_log_update_in_place <bool> controls the safety level of the mnesia:dump_log() function. By default, Mnesia will dump the transaction log directly into the DAT files. If a power failure happens during the dump, this may cause the randomly accessed DAT files to become corrupt. If the parameter is set to false, Mnesia will copy the DAT files and target the dump to the new temporary files. If the dump is successful, the temporary files will be renamed to their normal DAT suffixes. The possibility for unrecoverable inconsistencies in the data files will be much smaller with this strategy. On the other hand, the actual dumping of the transaction log will be considerably slower. The system designer must decide whether speed or safety is the higher priority.

Replicas of type disc_only_copies will only be affected by this parameter during the initial dump of the log file at start-up. When designing applications which have very high requirements, it may be appropriate not to use disc_only_copies tables at all. The reason for this is the random access nature of normal operating system files. If a node goes down for reason for a reason such as a power failure, these files may be corrupted because they are not properly closed. The DAT files for disc_only_copies are updated on a per transaction basis.

If a disaster occurs and the Mnesia database has been corrupted, it can be reconstructed from a backup. This should be regarded as a last resort, since the backup contains old data. The data is hopefully consistent, but data will definitely be lost when an old backup is used to restore the database.


mnesia 4.4.3
Copyright © 1991-2008 Ericsson AB