Mnesia deadlock?

Thu Dec 29 17:28:09 CET 2005

Hi there!

Hope someone out there (Klacke, Dan?) understands something about this:
We seem to have hit a deadlock in Mnesia. All Mnesia calls on one of the
nodes hung forever.

Setup:

We have a system running on multiple nodes on a single host.
Mnesia version is 4.1.12.

A cron job runs every minute to back up the Mnesia tables spread across
the nodes. (It does this by forking a new beam that does an rpc:call()
towards one of the nodes in the running system, which in turn runs
mnesia:backup/1.) Admittedly, this is very often, but we only have
a small amount of data... :-)
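
A minimal sketch of such a backup call, for illustration only. The module
name, function name and the timeout argument are hypothetical and not taken
from the running system; the timeout is added here so that the calling beam
gives up instead of waiting forever if the target node stops answering.

```erlang
-module(backup_sketch).
-export([run/3]).

%% Hypothetical helper: ask Node to back up its Mnesia tables to File,
%% waiting at most Timeout milliseconds for the answer.
%% Returns ok | {error, Reason}.
run(Node, File, Timeout) ->
    case rpc:call(Node, mnesia, backup, [File], Timeout) of
        ok ->
            ok;
        {error, Reason} ->
            %% mnesia:backup/1 itself reported a failure.
            {error, Reason};
        {badrpc, Reason} ->
            %% Node unreachable, call timed out, or remote crash.
            {error, {badrpc, Reason}}
    end.
```

Note that a timeout only frees the caller; it does not stop whatever is
still running on the target node.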

Tables are ram_copies and disk_copies. Some tables reside on one subset
of the nodes, others on different node subsets.

We use the default values of 3 minutes/1000 transactions for the dump
threshold.
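
Written out explicitly, the thresholds described above would look roughly
like the following sys.config fragment. This is a sketch only: the values
are the ones quoted above, and mapping the "1000" figure to
dump_log_write_threshold is an assumption.

```erlang
%% sys.config fragment (sketch; values as quoted in the text above).
[{mnesia,
  [{dump_log_time_threshold, 180000},    %% milliseconds, i.e. 3 minutes
   {dump_log_write_threshold, 1000}]}].  %% assumed to be the "1000" above
```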

Problem:

The system had been running for more than a year with no problems when
suddenly all calls to mnesia on the node that mnesia:backup/1 was called
on timed out.

The error logger output had a number of mnesia overload warnings, both
{dump_log,time_threshold} and {mnesia_tm,message_queue_len,[<value>]}

Most functionality in the system was still intact since we seldom write
to the tables.

This led to a lot of hanging cron-initiated processes, which eventually
caused a high load/low memory situation that became a problem.

A simple restart of the system fixed the immediate problem.

Analysis:

I guess mnesia is sufficiently well tested by many systems over the years
to not have any obvious bugs in the transaction manager, so we guessed it
might have something to do with the backup, specifically the interaction
between the checkpointing done by the backup and the periodic dump that
mnesia performs.

Since we don't have a lot of writes when the system is running, dumping was
almost always taking place every third minute.

On the theory that dumping might be messed up by running a backup at the
same time, we set up a test system that did a backup every second and set
the dump_log_time_threshold to 500 milliseconds.
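
A test setup along these lines can be sketched as below. This is not the
actual test code; the module name, the backup file naming and the output
format are made up for the example. It assumes the time threshold is set
before Mnesia is started so that it is in place at startup.

```erlang
-module(backup_stress).
-export([start/1, loop/2, backup_file/2]).

%% Set a 500 ms dump threshold, start Mnesia, then back up once a second.
start(Dir) ->
    case application:load(mnesia) of
        ok -> ok;
        {error, {already_loaded, mnesia}} -> ok
    end,
    ok = application:set_env(mnesia, dump_log_time_threshold, 500),
    ok = mnesia:start(),
    spawn(?MODULE, loop, [Dir, 1]).

%% Run mnesia:backup/1 once per second, each time to a new file.
loop(Dir, N) ->
    Result = mnesia:backup(backup_file(Dir, N)),
    io:format("backup ~p: ~p~n", [N, Result]),
    timer:sleep(1000),
    loop(Dir, N + 1).

%% Hypothetical naming scheme for the backup files.
backup_file(Dir, N) ->
    filename:join(Dir, "backup_" ++ integer_to_list(N) ++ ".bup").
```

If the hang reproduces, the loop simply stops printing, since the
mnesia:backup/1 call never returns.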

We didn't have to wait more than 5-10 minutes before we got (almost) the
same situation as on the live system. The only difference was that we
didn't get the errors from mnesia_tm about queue length.

All mnesia-related calls to the main node (the one performing the backup)
that we tried hung, including things like mnesia:system_info().

Does anyone have any explanation?

Thanks a lot for any input you might have!
/Adam Aquilon
Cellpoint