[erlang-questions] mnesia sync_transactions not fsynced?

Mon Oct 31 15:00:09 CET 2011

On 31 Oct 2011, at 04:39, Jon Watte wrote:

> 
> On Sun, Oct 30, 2011 at 3:59 AM, Dan Gudmundsson <dangud@REDACTED> wrote:
> As far as mnesia is concerned it is logged to disc. It has left the
> mnesia call chain and there is nothing
> more mnesia can do, except sync the disk.
> Which is a performance penalty that is not acceptable per transaction.
> 
> 
> "Not acceptable" to who? That's what a durable transaction *is*. For most users of actual databases, it is not acceptable that a (durable, isolated) transaction does *not* sync the disk when it claims to synchronize and its results are globally visible.

Mnesia was never designed to be durable in the same sense as e.g. Oracle et al. If you want to be able to guarantee durability in a single-node installation with rotating disks, you should probably use a raw partition in the first place, and take complete control of memory management, including caching. Most traditional RDBMSs had to do this, as replication did not become an option until much later. For example, PostgreSQL didn't introduce synchronous replication until release 9.1, which came out in September 2011.

Mnesia is primarily a RAM database, designed for distributed systems. These systems rely more on redundancy than on disk-based durability.

While one could imagine adding an option for mnesia to sync the disk, it would not be acceptable if it couldn't be turned off, or at least done only periodically.
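
To make the trade-off concrete, here is a minimal Python sketch of an append-only log write where syncing is an option that can be switched off. It is not how mnesia writes its log; the function name `append_record`, the `durable` flag and the file names are hypothetical, chosen only for illustration.

```python
import os
import tempfile


def append_record(path, record, durable=False):
    """Append one record (bytes) to a log file.

    With durable=False the data is only handed to the OS, which is free
    to write it out later. With durable=True we additionally call
    os.fsync() and wait for it, which is the per-write cost under
    discussion. Returns the number of bytes appended.
    """
    with open(path, "ab") as f:
        f.write(record)
        f.flush()                 # move data from Python's buffer to the OS
        if durable:
            os.fsync(f.fileno())  # ask the OS to flush this file to the device
    return len(record)


if __name__ == "__main__":
    path = os.path.join(tempfile.mkdtemp(), "demo.log")
    append_record(path, b"fast\n")                # returns once the OS has the data
    append_record(path, b"safe\n", durable=True)  # also waits for fsync
```

Even with `durable=True`, what "on disk" means still depends on the OS, the file system and any write cache in the drive, so this sketch only shows where the extra wait is introduced.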

The sync(8) man page also includes this caveat:

"On Linux, sync is only guaranteed to schedule the dirty blocks for writing; it can actually take a short time before all the blocks are finally written. The reboot(8) and halt(8) commands take this into account by sleeping for a few seconds after calling sync(2)."

See also http://stackoverflow.com/questions/7897628/does-a-typical-acid-rdbms-sync-to-disk-every-commit

"To achieve high performance, databases will use group commits whereby multiple transactions in a commit cycle will use the same write/sync operation to make all of the transactions durable. This is possible where they are all appending to the same transaction log.

This may mean that the response of an individual commit may be delayed (while waiting for others to join the commit cycle) but the overall throughput is much greater across the whole database because the cost of the write/sync is amortized across multiple transactions. For example, each individual transaction may take 10mS, but thousands of transactions are all able to commit in the same cycle."

A notable difference between Mnesia and e.g. Oracle is that mnesia has _much_ better response times. One contributing factor is that mnesia doesn't delay transactions in order to increase throughput through batching. In the domain where mnesia is used, latency is usually more important than throughput.
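
The batching described in the quoted answer can be sketched roughly as follows. This is a toy Python model of group commit, not code from any real database and not something mnesia does; the class `GroupCommitLog`, its methods and the helper `amortized_sync_cost_ms` are hypothetical names.

```python
import os
import tempfile


class GroupCommitLog:
    """Toy append-only log that makes many commits durable with one fsync."""

    def __init__(self, path):
        self._f = open(path, "ab")
        self._pending = []   # records waiting for the next commit cycle
        self.sync_count = 0  # number of fsync calls issued so far

    def submit(self, record):
        """Queue a record; it is not synced until flush_cycle() returns."""
        self._pending.append(record)

    def flush_cycle(self):
        """Write all queued records, then fsync once for the whole batch.

        Returns the number of records covered by this cycle.
        """
        batch, self._pending = self._pending, []
        if not batch:
            return 0
        self._f.write(b"".join(batch))
        self._f.flush()
        os.fsync(self._f.fileno())
        self.sync_count += 1
        return len(batch)

    def close(self):
        self._f.close()


def amortized_sync_cost_ms(sync_ms, batch_size):
    """Per-transaction share of one sync spread over a batch."""
    return sync_ms / batch_size


if __name__ == "__main__":
    log = GroupCommitLog(os.path.join(tempfile.mkdtemp(), "txn.log"))
    for i in range(1000):
        log.submit(b"txn %d\n" % i)
    log.flush_cycle()  # one write and one fsync cover all 1000 records
    log.close()
```

The cost shows up as latency: every caller of `submit` has to wait for the cycle to finish before its transaction counts as committed, which is exactly the delay mnesia chooses not to impose.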

Also from the Stack Overflow thread:

"This also raises the question of what is durable? Is a single disk durable? Not if the disk fails. Is a RAID array durable? Not if there is a catastrophic RAID corruption. The only guarantee of durability is where transactions are replicated across multiple remote database instances - but not everybody needs that level of guarantee. Durability should not be considered a binary option but rather as a choice of level of durability."

BR,
Ulf W

Ulf Wiger, CTO, Erlang Solutions, Ltd.
http://erlang-solutions.com
