[erlang-questions] Combining ets and mnesia operations for extreme performance

Thu Sep 11 20:06:02 CEST 2008

Hi all,

This one is for mnesia hackers (Ulf, Joel, ...)

I have a scenario where I am building aggregations over some data, most 
of the time counting things. To get the performance I want, I cannot use 
mnesia transactions naively, and even dirty_writes would be too slow and 
would not give me the atomicity I need.

A typical scenario would be to atomically apply e.g. 1 million 
increments spread over some hundreds or thousands of keys in some dozen 
tables. The atomicity I need is from the durability point of view, i.e. 
either all increments are processed and made persistent or, if the 
machine crashes before that, none of them is made persistent. I 
don't need to worry about atomicity from the lookup point of view (i.e. 
there would be no problem doing a lookup and reading some intermediate 
value).

This leads to my strategy:
  - using mnesia disc_copies;
  - operating directly on the underlying ets tables; e.g.
    ets:update_counter
  - marking which keys become "dirty";
  - then, in a single mnesia:transaction:
    - ets:lookup the dirty keys
    - mnesia:write these records

Basically, it has been working for me. But as I am doing something I 
shouldn't (updating the ets tables directly), I am asking what could go 
wrong. I thought about it a bit and could only see one potential problem: 
mnesia might dump the ets table to the DCD file after I have already 
started a subsequent aggregation and done some ets operations myself, in 
which case partially applied increments would become persistent.

Therefore I ask: when exactly does mnesia try to see whether to dump a 
table or not (according to dc_dump_limit)? Can it be after a 
mnesia:transaction finished? How long after?

The "solution" I came up with is:
   - prevent mnesia from dumping ets tables to DCD by putting an 
extremely low value in dc_dump_limit; e.g. 0.000001; btw, can floats be 
used here?
   - take control of the dumping, deciding myself whether to do it for 
each table and, if so, doing it after the mnesia:transaction and before 
starting to mess with the ets tables again in the subsequent aggregation;
     - I found out I can do this with mnesia_log:ets2dcd. Is this the 
right way to do it?
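
To make the two steps concrete, here is a minimal sketch of what I mean.
The module and function names are hypothetical. It assumes that
dc_dump_limit is set before mnesia is started, and whether mnesia accepts
a float there is exactly what I am asking. mnesia_log:ets2dcd/1 is an
internal, undocumented function, so this may break between releases.

```erlang
-module(dump_control_sketch).
-export([configure/0, dump_after_commit/1]).

%% Call before mnesia:start(). A very low dc_dump_limit means the log
%% must grow far beyond the table size before mnesia dumps on its own.
configure() ->
    _ = application:load(mnesia),   % error tuple if already loaded
    application:set_env(mnesia, dc_dump_limit, 0.000001).

%% Call after the mnesia:transaction has committed and before touching
%% the ets tables again. Uses an internal mnesia function (unsupported).
dump_after_commit(Tabs) ->
    lists:foreach(fun(Tab) -> ok = mnesia_log:ets2dcd(Tab) end, Tabs).
```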

The above uses functions which are internal to mnesia and not part of 
the official API, which is not a good thing. I would suggest that mnesia 
officially export functionality such as the above, so that one can take 
some control, use only some parts of it, or combine them in novel ways. 
In my case, I don't need distribution or concurrency control (only 1 
process writes to such tables; well, in fact I have several processes, 
but no two of them write to the same table), and I am using mnesia just 
for persistence of ets tables with some atomicity involved.

Or an alternative point of view: forgetting mnesia, does anyone know of 
other solutions for persisting ets tables? I used mnesia as a first 
approach, but I am open to alternatives.

Regards,
Paulo Almeida