[erlang-questions] Impact of state size in mnesia_frag_hash

Wed Nov 18 09:45:05 CET 2009

Igor Ribeiro Sucupira wrote:
> Hello.
> 
> I've begun the implementation of a module following the
> mnesia_frag_hash behaviour.
> With enough fragments, my hash_state could grow to a reasonable size
> (something around, say, 10 MB).

Great! I have downloaded your code and will take a look at
it when time allows (no promises as to when that might be...)

This ought to come in very handy for fragmented tables where
each fragment has considerably more than 2 GB of data :)
(one example I have in mind is a sharding layer on top of
Tokyo Tyrant; in which case it might also be interesting to
add a Lua "BIF" to the Tyrant table that could calculate
the new hash on each object in place and only return the
objects that actually need to be moved to another bucket.)

> Can you think of any significant performance impact of this
> implementation? For example: is the hash_state transmitted between
> nodes frequently enough to cause significant network traffic?

That is unlikely to be the main problem. The frag_hash state
is only replicated when it is updated, which happens when you
change, for example, the number of fragments.

The bigger problem is that the frag_hash record is stored in
ets, and retrieved for each access to the fragmented table
(i.e. every read and every write). The ets:lookup() will cause
the 10 MB structure to be copied to the heap of the calling
process - once for every access!
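
A minimal sketch of how one could observe that copy, assuming a
stand-in state (a list of 100000 integers) instead of a real
frag_hash record. The module and table names are made up for
illustration; this is not how mnesia_frag stores its state.

```erlang
-module(lookup_copy_demo).
-export([run/0]).

%% Stores one large term in ets, then reports the total heap size
%% (in words) of a fresh process before and after it looks the
%% term up. ets:lookup/2 copies the stored object to the caller.
run() ->
    Tab = ets:new(demo_frag_state, [set, public]),
    %% Hypothetical stand-in for a large hash_state.
    State = lists:seq(1, 100000),
    true = ets:insert(Tab, {hash_state, State}),
    Parent = self(),
    %% Measure in a new process so that its heap starts out small.
    Pid = spawn(fun() ->
        {total_heap_size, Before} = process_info(self(), total_heap_size),
        [{hash_state, Copy}] = ets:lookup(Tab, hash_state),
        {total_heap_size, After} = process_info(self(), total_heap_size),
        Parent ! {self(), Before, After, length(Copy)}
    end),
    receive
        {Pid, WordsBefore, WordsAfter, Len} ->
            true = ets:delete(Tab),
            {WordsBefore, WordsAfter, Len}
    end.
```

Calling lookup_copy_demo:run() returns {WordsBefore, WordsAfter, Len};
the gap between the first two values gives a rough idea of what
every reader pays per lookup.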

Another approach might be to create a fully replicated
ordered_set disc_copies table and store the keys there.
To speed things up, you can do raw ets accesses to this table,
but then you have to ensure that even extra db nodes (which
don't have to be defined in the schema) are given a copy of
the table. Otherwise, you have to use at least dirty reads on
the table. These provide distribution transparently, but
carry a bit more overhead than raw ets operations.
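
Roughly, the two read paths could look like the sketch below.
The table name frag_keys, its record layout and all function
names are assumptions for illustration; what the keys actually
are depends on your hash module. The storage type is a parameter
so that the sketch can also be tried with ram_copies on a node
without a disc schema.

```erlang
-module(frag_keys_sketch).
-export([create/2, add/1, member_ets/1, member_dirty/1]).

%% Hypothetical layout: one row per key. Mnesia needs at least one
%% attribute besides the key, hence the unused 'val' field.
-record(frag_keys, {key, val = []}).

%% Create the keys table as an ordered_set replicated to Nodes.
%% StorageType would be disc_copies for the setup described above.
create(StorageType, Nodes) ->
    mnesia:create_table(frag_keys,
                        [{type, ordered_set},
                         {StorageType, Nodes},
                         {attributes, record_info(fields, frag_keys)}]).

%% Dirty write; the record name doubles as the table name.
add(Key) ->
    mnesia:dirty_write(#frag_keys{key = Key}).

%% Raw ets read. Only works on a node holding a ram or disc copy,
%% since Mnesia keeps such copies in an ets table named after the
%% Mnesia table.
member_ets(Key) ->
    ets:member(frag_keys, Key).

%% Dirty read. Works from any node that can reach a copy.
member_dirty(Key) ->
    mnesia:dirty_read(frag_keys, Key) =/= [].
```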

A reasonable compromise might be to check once whether the table
exists in RAM on the local node (e.g. using ets:info/1), and then
choose whether to use async_dirty or ets in a call to mnesia:activity/2.
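
As a sketch of that idea (the module and function names are
hypothetical), the check and the call could be written like this,
with the caller keeping the result of context/1 around instead of
recomputing it on every access:

```erlang
-module(frag_keys_access).
-export([context/1, read/2]).

%% Pick an access context for Tab on this node. ets:info/2 returns
%% 'undefined' when no ets table with that name exists locally,
%% which is the case when this node holds no ram or disc copy.
context(Tab) ->
    case ets:info(Tab, name) of
        undefined -> async_dirty;
        _ -> ets
    end.

%% Read Key from Tab using a context obtained from context/1.
read(Context, {Tab, Key}) ->
    mnesia:activity(Context, fun() -> mnesia:read(Tab, Key) end).
```

For example, Ctx = frag_keys_access:context(frag_keys) could be
computed once, followed by frag_keys_access:read(Ctx, {frag_keys, Key})
for each access.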

...except this doesn't seem to work as well as one would hope.
It seems that if you call mnesia:activity/2 inside a transaction,
it will create a new transaction store and copy into it all objects
updated in the outer transaction - even if the type of the nested
transaction is dirty or ets! (in which case the copying will be
entirely useless, as all the operations in the inner transaction
will be mapped to dirty or ets counterparts that have no knowledge
of transaction stores.)

I can understand why this case isn't optimized, as normally you
deserve a good spanking if you willfully run dirty transactions
inside non-dirty ones. But in your case, I think it might have
been a reasonable approach.

You might be left with having to implement your own gb_sets
equivalents on top of the keys table, one using raw ets
operations and one using dirty reads, and then pick the more
suitable one depending on whether the keys table is local or not.

BR,
Ulf W
-- 
Ulf Wiger
CTO, Erlang Training & Consulting Ltd
http://www.erlang-consulting.com