Consistent hashing for Mnesia fragments - part 2

Thu Nov 19 15:41:50 CET 2009

Hello, Ulf.

By now, I have reimplemented the module using one of your suggestions
(raw ets accesses to a disc_copies table):

http://igorrs.blogspot.com/2009/11/consistent-hashing-for-mnesia-fragments_19.html

Seems to be working fine now.

Thank you.
Igor.

On Wed, Nov 18, 2009 at 6:45 AM, Ulf Wiger
<ulf.wiger@REDACTED> wrote:
> Igor Ribeiro Sucupira wrote:
>>
>> Hello.
>>
>> I've begun the implementation of a module following the
>> mnesia_frag_hash behaviour.
>> With enough fragments, my hash_state could grow to a reasonable size
>> (something around, say, 10 MB).
>
> Great! I have downloaded your code and will take a look at
> it when time allows (no promises as to when that might be...)
>
> This ought to come in very handy for fragmented tables where
> each fragment has considerably more than 2 GB of data :)
> (one example I have in mind is a sharding layer on top of
> Tokyo Tyrant; in which case it might also be interesting to
> add a Lua "BIF" to the Tyrant table that could calculate
> the new hash on each object in place and only return the
> objects that actually need to be moved to another bucket.)
>
>> Can you think of any significant performance impact of this
>> implementation? For example: is the hash_state transmitted between
>> nodes frequently enough to cause significant network traffic?
>
> It's not so much that which will be a problem. The frag_hash state
> will only be replicated when it's updated, which happens when
> e.g. changing the number of fragments.
>
> The bigger problem is that the frag_hash record is stored in
> ets, and retrieved for each access to the fragmented table
> (i.e. every read and every write). The ets:lookup() will cause
> the 10 MB structure to be copied to the heap of the calling
> process - once for every access!
>
> Another approach might be to create a fully replicated
> ordered_set disc_copies table and store the keys there.
> To speed things up, you can do raw ets accesses to this table,
> but you will have to ensure then that even extra db nodes (that
> don't have to be defined in the schema) are given a copy of
> the table. Otherwise, you have to use at least dirty reads to
> the table. This will provide distribution transparently, but
> carry a bit more overhead than raw ets operations.
>
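
A minimal sketch of the idea above, assuming the ring is kept as
{Point, Fragment} tuples in an ordered_set ets table. The module name
ring_sketch, the function names and the tuple layout are hypothetical
and are not taken from the linked implementation. Each lookup copies
only one small tuple to the caller's heap rather than the whole hash
state.

```erlang
-module(ring_sketch).
-export([new/1, add_point/3, frag_for_key/2, frag_for_hash/2]).

%% Create a named ordered_set table; keys are ring points (integers).
new(Tab) ->
    ets:new(Tab, [ordered_set, named_table, public]).

%% Place fragment Frag at position Point on the ring.
add_point(Tab, Point, Frag) ->
    ets:insert(Tab, {Point, Frag}).

%% Hash the key, then find the fragment that owns that hash.
frag_for_key(Tab, Key) ->
    frag_for_hash(Tab, erlang:phash2(Key)).

%% The owner is the first ring point >= H, wrapping to the smallest
%% point. On an ordered_set, ets:next/2 also accepts a key that is not
%% present in the table.
frag_for_hash(Tab, H) ->
    case ets:lookup(Tab, H) of
        [{_, Frag}] ->
            Frag;
        [] ->
            case ets:next(Tab, H) of
                '$end_of_table' -> first_frag(Tab);
                Point -> ets:lookup_element(Tab, Point, 2)
            end
    end.

%% Wrap around to the smallest point; an empty ring is an error.
first_frag(Tab) ->
    case ets:first(Tab) of
        '$end_of_table' -> erlang:error(empty_ring);
        Point -> ets:lookup_element(Tab, Point, 2)
    end.
```

With points 10 and 20 on the ring, a hash of 5 or 10 maps to the
fragment at 10, a hash of 11 maps to the fragment at 20, and a hash of
21 wraps around to the fragment at 10.
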
> A reasonable compromise might be to check once if the table
> exists in ram (e.g. using ets:info/1), then choosing
> whether to use async_dirty or ets in a call to mnesia:activity/2.
>
> ...except this doesn't seem to work as well as one would hope.
> It seems that if you call mnesia:activity/2 inside a transaction,
> it will create a new transaction store and copy into it all objects
> updated in the outer transaction - even if the type of the nested
> transaction is dirty or ets! (in which case the copying will be
> entirely useless, as all the operations in the inner transaction
> will be mapped to dirty or ets counterparts that have no knowledge
> of transaction stores.)
>
> I can understand why this case isn't optimized, as normally you
> deserve a good spanking if you willfully run dirty transactions
> inside non-dirty ones. But in your case, I think it might have
> been a reasonable approach.
>
> You might be left with having to implement your own gb_sets
> equivalents based on ets, one using ets, and one using
> dirty_reads, and then pick the most suitable one depending on
> whether the keys table is local or not.
>
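
One way the two variants could be sketched, assuming the keys table
holds Mnesia records of the form {RecordName, Point, Fragment}. The
module name ring_access and its functions are hypothetical. The mode is
meant to be picked once with ets:info/1, which returns undefined when
no ets table of that name exists on the local node, and then passed to
every lookup, so mnesia:activity/2 is never nested inside a
transaction.

```erlang
-module(ring_access).
-export([mode/1, frag_for_hash/3]).

%% Decide once: raw ets if the table is held in RAM on this node,
%% otherwise fall back to Mnesia dirty operations.
mode(Tab) ->
    case ets:info(Tab) of
        undefined -> dirty;
        _ -> ets
    end.

%% Owner of hash H: first ring point >= H, wrapping to the smallest.
frag_for_hash(Mode, Tab, H) ->
    case read(Mode, Tab, H) of
        [{_, _, Frag}] ->
            Frag;
        [] ->
            case next(Mode, Tab, H) of
                '$end_of_table' -> wrap(Mode, Tab);
                Point -> frag_at(Mode, Tab, Point)
            end
    end.

wrap(Mode, Tab) ->
    case first(Mode, Tab) of
        '$end_of_table' -> erlang:error(empty_ring);
        Point -> frag_at(Mode, Tab, Point)
    end.

frag_at(Mode, Tab, Point) ->
    [{_, _, Frag}] = read(Mode, Tab, Point),
    Frag.

%% The same three primitives, once per access mode.
read(ets, Tab, K)   -> ets:lookup(Tab, K);
read(dirty, Tab, K) -> mnesia:dirty_read(Tab, K).

next(ets, Tab, K)   -> ets:next(Tab, K);
next(dirty, Tab, K) -> mnesia:dirty_next(Tab, K).

first(ets, Tab)     -> ets:first(Tab);
first(dirty, Tab)   -> mnesia:dirty_first(Tab).
```

A caller would compute Mode = ring_access:mode(Tab) once, for example
when the hash state is initialised, and reuse it for every access. The
sketch does not handle a table copy being added or removed later.
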
> BR,
> Ulf W
> --
> Ulf Wiger
> CTO, Erlang Training & Consulting Ltd
> http://www.erlang-consulting.com

-- 
"The secret of joy in work is contained in one word - excellence. To
know how to do something well is to enjoy it." - Pearl S. Buck.