mnesia power index -- What about fragmentation?

Sun Feb 20 18:06:31 CET 2005

Den 2005-02-19 10:12:02 skrev Valentin Micic <valentin@REDACTED>:

> Speaking of indexes...
> Am I wrong in saying that indexing for fragmented tables can
> be improoved? I did a brief test once (on R9) and notice that
> fragment number allocation for index table is the same as for
> key -- in other words, if KEY says that record goes to fragment
> number 5, the index is also going to be stored in the index
> fragment number 5. This causes a linear performance degradation
> -- more fragments one has, worse index-based lookup gets.
> Wouldn't it be better if index fragments are calculated based
> index value? Hmmm... this is too obvious. It must be that was
> doing something wrong? Was I?

I don't think you were doing anything wrong. Consider the
source code in mnesia_frag.erl for index_read():

"index_read(ActivityId, Opaque, Tab, Key, Attr, LockKind) ->
     Match =
	[mnesia:index_read(ActivityId, Opaque, Frag, Key, Attr, LockKind)
	     || Frag <- frag_names(Tab)],
     lists:append(Match)."

This will result in an index lookup on each node where there's
a fragment. In cases where the index lookup actually would
result in fetches from each node, this is quite efficient,
but for all other cases, it's suboptimal, I think.

But when you consider failure situations, managing index tables
any other way becomes a bit complex. One would perhaps want to
have fragmented indexes where distribution is determined by a
frag_hash callback, in a manner similar to what's done with the
real objects, but then you may run into situations where a node
housing index keys to another node suddenly goes down, and
you end up with only parts of the table indexed.

When I spent some time thinking about it (and I hadn't really
given it any thought before you brought it up), I came to the
conclusion that one may have to manage index tables as first-class
tables, in order to get good cluster properties. Then, you could
specify one fragmentation algorithm for the tables, and another
for each index; and you could replicate index tables, so that
you'd be protected from single node crashes.

That would require quite a lot of rewriting in mnesia, though,
and there would be a prize to pay in overhead for several
functions on normal tables, I think.

I'm sure Håkan and Dan have given this much more thought than
I have.

/Uffe