[erlang-questions] optimal data structures / db platform

Sun Jun 10 16:11:19 CEST 2007

One thing to consider is to store the docs in ordered_set
tables, where the key reflects the hierarchy.

For a while (long ago), I contemplated building an XML
database like this. The key might be something like
[DocId,Chap,SubChap1,SubChap11], etc.
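
To illustrate why such keys work, here is a minimal sketch of how
Erlang's term ordering sorts list keys. The module name and the
doc1/doc2 identifiers are made up for the example; an ordered_set
table uses this same ordering on its keys.

```erlang
-module(doc_key_order).
-export([demo/0]).

%% Hypothetical keys of the form [DocId | PathWithinDoc].
%% Lists are compared element by element, and a list that is a
%% prefix of another sorts before it, so a heading sorts just
%% before its own subheadings.
demo() ->
    Keys = [[doc2, 1],
            [doc1, 2],
            [doc1, 1, 1],
            [doc1],
            [doc1, 1],
            [doc1, 1, 2]],
    lists:sort(Keys).
%% Returns [[doc1],[doc1,1],[doc1,1,1],[doc1,1,2],[doc1,2],[doc2,1]]
```

Note that this assumes the elements at a given position are of the
same type; mixing, say, integers and atoms at the same level would
sort all the integers first.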

If the subheadings are ordered, a simple select with the
match specification
[{#doc{key = [DocId|'_'], _ = '_'}, [], ['$_']}] retrieves the
whole document in order, and as the first part of the key
is bound, this will be a pretty efficient search.
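
A small sketch of that select, using a plain ets ordered_set table
as a stand-in for the mnesia table (mnesia:select/2 accepts the
same kind of match specification). The record fields, the module
name and the sample data are assumptions made for the example.

```erlang
-module(doc_select).
-export([demo/0]).

%% Assumed record layout: a hierarchical list key plus a payload.
-record(doc, {key, data}).

demo() ->
    Tab = ets:new(doc, [ordered_set, {keypos, #doc.key}]),
    %% Inserted out of order on purpose.
    ets:insert(Tab, [#doc{key = [d2, 1],    data = "other doc"},
                     #doc{key = [d1, 2],    data = "chapter 2"},
                     #doc{key = [d1],       data = "title"},
                     #doc{key = [d1, 1, 1], data = "section 1.1"},
                     #doc{key = [d1, 1],    data = "chapter 1"}]),
    DocId = d1,
    %% Head of the key is bound, the rest of the key is a wildcard.
    MatchSpec = [{#doc{key = [DocId | '_'], _ = '_'}, [], ['$_']}],
    Result = ets:select(Tab, MatchSpec),
    ets:delete(Tab),
    [D#doc.data || D <- Result].
%% Returns ["title","chapter 1","section 1.1","chapter 2"]
```

Only the records of document d1 come back, and they come back in
key order, so no sorting is needed afterwards.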

As you inevitably take the world by storm and grow out
of a single box, you can fragment the database by use
of a custom version of the callback module mnesia_frag_hash.

http://www.erlang.org/doc/doc-5.5.4/lib/mnesia-4.3.4/doc/html/mnesia_frag_hash.html

A simple way to do it would be to copy mnesia_frag_hash, and
make only a slight modification of key_to_frag_number:

key_to_frag_number(State, [Key|_]) when is_record(State, hash_state) ->
                          % ^^^^^^ use only the head of the key
    L = State#hash_state.n_doubles,
    A = erlang:phash(Key, trunc(math:pow(2, L))),
    P = State#hash_state.next_n_to_split,
    if
        A < P ->
            erlang:phash(Key, trunc(math:pow(2, L + 1)));
        true ->
            A
    end.

This will cause mnesia to select the fragment based on the
document ID, keeping the whole document inside a single fragment.
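
As a rough check of that claim, here is a standalone sketch of the
same hashing rule. It is not the mnesia_frag_hash code: the two
state fields are passed as plain arguments, the module and function
names are made up, and erlang:phash2/2 (which is 0-based, hence
the +1) is used in place of erlang:phash/2.

```erlang
-module(doc_frag_demo).
-export([frag_number/3, demo/0]).

%% Linear-hashing rule as in the function above, hashing only the
%% head of the key (the document ID).
%% NDoubles and NextToSplit stand in for the n_doubles and
%% next_n_to_split fields of the hash state.
frag_number(NDoubles, NextToSplit, [DocId | _]) ->
    A = erlang:phash2(DocId, 1 bsl NDoubles) + 1,
    if
        A < NextToSplit ->
            erlang:phash2(DocId, 1 bsl (NDoubles + 1)) + 1;
        true ->
            A
    end.

%% All keys of document d1 should map to one and the same fragment,
%% so the list of distinct fragment numbers has exactly one element.
demo() ->
    Keys = [[d1], [d1, 1], [d1, 1, 1], [d1, 2]],
    Frags = [frag_number(2, 3, K) || K <- Keys],
    lists:usort(Frags).
```

The modified copy of the callback module would then be named in
the table's frag_properties, e.g. {hash_module, my_frag_hash},
where my_frag_hash is a placeholder name.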

Not that this is a complete solution by any means,
just a suggestion.

BR,
Ulf W

2007/6/10, Brad Anderson <brad@REDACTED>:
> Hi all,
>
> I'm an Erlang newb and I'm contemplating building a system where various
> documents need to be aggregated.  A document is basically key/value pairs that
> hold different elements of information.  My question revolves around which
> Erlang data structures I should use, and if Mnesia or a more traditional RDBMS
> like PostgreSQL should be employed.
>
> Each document might be more or less complete than the others in the entire
> collection (more or fewer key/value pairs present), they change over time, and
> can be hierarchical in nature.  These requirements lead me to a list or tuple
> or some such, and away from a RDBMS.
>
> 1. Erlang / Mnesia pseudo-code:
> -record(doc, {
>               location,
>               week,           %% or month, or some other time dimension
>               doc_data = []   %% [{key1, val1},
>                               %%  {key2, val2},
>                               %%   ...
>                               %%  {keyN, valN}]
>              }).
>
>
> 2. RDBMS schema (horizontal):
>
> +----------+------+------+------+-----+------+
> | location | week | val1 | val2 | ... | valN |
> +----------+------+------+------+-----+------+
>
> (difficult to 'change over time' and be 'hierarchical')
>
>
> 3. RDBMS schema (vertical):
>
> +----------+------+-----+-----+
> | location | week | key | val |
> +----------+------+-----+-----+
>
> or two tables:
>
> +--------+----------+------+           +--------+-----+-------+
> | doc_id | location | week |    and    | doc_id | key | value |
> +--------+----------+------+           +--------+-----+-------+
>
> (potential query inefficiency, difficult WHERE constraints for filtering out
> missing values if need be, or filtering out stores based on data values - a
> feature that would require touching each doc! )
>
>
> I'm guessing most of the aggregations that are needed are sums and averages.
>
> Week and location will be used to narrow the set down, based on a time period
> and characteristics of the locations.  However, as mentioned above, sometimes
> locations are eliminated because of data values in the doc, like, say, "show
> me only the top 20% locations for the key1 data value."  Maybe key1 == sales
> or something.
>
> Questions:
> ----------
> Which one should I use? or are there alternative structures in Erlang that I
> haven't listed?
>
> How would my decision change as N grew?  I'm not sure what the overall
> population of documents will be, but you gotta dare to dream that the world
> will eat this up en masse :)  Millions or billions of docs would be cool.  I'm
> aware of limits in Mnesia tables, but frankly, for performance, I'd be
> partitioning the RDBMS tables as I would Mnesia ones.
>
> Does Mnesia's in-memory, distributed, fault-tolerant, Erlang data structure
> nature far surpass the RDBMS's more rigid structure, but long history of
> optimization?
>
> Would the Erlang / Mnesia approach plus a mapreduce type of system, spread
> across many boxes help tilt the scales away from the RDBMS?
>
> Sorry for the long post.  It's kind of an important decision ;)
>
> Cheers,
> Brad