[erlang-questions] optimal data structures / db platform
Ulf Wiger
ulf@REDACTED
Sun Jun 10 16:11:19 CEST 2007
One thing to consider is to store the docs in ordered_set
tables, where the key reflects the hierarchy.
For a while (long ago), I contemplated building an XML
database like this. The key might be something like
[DocId,Chap,SubChap1,SubChap11], etc.
If the subheadings are ordered, a simple select with the match spec
{#doc{key = [DocId|'_'], _ = '_'}, [], ['$_']} retrieves the
whole document in order, and since the first part of the key
is bound, this will be a pretty efficient search.
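
A minimal sketch of the idea, using a plain ets ordered_set table rather than mnesia so that it runs on its own. The module name, the two-field #doc{} record and the sample keys are assumptions made for illustration; only the key layout and the match spec come from the text above.

```erlang
-module(doc_store_sketch).
-export([demo/0]).

%% Hypothetical record: a hierarchical key plus an opaque payload.
-record(doc, {key, data}).

demo() ->
    Tab = ets:new(docs, [ordered_set, {keypos, #doc.key}]),
    %% Insert in arbitrary order; ordered_set keeps the rows sorted by key.
    ets:insert(Tab, [#doc{key = [b, 1],    data = "other document"},
                     #doc{key = [a, 2],    data = "chapter 2"},
                     #doc{key = [a, 1, 1], data = "section 1.1"},
                     #doc{key = [a, 1],    data = "chapter 1"},
                     #doc{key = [a],       data = "title"}]),
    %% Bind the head of the key (the document id) and wildcard the rest.
    Rows = ets:select(Tab, [{#doc{key = [a | '_'], _ = '_'}, [], ['$_']}]),
    ets:delete(Tab),
    [Key || #doc{key = Key} <- Rows].
```

Calling doc_store_sketch:demo() returns [[a],[a,1],[a,1,1],[a,2]]: only document a, in hierarchical order, because a list key sorts before any longer key that extends it.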
As you inevitably take the world by storm and grow out
of a single box, you can fragment the database by use
of a custom version of the callback module mnesia_frag_hash.
http://www.erlang.org/doc/doc-5.5.4/lib/mnesia-4.3.4/doc/html/mnesia_frag_hash.html
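
As a rough sketch of how such a callback module is plugged in, the table options might look like the following. The module name doc_frag_hash, the #doc{} record and the fragment count of 8 are assumptions for illustration; the option names (frag_properties, n_fragments, node_pool, hash_module) are the ones mnesia documents for fragmented tables.

```erlang
-module(frag_table_sketch).
-export([table_options/1]).

%% Hypothetical record: a hierarchical key plus an opaque payload.
-record(doc, {key, data}).

%% Options intended for mnesia:create_table(doc, Options).
%% doc_frag_hash is a hypothetical name for the customised copy of
%% mnesia_frag_hash; the fragment count of 8 is arbitrary.
table_options(Nodes) ->
    [{type, ordered_set},
     {attributes, record_info(fields, doc)},
     {frag_properties, [{n_fragments, 8},
                        {node_pool, Nodes},
                        {hash_module, doc_frag_hash}]}].
```

Reads and writes then have to go through the mnesia_frag access module, for instance mnesia:activity(transaction, Fun, [], mnesia_frag), so that the hash module is consulted for every key.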
A simple way to do it would be to copy mnesia_frag_hash, and
make only a slight modification of key_to_frag_number/2:

    key_to_frag_number(State, [Key|_]) when is_record(State, hash_state) ->
        %%                    ^^^^^^^ use only the head of the key
        L = State#hash_state.n_doubles,
        A = erlang:phash(Key, trunc(math:pow(2, L))),
        P = State#hash_state.next_n_to_split,
        if
            A < P ->
                %% this fragment has already been split; rehash over
                %% twice the range
                erlang:phash(Key, trunc(math:pow(2, L + 1)));
            true ->
                A
        end.
This will cause mnesia to select the fragment based on document ID,
keeping the whole document inside a single fragment.
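
The effect can be checked outside mnesia with a small stand-alone sketch. The assumptions here: the hash_state record is a local stand-in for the one kept by mnesia_frag_hash, the state values are made up, and erlang:phash2/2 plus one stands in for erlang:phash/2 as a 1-based hash, so the fragment numbers will not match what mnesia itself would compute.

```erlang
-module(frag_by_doc_sketch).
-export([key_to_frag_number/2, demo/0]).

%% Local stand-in for the state record used by mnesia_frag_hash (assumption).
-record(hash_state, {n_fragments, next_n_to_split, n_doubles}).

%% 1-based hash; phash2/2 + 1 is a substitute for erlang:phash/2.
hash(Term, Range) ->
    erlang:phash2(Term, Range) + 1.

%% Same shape as the function above, but only the head of the key is hashed.
key_to_frag_number(#hash_state{n_doubles = L, next_n_to_split = P},
                   [DocId | _]) ->
    A = hash(DocId, 1 bsl L),
    if
        A < P -> hash(DocId, 1 bsl (L + 1));
        true  -> A
    end.

demo() ->
    %% Made-up state: 4 fragments after two doublings, 2 of them split again.
    State = #hash_state{n_fragments = 6, next_n_to_split = 3, n_doubles = 2},
    Keys = [[doc42], [doc42, 1], [doc42, 1, 3], [doc42, 2]],
    %% Collapse the fragment numbers of all keys into a set.
    lists:usort([key_to_frag_number(State, Key) || Key <- Keys]).
```

frag_by_doc_sketch:demo() returns a one-element list: every key that starts with doc42 maps to the same fragment, whatever follows in the key.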
Not that this is a complete solution by any means,
just a suggestion.
BR,
Ulf W
2007/6/10, Brad Anderson <brad@REDACTED>:
> Hi all,
>
> I'm an Erlang newb and I'm contemplating building a system where various
> documents need to be aggregated. A document is basically key/value pairs that
> hold different elements of information. My question revolves around which
> Erlang data structures I should use, and if Mnesia or a more traditional RDBMS
> like PostgreSQL should be employed.
>
> Each document might be more or less complete than the others in the entire
> collection (more or fewer key/value pairs present), they change over time, and
> can be hierarchical in nature. These requirements lead me to a list or tuple
> or some such, and away from a RDBMS.
>
> 1. Erlang / Mnesia pseudo-code:
> -record(doc, {
>     location,
>     week,          %% or month, or some other time dimension
>     doc_data = []  %% [{key1, val1}, {key2, val2}, ..., {keyN, valN}]
> }).
>
>
> 2. RDBMS schema (horizontal):
>
> +----------+------+------+------+-----+------+
> | location | week | val1 | val2 | ... | valN |
> +----------+------+------+------+-----+------+
>
> (difficult to 'change over time' and be 'hierarchical')
>
>
> 3. RDBMS schema (vertical):
>
> +----------+------+-----+-----+
> | location | week | key | val |
> +----------+------+-----+-----+
>
> or two tables:
>
> +--------+----------+------+ +--------+-----+-------+
> | doc_id | location | week | and | doc_id | key | value |
> +--------+----------+------+ +--------+-----+-------+
>
> (potential query inefficiency, difficult WHERE constraints for filtering out
> missing values if need be, or filtering out stores based on data values - a
> feature that would require touching each doc! )
>
>
> I'm guessing most of the aggregations that are needed are sums and averages.
>
> Week and location will be used to narrow the set down, based on a time period
> and characteristics of the locations. However, as mentioned above, sometimes
> locations are eliminated because of data values in the doc, like, say, "show
> me only the top 20% locations for the key1 data value." Maybe key1 == sales
> or something.
>
> Questions:
> ----------
> Which one should I use? or are there alternative structures in Erlang that I
> haven't listed?
>
> How would my decision change as N grew? I'm not sure what the overall
> population of documents will be, but you gotta dare to dream that the world
> will eat this up en masse :) Millions or billions of docs would be cool. I'm
> aware of limits in Mnesia tables, but frankly, for performance, I'd be
> partitioning the RDBMS tables as I would Mnesia ones.
>
> Does Mnesia's in-memory, distributed, fault-tolerant, Erlang data structure
> nature far surpass the RDBMS's more rigid structure, but long history of
> optimization?
>
> Would the Erlang / Mnesia approach plus a mapreduce type of system, spread
> across many boxes help tilt the scales away from the RDBMS?
>
> Sorry for the long post. It's kind of an important decision ;)
>
> Cheers,
> Brad