[erlang-questions] optimal data structures / db platform

Sun Jun 10 07:12:40 CEST 2007

Hi all,

I'm an Erlang newb and I'm contemplating building a system where various
documents need to be aggregated.  A document is basically key/value pairs that
hold different elements of information.  My question revolves around which
Erlang data structures I should use, and if Mnesia or a more traditional RDBMS
like PostgreSQL should be employed.

Each document might be more or less complete than the others in the entire
collection (more or fewer key/value pairs present), they change over time, and
can be hierarchical in nature.  These requirements lead me to a list or tuple
or some such, and away from an RDBMS.

1. Erlang / Mnesia pseudo-code:
-record(doc, {
              location,
              week,           %% or month, or some other time dimension
              doc_data = []   %% [{key1, val1}, {key2, val2}, ..., {keyN, valN}]
             }).
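
To make the record idea concrete, here is a minimal sketch of building a doc
and reading a value back out.  The module name doc_sketch and the functions
new/3 and get_value/2,3 are made up for illustration; only lists:keyfind/3 is
a real library call.  A missing key returns a default, since docs can be
incomplete, and a value could itself be a list of pairs for the hierarchical
case.

```erlang
-module(doc_sketch).
-export([new/3, get_value/2, get_value/3]).

%% Same shape as the record above: doc_data is a list of {Key, Value} pairs.
-record(doc, {location, week, doc_data = []}).

%% Build a doc from a location, a week, and a list of {Key, Value} pairs.
new(Location, Week, Pairs) when is_list(Pairs) ->
    #doc{location = Location, week = Week, doc_data = Pairs}.

%% Look up Key in a doc; a missing key yields the atom undefined.
get_value(Key, Doc) ->
    get_value(Key, Doc, undefined).

%% Look up Key in a doc; a missing key yields Default.
get_value(Key, #doc{doc_data = Pairs}, Default) ->
    case lists:keyfind(Key, 1, Pairs) of
        {Key, Value} -> Value;
        false        -> Default
    end.
```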

2. RDBMS schema (horizontal):

+----------+------+------+------+-----+------+
| location | week | val1 | val2 | ... | valN |
+----------+------+------+------+-----+------+

(difficult to 'change over time' and be 'hierarchical')

3. RDBMS schema (vertical):

+----------+------+-----+-----+
| location | week | key | val |
+----------+------+-----+-----+

or two tables:

+--------+----------+------+           +--------+-----+-------+
| doc_id | location | week |    and    | doc_id | key | value |
+--------+----------+------+           +--------+-----+-------+

(potential query inefficiency, awkward WHERE constraints for filtering out
missing values if need be, or for filtering out stores based on data values,
a feature that would require touching each doc!)

I'm guessing most of the aggregations that are needed are sums and averages.
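
As a rough sketch of those aggregations over the record layout in option 1,
assuming the values under a key are plain numbers: the module name doc_agg
and its functions are hypothetical, and docs that lack the key are simply
skipped rather than counted as zero.

```erlang
-module(doc_agg).
-export([sum/2, average/2]).

-record(doc, {location, week, doc_data = []}).

%% Sum of all values stored under Key; docs without Key contribute nothing.
sum(Key, Docs) ->
    lists:sum(values(Key, Docs)).

%% Mean of the values stored under Key, or undefined if no doc has Key.
average(Key, Docs) ->
    case values(Key, Docs) of
        []     -> undefined;
        Values -> lists:sum(Values) / length(Values)
    end.

%% Collect the values stored under Key across all docs.
values(Key, Docs) ->
    [V || #doc{doc_data = Pairs} <- Docs,
          {K, V} <- Pairs,
          K =:= Key].
```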

Week and location will be used to narrow the set down, based on a time period
and characteristics of the locations.  However, as mentioned above, sometimes
locations are eliminated because of data values in the doc, like, say, "show
me only the top 20% locations for the key1 data value."  Maybe key1 == sales
or something.
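
The "top 20%" filter could look something like the sketch below.  It assumes
the docs have already been narrowed by week, that values are numeric, and
that a location is ranked by its total for the key; the module doc_top and
the function top_locations/3 are names invented here.  Note that it does
touch every doc, which is the cost I'm worried about.

```erlang
-module(doc_top).
-export([top_locations/3]).

-record(doc, {location, week, doc_data = []}).

%% Return the locations whose total for Key is in the top Fraction
%% (0.2 for the top 20%), highest total first.
top_locations(Key, Fraction, Docs) when Fraction > 0, Fraction =< 1 ->
    %% Accumulate a per-location total for Key.
    Totals = lists:foldl(
               fun(#doc{location = Loc, doc_data = Pairs}, Acc) ->
                       case lists:keyfind(Key, 1, Pairs) of
                           {Key, V} -> orddict:update_counter(Loc, V, Acc);
                           false    -> Acc  %% doc has no value for Key
                       end
               end, orddict:new(), Docs),
    %% Rank locations by total, largest first, and keep the top slice.
    Ranked = lists:reverse(lists:keysort(2, orddict:to_list(Totals))),
    Keep = ceil(Fraction * length(Ranked)),
    [Loc || {Loc, _Total} <- lists:sublist(Ranked, Keep)].
```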

Questions:
----------
Which one should I use?  Or are there alternative structures in Erlang that I
haven't listed?

How would my decision change as N grew?  I'm not sure what the overall
population of documents will be, but you gotta dare to dream that the world
will eat this up en masse :)  Millions or billions of docs would be cool.  I'm
aware of limits in Mnesia tables, but frankly, for performance, I'd be
partitioning the RDBMS tables as I would Mnesia ones.

Do Mnesia's in-memory, distributed, fault-tolerant nature and its native
Erlang data structures outweigh the RDBMS's more rigid structure but long
history of optimization?

Would the Erlang / Mnesia approach, plus a mapreduce type of system spread
across many boxes, help tilt the scales away from the RDBMS?
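
By "mapreduce type of system" I mean roughly the following, sketched here
with local processes only.  The module doc_mr and parallel_sum/2 are
hypothetical names, and the partitions are assumed to be plain lists of docs
already split up somehow; spreading the workers across real nodes (and
handling a worker that dies) is left out.

```erlang
-module(doc_mr).
-export([parallel_sum/2]).

-record(doc, {location, week, doc_data = []}).

%% Map: one worker process per partition computes a partial sum for Key.
%% Reduce: the caller collects the partial sums and adds them up.
parallel_sum(Key, Partitions) ->
    Parent = self(),
    Refs = [begin
                Ref = make_ref(),
                spawn_link(fun() ->
                                   Parent ! {Ref, partial_sum(Key, Part)}
                           end),
                Ref
            end || Part <- Partitions],
    %% Wait for each worker's reply, matched by its unique reference.
    lists:sum([receive {Ref, Partial} -> Partial end || Ref <- Refs]).

%% Sum the values stored under Key within a single partition.
partial_sum(Key, Docs) ->
    lists:sum([V || #doc{doc_data = Pairs} <- Docs,
                    {K, V} <- Pairs,
                    K =:= Key]).
```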

Sorry for the long post.  It's kind of an important decision ;)

Cheers,
Brad