mnesia, large datasets and key range lookups

Jouni Rynö Jouni.Ryno@REDACTED
Thu Jul 15 16:22:18 CEST 2004


On Thu, 2004-07-15 at 10:15 +0200, Ulf Wiger wrote:

> One thing you could play around with is to use fragmented tables, with
> one fragment per day, and a customized frag_hash algorithm, selecting
> fragment based on the date. This will allow you to keep a single-table
> view, while being able to narrow your searches based on time.
> 

Fragmented tables really did the trick. My DB from 2 weeks of
measurements (an operator playing with the reference instruments on work
days) is now about 900 MB, with the biggest fragment about 190 MB.
Searching through this for several arbitrary parameters takes about 45
seconds, really worth the effort (4 minutes for a human waiting in front
of the web page to load was far too long).

My time stamps are Modified Julian Dates, i.e. floating-point days since
2000-01-01. The key-to-fragment-number code is now simply:

key_to_frag_number(State, Key) when is_record(State, hash_state) ->
    %% the Key is an MJD2000 date, so truncating it gives one fragment per day
    %% io:fwrite("From ~p to ~w~n", [Key, trunc(Key)]),
    trunc(Key).
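
For completeness, a minimal sketch of how such a hash module can be
plugged in when the table is created (the module name mjd_frag_hash and
the fragment count are placeholders, not my actual setup):

%% Sketch only: mjd_frag_hash stands for the module that exports the
%% mnesia_frag_hash callbacks (init_state/2, add_frag/1, del_frag/1,
%% key_to_frag_number/2, match_spec_to_frag_numbers/2), with
%% key_to_frag_number/2 as above; the fragment count is illustrative.
mnesia:create_table(time_parameter,
    [{type, bag},
     {attributes, record_info(fields, time_parameter)},
     {frag_properties, [{n_fragments, 1700},
                        {n_disc_only_copies, 1},
                        {node_pool, [node()]},
                        {hash_module, mjd_frag_hash}]}]).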

Today is something like day 1645 (I did not check), so I do have a lot
of empty fragment tables, but as they are only about 6 kilobytes each, I
don't care. I did have to raise the per-process limit on the number of
open files from the default 1024 (Linux 2.6).
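
For reference, a small sketch (not part of the original code) of how the
day number can be computed with the standard calendar module:

mjd2000_day(Date) ->
    %% days since 2000-01-01, i.e. the integer part of the MJD2000 key
    calendar:date_to_gregorian_days(Date)
        - calendar:date_to_gregorian_days({2000, 1, 1}).

For example, mjd2000_day({2004, 7, 15}) works out to 1657, which is the
fragment a key written today would truncate to.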

About mnesia_frag: is there some compatibility reason why I am supposed
to use mnesia:activity(_, _, _, mnesia_frag) to access the DB? If a
table is fragmented, one most likely wants to use the
fragmentation-aware calls to access the data anyway.
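
For the record, a minimal sketch of what that access looks like (the
function name and access context are illustrative):

frag_select(Tab, MatchSpec) ->
    %% mnesia_frag as the access module makes mnesia consult the hash
    %% module when deciding which fragments to read
    mnesia:activity(async_dirty,
                    fun() -> mnesia:select(Tab, MatchSpec) end,
                    [], mnesia_frag).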

And while I'm whining about my lack of understanding: the bench program
in the mnesia examples directory is supposed to use fragmented tables,
but it does not use those activity calls. I have not tested it (I will,
when I have time), but if I call the normal routines, all the records go
to the first (base) fragment.

thanks to you all for the suggestions
	Jouni

My original problem:

My data consists of measurements taken at irregular times. At any one
time there can be anything from 1 to 30-something measurements.

So the logical first step is to define
-record(time_parameter, {time, parameter, value}).

then to define the mnesia table like this:
mnesia:create_table(time_parameter, [{type, bag},
                    {disc_only_copies, [node()]},
                    {attributes, record_info(fields, time_parameter)}]).

Now each time key can contain several parameters. But as there can be
thousands of measurements per day, the DB files become really large. A
week's data took about 800 MB ...

Disc space is no problem, but searching the data is (for correlating
measurements with the housekeeping parameters). Searching for an
arbitrary parameter over a certain time interval

dirty_para_request(Parameter, StartTime, EndTime) ->
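    %% match every time_parameter record for Parameter whose time '$1'
    %% falls in the half-open interval (StartTime, EndTime]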
    Sel = [{#time_parameter{parameter=Parameter,
                            time='$1', value='_'},
            [{'<',{const,StartTime},'$1'},{'=<','$1',{const,EndTime}}],
            ['$_']}],
    mnesia:dirty_select(time_parameter, Sel).

will now take about 4 minutes, as Mnesia has to scan through the full
DB. Extrapolating this even to a year's operation means that one has to
do something else.

So far the only thing I can think of is to split the DB into daily
tables, then first select the relevant tables by name (based on the
time) and search only those tables.
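
Roughly like this (a sketch only; it assumes the daily tables would be
created with {record_name, time_parameter} so they can share the record
definition):

daily_tables(StartTime, EndTime) ->
    %% one table per day, named after the truncated MJD2000 day number
    [list_to_atom("time_parameter_" ++ integer_to_list(Day))
     || Day <- lists:seq(trunc(StartTime), trunc(EndTime))].

daily_para_request(Parameter, StartTime, EndTime) ->
    Sel = [{#time_parameter{parameter = Parameter, time = '$1', value = '_'},
            [{'<', {const, StartTime}, '$1'}, {'=<', '$1', {const, EndTime}}],
            ['$_']}],
    lists:append([mnesia:dirty_select(Tab, Sel)
                  || Tab <- daily_tables(StartTime, EndTime)]).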


-- 

  Jouni Rynö                            mailto://Jouni.Ryno@fmi.fi/
                                        http://www.geo.fmi.fi/~ryno/
  Finnish Meteorological Institute      http://www.fmi.fi/
  Space Research                        http://www.geo.fmi.fi/
  P.O.BOX 503                           Tel      (+358)-9-19294656
  FIN-00101 Helsinki                    FAX      (+358)-9-19294603
  Finland                               priv-GSM (+358)-50-5302903
  
  "It's just zeros and ones, it cannot be hard"



