mnesia, large datasets and key range lookups

Shawn Pearce spearce@REDACTED
Tue Jul 13 08:58:11 CEST 2004


This is where I wish mnesia had compound key indexes.  My suggestion
is pretty simple.  Add another attribute for the key and use that:

  -record(time_parameter, {key, time, parameter, value}).

  create_tables() ->
     mnesia:create_table(time_parameter,[{type, set},
                         {disc_only_copies, [node()]},
                         {attributes, record_info(fields, time_parameter)}]).

  make_key(#time_parameter{time=T, parameter=P}) ->
     {T, P}.

  add_sample(V) ->
     V2 = V#time_parameter{ key = make_key(V) },
     mnesia:write(V2).
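
For what it's worth, a hypothetical caller might look like this (untested,
and record_measurement/3 is an invented name).  mnesia:write/1 needs a
transaction context, so the write is wrapped in mnesia:transaction/1:

  record_measurement(Time, Parameter, Value) ->
     Rec = #time_parameter{time = Time, parameter = Parameter, value = Value},
     mnesia:transaction(fun() -> add_sample(Rec) end).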

Then match becomes:

  dirty_para_request(Parameter, StartTime, EndTime) ->
      Sel = [{#time_parameter{parameter=Parameter,
                              key='$1', time='_', value='_'},
              [{'<',{const,{StartTime,Parameter}},'$1'},
               {'=<','$1',{const,{EndTime,Parameter}}}],
              ['$_']}],
      mnesia:dirty_select(time_parameter, Sel).
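
If the select syntax gets hard to read, the same range query can be written
with QLC against the compound-key table above (equally untested, just a
sketch):

  -include_lib("stdlib/include/qlc.hrl").

  para_request_qlc(Parameter, StartTime, EndTime) ->
      Q = qlc:q([R || R = #time_parameter{key = {T, P}}
                          <- mnesia:table(time_parameter),
                      P =:= Parameter,
                      T > StartTime, T =< EndTime]),
      mnesia:transaction(fun() -> qlc:e(Q) end).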


Hopefully this will make mnesia use a key range scan.  But note the
above is untested.  :)

I'm basing this on the fact that tuples compare element by element, so
{Time, Parameter} keys sort first by time and then by parameter.  However,
because other parameters can exist for a given time value, I am still
filtering on the parameter attribute in the dirty_select.  Insert
performance will slow down somewhat, since with a set table Mnesia must
check whether a tuple with that key is already present.
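
For illustration, the term ordering the key relies on can be seen in the
shell:

  1> {100, temp} < {200, pressure}.
  true
  2> {100, temp} < {100, voltage}.
  true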

Mnesia simply may not be the best database for your application.  You
might find that solutions based on pure dets, the gridfile contrib, or a
SQL database give you higher performance.


Jouni Rynö <Jouni.Ryno@REDACTED> wrote:
> Dear all
> 
> Just wondering, what one should do:
> 
> My data contains irregularly (in time) taken measurements. At a time
> there can be anything from 1 to 30 something measurements.
> 
> So the first logical start is to define
> -record(time_parameter, {time, parameter, value}).
> 
> then to define mnesia table like this            
> mnesia:create_table(time_parameter,[{type, bag},
>                     {disc_only_copies, [node()]},
>                     {attributes, record_info(fields, time_parameter)}]).
> 
> Now each time key can contain several parameters. But as there can be
> thousands of measurements per day, the DB-files will
> become really large. A week's data took about 800 MBytes ...
> 
> No problem with the disc space, but with the searching of the data (for
> correlating data with the housekeeping parameters). Searching for random
> parameter for a certain time interval
> 
> dirty_para_request(Parameter, StartTime, EndTime) ->
>     Sel = [{#time_parameter{parameter=Parameter,
>                             time='$1', value='_'},
>             [{'<',{const,StartTime},'$1'},{'=<','$1',{const,EndTime}}],
>             ['$_']}],
>     mnesia:dirty_select(time_parameter, Sel).
> 
> will now take about 4 minutes, as Mnesia has to scan through the full DB.
> Extrapolating this to even a year's operation means that one has to do
> something else.
> 
> So far the only thing I can think about, is to split the db-tables to
> daily tables. Then first select the relevant tables by name (based on
> the time) and make the search from those tables only.
> 
> But are there any other solutions?  

-- 
Shawn.


