mnesia, large datasets and key range lookups

Tue Jul 13 11:40:49 CEST 2004

On Tue, 2004-07-13 at 11:13 +0200, Nigel.Head@REDACTED wrote:
> >   make_key(#time_parameter{time=T, parameter=P}) ->
> >      {T, P}.N.
> 
> While I'm not exactly sure what the exact application is, 

Telemetry data from one of the Rosetta instruments :)

> the original post said
> there might be up to about 30 parameters at any given time. Why is it strictly
> necessary to have the parameter itself as part of the key? I would go for making
> the record contain the time (as a key) and a list of parameter values for that
> time. The list can be variable length, of course.
> 

Different telemetry packets can contain the same parameters in a
different order. Actually, the order is usually the same, but each
packet carries only a subset of the full parameter set.

> This would reduce the number of records by some factor of 10 or so; locating the
> specific parameter you're after would then be some sort of application level
> list search -- didn't ought to be too expensive for a list of max 30 long.
> Chances are you'll be needing other parameters from the same time real soon in
> your processing anyway.
> 
For the list search I would have to know the order of the parameters
at each time point, which is much like keeping the raw telemetry data
itself. That is exactly what I am trying to get rid of by using mnesia :)
Hmm, I could use tuples in the list, like this: [{"NCSA0005", Value},
{"NCSA0010", Value2}, ...]. That could be scanned pretty easily with
lists:keysearch. On the other hand, mnesia does that for me, see below.
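
A minimal sketch of that tuple-list idea, assuming each time point holds
a list of {Name, Value} pairs. The module name, function name and values
are made up for illustration:

```erlang
-module(param_list).
-export([lookup/2]).

%% Look up one parameter in a list of {Name, Value} tuples.
%% Returns {ok, Value} if the name is present, not_found otherwise.
lookup(Name, Params) ->
    case lists:keysearch(Name, 1, Params) of
        {value, {_, Value}} -> {ok, Value};
        false -> not_found
    end.
```

For example, param_list:lookup("NCSA0010", [{"NCSA0005", 1.5},
{"NCSA0010", 2.5}]) returns {ok, 2.5}. The order of the pairs in the
list does not matter for the lookup.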

The interesting part of this email :) is that the time taken does not
depend much on the number of parameters in each time stamp. It depends
(only) on the size of the table itself, as mnesia has to (?, I think)
scan through the whole table to find the keys. It's not an ordered set,
but a bag.
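
For comparison, a rough sketch of how a key range can be walked in an
ordered table without touching the rest of it. This uses a plain ets
ordered_set keyed on {Time, Parameter}, not mnesia, and the module name
and sample data are invented. Whether an ordered_set table would suit
the real data set is an assumption:

```erlang
-module(range_walk).
-export([demo/0, range/3]).

%% Collect all keys {T, P} with Lo < T =< Hi from an ordered_set table
%% by stepping through the keys in order with ets:next/2.
range(Tab, Lo, Hi) ->
    %% {Lo, <<>>} sorts after every {Lo, String} key, because binaries
    %% sort after lists in Erlang term order. The start key does not
    %% have to exist in an ordered_set table.
    walk(Tab, ets:next(Tab, {Lo, <<>>}), Hi, []).

walk(_Tab, '$end_of_table', _Hi, Acc) ->
    lists:reverse(Acc);
walk(Tab, {T, _P} = Key, Hi, Acc) when T =< Hi ->
    walk(Tab, ets:next(Tab, Key), Hi, [Key | Acc]);
walk(_Tab, _Key, _Hi, Acc) ->
    %% First key past the upper bound: stop walking.
    lists:reverse(Acc).

demo() ->
    Tab = ets:new(tp, [ordered_set]),
    ets:insert(Tab, [{{1607.50, "NCSA0005"}, 1},
                     {{1607.54, "NCSA0005"}, 2},
                     {{1607.56, "NCSA0005"}, 3},
                     {{1607.56, "NCSA0010"}, 4},
                     {{1607.58, "NCSA0010"}, 5},
                     {{1607.60, "NCSA0005"}, 6}]),
    Keys = range(Tab, 1607.54, 1607.58),
    ets:delete(Tab),
    Keys.
```

Here the walk only visits the keys inside the time window plus one key
past its end, so the cost follows the size of the window rather than
the size of the table.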

So this search:
Selection is [{{time_parameter,'$1','$2','_'},
               [{'<',{const,1607.54},'$1'},
                {'=<','$1',{const,1607.58}},
                {'orelse',{'=:=',{const,"NCSA0005"},'$2'},
                          {'=:=',{const,"NCSA0010"},'$2'},
                          {'=:=',{const,"NCSA0014"},'$2'},
                          {'=:=',{const,"NCSA0016"},'$2'},
                          {'=:=',{const,"NCSA0018"},'$2'}}],
               ['$_']}]
does not really get slower with the number of those parameters ($2)
(compared to the total time of the search).
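
A self-contained sketch of running such a match specification, using an
ets bag table instead of mnesia so that it needs no schema
(mnesia:select/2 takes a match specification of the same shape). The
record's third field name, value, and the sample rows are assumptions:

```erlang
-module(select_demo).
-export([demo/0]).

%% Field names time and parameter come from the thread; value is a guess.
-record(time_parameter, {time, parameter, value}).

demo() ->
    Tab = ets:new(time_parameter, [bag, {keypos, #time_parameter.time}]),
    ets:insert(Tab,
               [#time_parameter{time = 1607.50, parameter = "NCSA0005", value = 1},
                #time_parameter{time = 1607.56, parameter = "NCSA0005", value = 2},
                #time_parameter{time = 1607.56, parameter = "NCSA0099", value = 3},
                #time_parameter{time = 1607.58, parameter = "NCSA0010", value = 4}]),
    %% Same structure as the selection above, with two parameter names.
    Selection = [{{time_parameter, '$1', '$2', '_'},
                  [{'<', {const, 1607.54}, '$1'},
                   {'=<', '$1', {const, 1607.58}},
                   {'orelse', {'=:=', {const, "NCSA0005"}, '$2'},
                              {'=:=', {const, "NCSA0010"}, '$2'}}],
                  ['$_']}],
    Rows = ets:select(Tab, Selection),
    ets:delete(Tab),
    %% Return just {Time, Parameter}, sorted, to make the result easy to read.
    lists:sort([{T, P} || #time_parameter{time = T, parameter = P} <- Rows]).
```

select_demo:demo() returns [{1607.56,"NCSA0005"},{1607.58,"NCSA0010"}].
Because the time bounds sit in the guards and not in a bound key, the
select on a bag table has to look at every row.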
And yes, like you thought, I do need the other parameters real soon :) 

So I think (at the moment) that the best route is to first narrow the
search range with some kind of fragmented tables. I'll let you know ...

regards
	Jouni

-- 

  Jouni Rynö                            mailto://Jouni.Ryno@fmi.fi/
                                        http://www.geo.fmi.fi/~ryno/
  Finnish Meteorological Institute      http://www.fmi.fi/
  Space Research                        http://www.geo.fmi.fi/
  P.O.BOX 503                           Tel      (+358)-9-19294656
  FIN-00101 Helsinki                    FAX      (+358)-9-19294603
  Finland                               priv-GSM (+358)-50-5302903

  "It's just zeros and ones, it cannot be hard"