[erlang-questions] Using ETS for large amounts of data?

Tue Aug 10 01:13:25 CEST 2010

On Mon, Aug 09, 2010 at 02:22:15PM -0700, Paul Mineiro wrote:
> On Mon, 9 Aug 2010, Anthony Molinaro wrote:
> 
> > Hi,
> >
> >   I've got some data in a file of the form
> >
> > start_integer|end_integer|data_field1|data_field2|..|data_fieldN
> >
> > Where N is 12.  Most fields are actually smallish integers.
> > The total number of entries in the file is 9964217 lines.
> > and the total file size is 752265457.
> >
> > I want to load these into an erlang process and essentially do lookups
> > of the form
> >
> >   given an integer M
> >
> >   if M >= start_integer && M <= end_integer then
> >     return data
> ...
> > So is ets up to this task at all? and if not any suggestions?
> 
> Here's my initial thoughts:
> 
> 1) use an interval tree to encode (start, end) -> index
>    1a) http://en.wikipedia.org/wiki/Interval_tree
>    1b) if your intervals do not overlap you can use ets ordered set for
> this part, but use the beginning of the interval for the key, and then use
> ets:prev/2 to find the entry
> 2) put the rest of the data fields in a huge flat binary and look them up
> by reading the index from the interval tree ... or hey just seek around a
> disk file, and let the OS cache keep the hot spots warm.

Since my intervals are closed and non-intersecting, it looks like storing
just the interval start and an index is a win.   I was able to create an
ets table with the starts and random indexes, save it with ets:tab2file,
then load it quickly and do ets:prev calls.
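
A minimal sketch of that start-key lookup, assuming an ordered_set table
of {Start, Index} tuples. The module and function names are made up for
illustration. Note that ets:prev/2 returns the key strictly below the one
given, so an exact hit on an interval start needs an ets:lookup/2 first.

```erlang
-module(interval_lookup).
-export([new/1, find/2]).

%% Starts is a list of {Start, Index} pairs for closed, non-overlapping
%% intervals. Only the start of each interval is stored as the key.
new(Starts) ->
    Tab = ets:new(intervals, [ordered_set]),
    true = ets:insert(Tab, Starts),
    Tab.

%% Return {ok, Start, Index} for the largest stored start =< M,
%% or none if M is below every stored start.
find(Tab, M) ->
    case ets:lookup(Tab, M) of
        [{M, Index}] ->
            {ok, M, Index};              % M is exactly an interval start
        [] ->
            case ets:prev(Tab, M) of     % largest key strictly below M
                '$end_of_table' ->
                    none;
                Start ->
                    {ok, Start, ets:lookup_element(Tab, Start, 2)}
            end
    end.
```

Because only the starts are stored, an M that falls in a gap between two
intervals still returns the preceding interval; the end value would have
to be kept (in the table or in the record) to reject such an M.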

So now for the data file.  It looks like my structure is actually

4 uint8 integers
5 uint16 integers
1 uint32 integer
2 floats (4 byte precision should be fine)

So fixed length record size of 26 bytes for a total of 259069642 bytes.
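
As a sketch, the layout above could be decoded with a single bit syntax
pattern. The field order (as listed above) and big-endian byte order are
assumptions here; they have to match whatever pack template is used on
the perl side. The module and function names are hypothetical.

```erlang
-module(record_layout).
-export([record_size/0, decode/1]).

%% 4 uint8 + 5 uint16 + 1 uint32 + 2 four-byte floats = 26 bytes,
%% and 9964217 records * 26 bytes = 259069642 bytes in total.
record_size() ->
    4*1 + 5*2 + 1*4 + 2*4.

%% Decode one 26-byte record. Integer segments default to unsigned
%% big-endian; the field order is assumed, not taken from the real file.
decode(<<A1:8, A2:8, A3:8, A4:8,
         B1:16, B2:16, B3:16, B4:16, B5:16,
         C:32,
         F1:32/float, F2:32/float>>) ->
    {[A1, A2, A3, A4], [B1, B2, B3, B4, B5], C, [F1, F2]}.
```

If the file is written in native or little-endian order instead, adding
the little or native flag to each multi-byte segment (for example
B1:16/little) would be the corresponding change.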

Since I'm generating these files from perl, I'll use pack on the encode
side, then the bit syntax to decode, but I'm wondering what the most
efficient way of pulling an individual record out of the large binary
is (I plan to use file:read_file/1 to get the binary).

Is it using bit syntax to pull out 26 bytes?

%% Index is the byte offset of the record within BigBinary
<<_:Index/binary, Record:26/binary, _/binary>> = BigBinary,
<<V1:8, V2:8, V3....>> = Record

or is it better to use something like split_binary/2?

{_Before, At} = split_binary(BigBinary, Index),
{Record, _After} = split_binary(At, 26),
<<V1:8, V2:8, V3....>> = Record

I was going to try some timing tests, but figured I'd see if anyone has any
idea (I'm using R13B04 if that matters).
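
One possible shape for such a timing test, as a rough sketch only: both
extraction styles wrapped in functions and run over the same list of
byte offsets under timer:tc/3. The module name, function names, and the
fixed stride used to pick offsets are all made up for illustration, and
the numbers it produces will depend on the release and the machine.

```erlang
-module(record_fetch).
-export([by_match/2, by_split/2, time/3, loop/3]).

-define(RECORD_SIZE, 26).

%% Offset is a byte offset, i.e. record number * 26.
by_match(Big, Offset) ->
    <<_:Offset/binary, Record:?RECORD_SIZE/binary, _/binary>> = Big,
    Record.

by_split(Big, Offset) ->
    {_Before, At} = split_binary(Big, Offset),
    {Record, _After} = split_binary(At, ?RECORD_SIZE),
    Record.

%% Time N extractions using by_match or by_split (passed as an atom).
%% Returns elapsed microseconds as reported by timer:tc/3.
time(Which, Big, N) ->
    NumRecords = byte_size(Big) div ?RECORD_SIZE,
    %% Deterministic spread of offsets; not a realistic access pattern.
    Offsets = [((I * 7919) rem NumRecords) * ?RECORD_SIZE
               || I <- lists:seq(1, N)],
    {Micros, ok} = timer:tc(?MODULE, loop, [Which, Big, Offsets]),
    Micros.

loop(_Which, _Big, []) ->
    ok;
loop(Which, Big, [Offset | Rest]) ->
    _ = ?MODULE:Which(Big, Offset),
    loop(Which, Big, Rest).
```

For example, record_fetch:time(by_match, Big, 1000000) and
record_fetch:time(by_split, Big, 1000000) on the binary returned by
file:read_file/1 would give two figures to compare.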

Thanks for the help,

-Anthony

-- 
------------------------------------------------------------------------
Anthony Molinaro                           <anthonym@REDACTED>