[erlang-questions] Using ETS for large amounts of data?

Anthony Molinaro anthonym@REDACTED
Mon Aug 9 21:15:33 CEST 2010


On Mon, Aug 09, 2010 at 08:44:25PM +0200, Jesper Louis Andersen wrote:
> On Mon, Aug 9, 2010 at 8:08 PM, Anthony Molinaro
> <anthonym@REDACTED> wrote:
> > Hi,
> >
> >  I've got some data in a file of the form
> >
> > start_integer|end_integer|data_field1|data_field2|..|data_fieldN
> >
> > Where N is 12.  Most fields are actually smallish integers.
> > The total number of entries in the file is 9964217 lines.
> > and the total file size is 752265457.
> 
> Some math to get you started:
> 
> Assume the 12 fields are smallish integers. Integers are *at least* 8
> bytes in size in our guessing game:
> 
> 1> Lines = 9964217.
> 9964217
> ...
> 4> Lines * 14 * 8 / (1024 * 1024).
> 1064.293197631836
> 
> So a reasonable lower bound on your data is 1Gb. Now, that is assuming
> we can pack the integers optimally.

Actually, while they are small integers, in the file they are ASCII, so
that's about 750 MB of ASCII numbers.  I'm currently storing them as ASCII
(I was curious whether I could store that much data in ETS, and since a
binary encoding should be smaller anyway, this gives a better upper bound;
at least a few of the fields are strings as well).  In any case, I'm fine
with 1 or even 2 GB of space.
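
One way to check these guesses is to load a sample of rows into ETS and ask
the VM what it actually used.  The following is only a minimal sketch: the
module name, the function names and both row layouts are made up for
illustration and are not the real data format, and it assumes a reasonably
recent OTP (lists:join/2).

```erlang
-module(ets_size_sketch).
-export([bytes_per_row/2, int_row/1, ascii_row/1]).

%% Insert N rows produced by RowFun into a fresh set table and return the
%% average number of bytes ETS reports per row (hash table overhead included).
bytes_per_row(RowFun, N) when is_function(RowFun, 1), N > 0 ->
    Tab = ets:new(size_sketch, [set, private]),
    WordSize = erlang:system_info(wordsize),
    Empty = ets:info(Tab, memory),           % words used by the empty table
    [true = ets:insert(Tab, RowFun(I)) || I <- lists:seq(1, N)],
    Full = ets:info(Tab, memory),            % words used after the inserts
    true = ets:delete(Tab),
    (Full - Empty) * WordSize / N.

%% Hypothetical row layout: {Start, End, F1..F12}, all small integers.
int_row(I) ->
    list_to_tuple([I * 10, I * 10 + 9 | lists:seq(1, 12)]).

%% Hypothetical alternative: the 12 data fields kept as ASCII in one binary.
ascii_row(I) ->
    Fields = lists:join($|, [integer_to_list(F) || F <- lists:seq(1, 12)]),
    {I * 10, I * 10 + 9, iolist_to_binary(Fields)}.
```

Calling ets_size_sketch:bytes_per_row(fun ets_size_sketch:int_row/1, 100000)
and multiplying the result by 9964217 gives a rough projection for the whole
file.  Large binaries are stored outside the table, so they may not show up
in this figure.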

> A more precise bet on the bound can be had:
> 
> 1> byte_size(term_to_binary({data, 1, 2, 1, 2, 3, 4, 5, 6, 7, 8, 9,
> 10, 11, 12})).
> 38
> 
> So:
> 
> 4> Lines * 38 / (1024 * 1024).
> 361.09947776794434
> 
> Or perhaps even:
> 
> 5> byte_size(term_to_binary({data, 1000000, 2000000, 1000000, 2000000,
> 300, 4000, 5000, 600, 7000, 80000, 90000000, 10000000, 11000000,
> 12000000})).
> 80
> 
> Yielding:
> 
> 6> Lines * 80 / (1024 * 1024).
> 760.2094268798828
> 
> 
> One of the VM-people will definitely be able to shed more light on
> what the implementation does.
> 
> The real killer is if your data is much larger than this and nothing
> is done to compress stored terms in ETS.

I actually already do some compression by taking common values and putting
them in lookup tables (well, tuples holding all the final values, with
element/2 used to index the right one).  That's what got me to integer
values.  I guess the real question is whether this amount of data is even
feasible in ETS.  Has anyone out there tried to store 10 million values in
ETS?  What other solutions have others tried?
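
To make the lookup-table idea concrete, here is a minimal sketch of that kind
of encoding.  The module and function names are hypothetical, and it assumes
an OTP release with maps; it is not the actual code, only the general shape:
each distinct value gets a small integer code, and a tuple maps codes back to
values via element/2.

```erlang
-module(lookup_sketch).
-export([build/1, encode/2, decode/2]).

%% Build a lookup table from a list of values.  Returns {Index, Table} where
%% Index maps each distinct value to its position and Table is a tuple of the
%% distinct values, in sorted order, for decoding with element/2.
build(Values) ->
    Unique = lists:usort(Values),
    Table = list_to_tuple(Unique),
    Index = maps:from_list(lists:zip(Unique, lists:seq(1, length(Unique)))),
    {Index, Table}.

%% Replace a value by its small integer code (crashes on unknown values).
encode(Value, {Index, _Table}) ->
    maps:get(Value, Index).

%% Recover the original value from its code.
decode(Code, {_Index, Table}) ->
    element(Code, Table).
```

Only the integer code is stored in each ETS row; the tuple is kept once and
consulted when a row is read back.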

Thanks for the help,

-Anthony


-- 
------------------------------------------------------------------------
Anthony Molinaro                           <anthonym@REDACTED>

