[erlang-questions] Using ETS for large amounts of data?

Ulf Wiger ulf.wiger@REDACTED
Mon Aug 9 22:26:01 CEST 2010


The better function for checking the size of a data object
is erts_debug:flat_size(Data), although it will give
misleading results if Data contains binaries, since large
binaries live off-heap and only their small on-heap
reference is counted.

It returns the size in heap words (4 bytes in 32-bit Erlang
and 8 in 64-bit).
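
A minimal sketch of turning that word count into bytes. The module
and function names (term_size, bytes/1) are made up for illustration;
erts_debug:flat_size/1 and erlang:system_info(wordsize) are the only
library calls used.

```erlang
-module(term_size).
-export([bytes/1]).

%% Estimated heap size of Term in bytes.
%% erts_debug:flat_size/1 returns heap words, and
%% erlang:system_info(wordsize) returns the word size in bytes
%% (4 on 32-bit, 8 on 64-bit). Off-heap binary data is not counted.
bytes(Term) ->
    erts_debug:flat_size(Term) * erlang:system_info(wordsize).
```

For example, term_size:bytes({data, 1, 2, 3}) measures a 4-tuple of
immediates: one header word plus four element words, so 5 words,
which is 40 bytes on a 64-bit emulator.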

I have stored 5 million records in an ets table
(http://www.erlang.org/pipermail/erlang-questions/2005-November/017728.html)
without any problems.

I'm not sure exactly what you mean when you say you
are storing ASCII values. If you happen to be storing
strings, you should be aware that each character (a small
integer) in an Erlang list occupies two heap words (8 or 16
bytes depending on word size). Anyway, the above function
will help you figure out how large your objects are.
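
To make the cost of strings concrete, here is a small sketch that
compares the heap words used by an integer and by its decimal string
form. The module name str_size and the function compare/1 are
hypothetical helpers, not something from this thread.

```erlang
-module(str_size).
-export([compare/1]).

%% Returns {WordsAsString, WordsAsInteger} for integer N.
%% The string form is a list: one cons cell (two words) per digit.
%% A small integer is an immediate value and needs no heap words.
compare(N) when is_integer(N) ->
    Str = integer_to_list(N),
    {erts_debug:flat_size(Str), erts_debug:flat_size(N)}.
```

str_size:compare(12345) gives {10, 0}: five digits at two words each
for the list, and nothing on the heap for the small integer itself.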

You might also want to look at how you move the data from
the file to the ets table. If you pull one line at a time,
you should be OK, but if you read all the data into the
process first and only then start moving it to ets, it's
probably the process heap, not ets, that's giving you grief. :)
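
A rough sketch of the line-at-a-time approach, under some
assumptions: the file is pipe-separated as described in the quoted
message, the first field is an integer used as the key, and the
remaining fields are kept as binaries. The module name ets_load and
its functions are hypothetical.

```erlang
-module(ets_load).
-export([load/2]).

%% Read File one line at a time and insert each parsed row into the
%% existing ets table Tab. Only one line is held in the process at
%% any moment. Returns {ok, NumberOfRows} or {error, Reason}.
load(File, Tab) ->
    {ok, Fd} = file:open(File, [read, binary, raw, read_ahead]),
    try
        load_lines(Fd, Tab, 0)
    after
        file:close(Fd)
    end.

load_lines(Fd, Tab, Count) ->
    case file:read_line(Fd) of
        {ok, Line} ->
            true = ets:insert(Tab, parse(Line)),
            load_lines(Fd, Tab, Count + 1);
        eof ->
            {ok, Count};
        {error, Reason} ->
            {error, Reason}
    end.

%% Assumed format: Start|End|Field1|...|FieldN, newline-terminated.
%% The first field becomes an integer key; the rest stay as binaries.
parse(Line) ->
    Trimmed = binary:replace(Line, <<"\n">>, <<>>),
    [Start | Rest] = binary:split(Trimmed, <<"|">>, [global]),
    list_to_tuple([binary_to_integer(Start) | Rest]).
```

Usage would be along the lines of Tab = ets:new(data, [set]) followed
by ets_load:load("data.txt", Tab), where "data.txt" is a placeholder
file name.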

BR,
Ulf W

Anthony Molinaro wrote:
> On Mon, Aug 09, 2010 at 08:44:25PM +0200, Jesper Louis Andersen wrote:
>> On Mon, Aug 9, 2010 at 8:08 PM, Anthony Molinaro
>> <anthonym@REDACTED> wrote:
>>> Hi,
>>>
>>>  I've got some data in a file of the form
>>>
>>> start_integer|end_integer|data_field1|data_field2|..|data_fieldN
>>>
>>> Where N is 12.  Most fields are actually smallish integers.
>>> The total number of entries in the file is 9964217 lines.
>>> and the total file size is 752265457.
>> Some math to get you started:
>>
>> Assume the 12 fields are smallish integers. Integers are *at least* 8
>> bytes in size in our guessing game:
>>
>> 1> Lines = 9964217.
>> 9964217
>> ...
>> 4> Lines * 14 * 8 / (1024 * 1024).
>> 1064.293197631836
>>
>> So a reasonable lower bound on your data is 1 GB. Now, that is assuming
>> we can pack the integers optimally.
> 
> Actually, while they are small integers, in the file they are ASCII, so
> 750 MB of ASCII numbers.  I'm actually attempting to store them as ASCII
> currently (just because I was curious if I could store that much data in
> ETS, and figured binary would be smaller anyway, so this gives a better
> upper bound and at least a few of the fields are strings).  But anyway,
> I'm fine with 1 or even 2 GB of space.
> 
>> A more precise bet on the bound can be had:
>>
>> 1> byte_size(term_to_binary({data, 1, 2, 1, 2, 3, 4, 5, 6, 7, 8, 9,
>> 10, 11, 12})).
>> 38
>>
>> So:
>>
>> 4> Lines * 38 / (1024 * 1024).
>> 361.09947776794434
>>
>> Or perhaps even:
>>
>> 5> byte_size(term_to_binary({data, 1000000, 2000000, 1000000, 2000000,
>> 300, 4000, 5000, 600, 7000, 80000, 90000000, 10000000, 11000000,
>> 12000000})).
>> 80
>>
>> Yielding:
>>
>> 6> Lines * 80 / (1024 * 1024).
>> 760.2094268798828
>>
>>
>> One of the VM-people will definitely be able to shed more light on
>> what the implementation does.
>>
>> The real killer is if your data is much larger than this and nothing
>> is done to compress stored terms in ETS.
> 
> I actually already do some compression by taking common values and having
> some lookup tables (well, tuples with all the end values, and use element
> to index the right value).  That's what got me to integer values.  I guess
> the real question I had is whether this amount of data is even feasible in
> ETS.  Has anyone out there tried to store 10 million values in ETS?  What
> other solutions have others tried?
> 
> Thanks for the help,
> 
> -Anthony
> 
> 


-- 
Ulf Wiger
CTO, Erlang Solutions Ltd, formerly Erlang Training & Consulting Ltd
http://www.erlang-solutions.com

