[erlang-questions] Using ETS for large amounts of data?

Tue Sep 7 05:31:26 CEST 2010

On Mon, Aug 30, 2010 at 5:56 AM, Hynek Vychodil <hynek@REDACTED> wrote:

> I have very similar experience.
>
> So keeping it in one big binary and store only pointers will save you
> 300 - 400 MB of data depending of length of item (16-26B). Anyway
> using better tuned k/v storage would be better.
>
>
Did you check erlang:memory/0 by any chance?  If I run your example in the
shell directly vs. in its own process, I get drastically different results,
memory-wise.  It seems that when your example runs in the shell, ERTS
holds onto all the binaries.

36> ets:info(T).
[{memory,161554577},
 {owner,<0.31.0>},
 {heir,none},
 {name,foo2},
 {size,10000000},
 {node,nonode@REDACTED},
 {named_table,false},
 {type,set},
 {keypos,1},
 {protection,private}]
37> erlang:memory().
[{total,2108524432},
 {processes,1150232},
 {processes_used,1135752},
 {system,2107374200},
 {atom,619969},
 {atom_used,593262},
 {binary,646419816},
 {code,5920745},
 {ets,1452847016}]

Note that I'm running a 64-bit VM.  Here is the same information when the
example runs in a separate process spawned from the shell.

21> Pid ! {from,get}.
Info: [{memory,141554577},
       {owner,<0.70.0>},
       {heir,none},
       {name,foo},
       {size,10000000},
       {node,nonode@REDACTED},
       {named_table,false},
       {type,set},
       {keypos,1},
       {protection,protected}]
{from,get}
22> erlang:memory().
[{total,1302212544},
 {processes,1245760},
 {processes_used,1232080},
 {system,1300966784},
 {atom,619969},
 {atom_used,593175},
 {binary,22272},
 {code,5920745},
 {ets,1292837352}]

Notice that the binary memory varies greatly between the two methods:
646419816 bytes when run in the shell vs. 22272 bytes in the spawned
process.
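
To make the comparison easier to repeat, here is a minimal sketch of the
"separate process" variant.  It is not the original example from the thread:
the module name, the helper functions and the 20-byte payload are all made up
for illustration.  The idea is only that the table is filled by its own
process, so nothing built along the way stays reachable from the shell
process.

```erlang
-module(ets_mem_sketch).
-export([start/1, info/1, binary_mem/0]).

%% Spawn a process that creates, fills and owns the table.
%% N is the number of items to insert.
start(N) ->
    spawn(fun() ->
                  T = ets:new(foo, [set, protected]),
                  fill(T, N),
                  loop(T)
          end).

%% Insert N items; the 20-byte value is a hypothetical payload.
fill(_T, 0) ->
    ok;
fill(T, N) ->
    ets:insert(T, {N, <<N:160>>}),
    fill(T, N - 1).

%% Answer {From, get} with the table info until told to stop.
loop(T) ->
    receive
        {From, get} ->
            From ! {info, ets:info(T)},
            loop(T);
        stop ->
            ok
    end.

%% Ask the owner process for ets:info/1 of its table.
info(Pid) ->
    Pid ! {self(), get},
    receive
        {info, Info} -> Info
    after 5000 ->
        timeout
    end.

%% Garbage collect the calling process only, then report the
%% binary total from erlang:memory/1.
binary_mem() ->
    erlang:garbage_collect(),
    erlang:memory(binary).
```

From the shell one could then run Pid = ets_mem_sketch:start(10000000),
followed by ets_mem_sketch:info(Pid) and ets_mem_sketch:binary_mem(), and
compare the numbers with a run done directly in the shell.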

I recently ran a bunch of tests on binaries to try to understand their
behavior better.  I'm also using them to handle large (500M+) CSV files.  I
noticed that if I split the CSV file into a list of lists (i.e. one list of
column values per line) using the binary:split function, it consumed a _lot_
of memory.  I thought this would be efficient because, after reading the
docs, I was under the impression that the sub binaries would simply reference
the bigger binary off-heap.  The behavior I noticed, however, seemed to
indicate that, depending on the size of the binary resulting from the split,
it might reside on the heap.  Whether that's actually what was happening, I'm
not sure.  I ended up rewriting my functions to transform the CSV a line at a
time and build up a new binary in an accumulator.  This performed much better
and used much less memory.  Anyway, this is slightly tangential to the
problem being discussed, so maybe I'll just post a new thread.
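
For reference, a rough sketch of the line-at-a-time approach.  This is not my
actual code; the module name, the function names and the Fun argument are
hypothetical, and it assumes "\n" line endings and plain "," separators with
no quoting.  The second function uses binary:referenced_byte_size/1 to see
how many extra bytes a sub binary keeps alive.  One possible factor (a guess,
not something I measured) is that every sub binary is itself a small term on
the process heap, so for short fields the bookkeeping can outweigh the data.

```erlang
-module(csv_acc_sketch).
-export([transform/2, extra_referenced/1]).

%% Walk the input one line at a time, apply Fun to the list of
%% column values, and append the result to one accumulator binary.
%% Fun must return iodata.
transform(Bin, Fun) when is_binary(Bin), is_function(Fun, 1) ->
    transform(Bin, Fun, <<>>).

transform(<<>>, _Fun, Acc) ->
    Acc;
transform(Bin, Fun, Acc) ->
    {Line, Rest} = next_line(Bin),
    Cols = binary:split(Line, <<",">>, [global]),
    Out = iolist_to_binary(Fun(Cols)),
    %% Only the current line's columns are live at any time.
    transform(Rest, Fun, <<Acc/binary, Out/binary, "\n">>).

%% Split off the first line; the last line may lack a newline.
next_line(Bin) ->
    case binary:split(Bin, <<"\n">>) of
        [Line, Rest] -> {Line, Rest};
        [Line]       -> {Line, <<>>}
    end.

%% Bytes referenced by Sub beyond its own size.  A value above
%% zero means Sub points into a larger binary; binary:copy/1
%% gives a standalone copy.
extra_referenced(Sub) when is_binary(Sub) ->
    binary:referenced_byte_size(Sub) - byte_size(Sub).
```

For example, csv_acc_sketch:transform(Bin, fun([A, B]) -> [B, $,, A] end)
would swap the columns of a two-column file, one line at a time.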

-Ryan