[erlang-questions] Using ETS for large amounts of data?
Ryan Zezeski
rzezeski@REDACTED
Tue Sep 7 05:31:26 CEST 2010
On Mon, Aug 30, 2010 at 5:56 AM, Hynek Vychodil <hynek@REDACTED> wrote:
> I have very similar experience.
>
> So keeping it in one big binary and storing only pointers will save you
> 300 - 400 MB of data, depending on the length of each item (16-26B). Anyway,
> using a better-tuned k/v store would be better.
>
>
Did you check erlang:memory/0 by any chance? If I run your example directly
in the shell vs. in its own process, I get drastically different results
memory-wise. When the example runs in the shell, it seems that ERTS
holds onto all the binaries.
36> ets:info(T).
[{memory,161554577},
{owner,<0.31.0>},
{heir,none},
{name,foo2},
{size,10000000},
{node,nonode@REDACTED},
{named_table,false},
{type,set},
{keypos,1},
{protection,private}]
37> erlang:memory().
[{total,2108524432},
{processes,1150232},
{processes_used,1135752},
{system,2107374200},
{atom,619969},
{atom_used,593262},
{binary,646419816},
{code,5920745},
{ets,1452847016}]
Note that I'm running a 64-bit VM. Here is the same information when
running in a separate process spawned in the shell.
21> Pid ! {from,get}.
Info: [{memory,141554577},
{owner,<0.70.0>},
{heir,none},
{name,foo},
{size,10000000},
{node,nonode@REDACTED},
{named_table,false},
{type,set},
{keypos,1},
{protection,protected}]
{from,get}
22> erlang:memory().
[{total,1302212544},
{processes,1245760},
{processes_used,1232080},
{system,1300966784},
{atom,619969},
{atom_used,593175},
{binary,22272},
{code,5920745},
{ets,1292837352}]
Notice that the binary memory varies greatly between the two methods.
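For anyone who wants to repeat the comparison, here is a minimal sketch of the "separate process" variant. The module name, the table name, and the 20-byte payload are made up for illustration and are not the code from the example above. Keep in mind that ets:info/1 reports memory in words, so on a 64-bit VM you multiply by 8 before comparing it with the byte counts from erlang:memory/0.

```erlang
-module(ets_mem_sketch).
-export([start/1, report/1]).

%% Spawn a process that owns the table, so any temporary binaries
%% created while filling it belong to that process, not to the shell.
start(N) when is_integer(N), N >= 0 ->
    spawn(fun() ->
        T = ets:new(foo, [set, protected]),
        fill(T, N),
        %% Drop references to temporaries before we start reporting.
        erlang:garbage_collect(),
        loop(T)
    end).

%% Hypothetical payload: a 20-byte binary keyed by an integer.
fill(_T, 0) -> ok;
fill(T, N) ->
    ets:insert(T, {N, <<N:160>>}),
    fill(T, N - 1).

loop(T) ->
    receive
        {From, get} when is_pid(From) ->
            From ! {self(), [{size, ets:info(T, size)},
                             {ets_words, ets:info(T, memory)},
                             {binary_bytes, erlang:memory(binary)}]},
            loop(T);
        stop ->
            ok
    end.

%% Ask the owner process for its numbers; gives up after 5 seconds.
report(Pid) ->
    Pid ! {self(), get},
    receive
        {Pid, Info} -> Info
    after 5000 ->
        timeout
    end.
```

Something like `Pid = ets_mem_sketch:start(10000000), ets_mem_sketch:report(Pid).` from the shell should then give numbers comparable to the second listing, though the exact values will depend on the payload and the VM.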
I recently ran a bunch of tests on binaries to try to understand their
behavior better. I'm also using them to handle large (500M+) CSV files. I
noticed that if I split the CSV file into a list of lists (i.e. one list of
column values per row) using binary:split, it consumed a _lot_ of memory.
I thought this would be efficient because, after reading the docs, I was
under the impression that the sub-binaries would simply reference the bigger
off-heap binary. The behavior I saw, however, seemed to indicate that,
depending on the size of the binary resulting from the split, it might
reside on the heap. I'm not sure whether that's actually what was happening.
I ended up rewriting my functions to transform the CSV a line at a time and
build up a new binary in an accumulator. This performs much better and
uses much less memory. Anyway, this is slightly tangential to the problem
being discussed, so maybe I'll just post a new thread.
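To make the line-at-a-time approach concrete, here is a rough sketch of the idea rather than my actual code. The module and function names are made up, and the splitting is naive: it assumes "\n" line endings and no quoted fields. Each line is split into fields, handed to a caller-supplied fun, and the result is appended to one accumulator binary, so only one row's worth of field references is live at a time.

```erlang
-module(csv_acc_sketch).
-export([transform/2]).

%% Walk the input one line at a time, apply Fun to the list of fields,
%% and append whatever iodata Fun returns to a single accumulator binary.
transform(Bin, Fun) when is_binary(Bin), is_function(Fun, 1) ->
    transform(Bin, Fun, <<>>).

transform(<<>>, _Fun, Acc) ->
    Acc;
transform(Bin, Fun, Acc) ->
    %% Split off the first line only; the rest stays as one sub-binary.
    {Line, Rest} =
        case binary:split(Bin, <<"\n">>) of
            [L, R] -> {L, R};
            [L]    -> {L, <<>>}   % last line without a trailing newline
        end,
    %% Naive field split: no support for quoted commas.
    Fields = binary:split(Line, <<",">>, [global]),
    Out = iolist_to_binary(Fun(Fields)),
    transform(Rest, Fun, <<Acc/binary, Out/binary, "\n">>).
```

For example, `csv_acc_sketch:transform(<<"a,b,c\nd,e,f\n">>, fun lists:last/1)` keeps only the last column of each row. If you want to check whether a small piece is keeping a large binary alive, binary:referenced_byte_size/1 reports the size of the underlying binary, and binary:copy/1 gives you a detached copy.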
-Ryan