[erlang-questions] Using ETS for large amounts of data?

Tue Sep 7 06:59:00 CEST 2010

I wrote:

>
> If you use an ETS table, you have to use integer index offsets as the
> return value or the binary data will be copied into the ETS on insert and
> out again on every lookup match.

And Ryan quickly proved me wrong:

> I think the documentation needs to be updated because I don't witness this
> behavior in R14A.  For example, I can read a 436M CSV file, store it in ETS
> and not see my memory usage go up.  Furthermore, I can then use lookup and
> match 400MB of the binary and also not see the memory go up.

The ETS table is kept in a separate memory space to avoid the garbage
collection overhead.  Whenever an entry is accessed it is copied onto the
heap of the Erlang process doing the lookup.  Large binaries are the
exception: they live in the binary heap, which is shared across the node,
so only the small reference structure is copied out.  If the binary is
small enough (up to 64 bytes) it is stored directly on the process heap
and copied whole, which is cheaper than reference counting it.  That is
the case when sending binaries from one process to another; I am not sure
about ETS tables, since they are a special optimized implementation.

To see the memory usage of ETS copies, you would need to do a lot of
lookups.  Lists, tuples and other structures will be copied on every
lookup, but binaries apparently follow the same pattern as message passing
of binaries.
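
A rough way to check this yourself is sketched below.  The module name
ets_bin_demo and the 1 MB size are my own choices for illustration.  It
stores one large binary in a table, reads it back, and reports the size
of the binary next to the memory ETS accounts to the table itself.  If
the binary data really stays on the shared binary heap, I would expect
the second number to be far smaller than the first.

```erlang
-module(ets_bin_demo).
-export([run/0]).

%% Store a 1 MB binary in ETS, look it up again, and return
%% {SizeOfBinaryInBytes, TableMemoryInBytes}.
run() ->
    Tab = ets:new(ets_bin_demo, [set]),
    %% 1 MB is well above the on-heap limit, so this is a
    %% reference-counted binary.
    Big = binary:copy(<<0>>, 1024 * 1024),
    true = ets:insert(Tab, {big, Big}),
    [{big, Out}] = ets:lookup(Tab, big),
    %% ets:info/2 reports table memory in words, so convert to bytes.
    Words = ets:info(Tab, memory),
    true = ets:delete(Tab),
    {byte_size(Out), Words * erlang:system_info(wordsize)}.
```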

If you created a process which was used as a replacement for ETS
containing a tree, array or tuple, lookups would end up not requiring
extra memory within the process but would incur a copy when sent as a
message to the other process requesting the lookup (and for binaries, the
behavior would be roughly the same as with ETS tables).  To accrue the
memory savings, the lookup data structure would have to be in the same
process and the data returned from a lookup has to be used as is.
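
A minimal sketch of such a replacement process follows, assuming a
gb_trees lookup structure.  The module name proc_store and the message
format are hypothetical.  The tree lives on the server's heap, and each
reply is copied into the caller's heap when the message is sent (large
binaries excepted, as above).

```erlang
-module(proc_store).
-export([start/1, lookup/2]).

%% Start a process that owns a gb_trees built from a list of
%% {Key, Value} pairs with unique keys.
start(Pairs) ->
    Tree = gb_trees:from_orddict(orddict:from_list(Pairs)),
    spawn(fun() -> loop(Tree) end).

%% Ask the store process for Key.  Returns {value, V} or none.
lookup(Pid, Key) ->
    Ref = erlang:monitor(process, Pid),
    Pid ! {lookup, self(), Ref, Key},
    receive
        {Ref, Reply} ->
            erlang:demonitor(Ref, [flush]),
            Reply;
        {'DOWN', Ref, process, Pid, Reason} ->
            exit({store_down, Reason})
    end.

loop(Tree) ->
    receive
        {lookup, From, Ref, Key} ->
            %% The reply term is copied to From's heap here.
            From ! {Ref, gb_trees:lookup(Key, Tree)},
            loop(Tree)
    end.
```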

-------

Anthony described his solution using integer offsets, an ETS table and
binary pattern matching to extract a sub-binary from the big binary...

> I didn't try to find out if just storing the 26 bytes as
> the values in the ets table was more efficient memory wise, but what
> I have now seems to work well.

This approach is close enough to optimal.  Pattern matching on binaries
used to be slow, but is now quite efficient.  26-byte binaries should get
passed to and from ETS / processes essentially as immediate data, since
only binaries bigger than 64 bytes are stored on the binary heap.  If it
meets your performance requirements, you are done until you change the
requirements.
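
For readers who did not follow the earlier thread, the offset approach
can be sketched roughly as below.  This is my own reconstruction, not
Anthony's code: the module name offset_index, the fixed 26-byte record
width and the function names are all assumptions.

```erlang
-module(offset_index).
-export([build/1, fetch/3]).

-define(REC_SIZE, 26).

%% Build one big binary from {Key, Rec} pairs (each Rec must be exactly
%% 26 bytes) plus an ETS table mapping Key to the record's byte offset.
build(Pairs) ->
    Tab = ets:new(offset_index, [set]),
    {Chunks, _End} =
        lists:mapfoldl(
          fun({Key, Rec}, Off) when byte_size(Rec) =:= ?REC_SIZE ->
                  true = ets:insert(Tab, {Key, Off}),
                  {Rec, Off + ?REC_SIZE}
          end, 0, Pairs),
    {Tab, iolist_to_binary(Chunks)}.

%% Look up the integer offset in ETS, then pattern match the record out
%% of the big binary.
fetch(Tab, Big, Key) ->
    case ets:lookup(Tab, Key) of
        [{Key, Off}] ->
            <<_:Off/binary, Rec:?REC_SIZE/binary, _/binary>> = Big,
            {ok, Rec};
        [] ->
            not_found
    end.
```

Only a small integer crosses the ETS boundary on each lookup; the big
binary has to be held by (or passed to) the process doing the match.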

--------

Ryan makes some observations about binary memory usage:

> If I run your example in the
> shell directly vs. in its own process I get drastically different
> results, memory wise.  It seems when running your example in the
> shell that ERTS holds onto all the binaries.

The shell has a feature to access the results of previous commands (the
20 most recent by default).  Try the following:

32> X = 5 * 5.
25
33> v(32).
25

The shell command v(N) returns the result value of the expression at the
corresponding prompt number.  Therefore the binaries cannot be discarded
until they roll off the history stack.
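
One way to keep the shell from pinning large binaries is to do the work
in a throwaway process and hand back only a small summary.  The helper
below is a sketch under that assumption; the module name isolated is
hypothetical.

```erlang
-module(isolated).
-export([run/1]).

%% Run Fun in a fresh process and return its result.  Anything Fun
%% creates but does not return becomes garbage when that process exits.
run(Fun) when is_function(Fun, 0) ->
    Parent = self(),
    Tag = make_ref(),
    {Pid, MRef} = spawn_monitor(fun() -> Parent ! {Tag, Fun()} end),
    receive
        {Tag, Result} ->
            erlang:demonitor(MRef, [flush]),
            Result;
        {'DOWN', MRef, process, Pid, Reason} ->
            %% The worker died before replying; propagate the reason.
            exit(Reason)
    end.
```

From the shell, something like
isolated:run(fun() -> byte_size(binary:copy(<<0>>, 1024 * 1024)) end)
leaves only an integer in the history, not the binary itself.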

Ryan also noted excess memory usage with binary:split.

I believe binary pattern matching is now preferred over split.  If you do
create sub-binaries but do not want all of them, remember that none of the
memory occupied by the underlying large binary can be reclaimed as long as
any sub-binary references it.  If you filter the sub-binary list, you
should make a fresh copy of each retained sub-binary (for example with
binary:copy/1) to allow the old large binary to be recycled.
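
A small sketch of that filtering step, assuming newline-separated input.
The module and function names are mine.  binary:copy/1 gives each kept
piece its own storage, and binary:referenced_byte_size/1 can be used to
check how much memory a given binary is actually holding on to.

```erlang
-module(keep_parts).
-export([filter_lines/2]).

%% Split Big on newlines and keep only the lines for which Pred returns
%% true.  Each kept line is copied so that it no longer references Big.
filter_lines(Big, Pred) when is_binary(Big), is_function(Pred, 1) ->
    Lines = binary:split(Big, <<"\n">>, [global]),
    [binary:copy(Line) || Line <- Lines, Pred(Line)].
```

After the copy, binary:referenced_byte_size(Line) equals byte_size(Line)
for every kept line, so Big can be collected once nothing else refers to
it.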

jay