[erlang-questions] Using ETS for large amounts of data?

Anthony Molinaro <>
Mon Aug 30 00:27:00 CEST 2010


I'm not sure if I ever posted what I did.  I basically created two files
using perl: one was a binary containing ~10 million 26-byte records, the
other a sorted list of integer pairs, the first integer being the start
of a range and the second an offset into the data file.  I then had an
escript which transformed the index into an ets ordered_set.
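The index-building step boils down to something like the following
(function and table names are illustrative, not the actual escript):

```erlang
%% Sketch: build the ordered_set index from {RangeStart, Offset} pairs
%% and persist it with ets:tab2file/2 so it can later be reloaded
%% with ets:file2tab/1.
build_index(Pairs, File) ->
    Index = ets:new(index, [ordered_set]),
    true = ets:insert(Index, Pairs),   %% Pairs = [{RangeStart, Offset}, ...]
    ok = ets:tab2file(Index, File),
    ets:delete(Index).
```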

I then have a gen_server which loads the ets table using ets:file2tab/1
and the binary using file:read_file/1, and stores them both in the
gen_server state.  I then have a call which looks up the offset by

{Start, Offset} =
  case ets:lookup(Index, Val) of
    [] ->
      case ets:lookup(Index, ets:prev(Index, Val)) of
        [] -> {0, -1};
        V -> hd(V)
      end;
    V -> hd(V)
  end,

I then lookup the record in the binary via

<<_:Offset/binary, Record:26/binary, _/binary>> = Binary
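Putting the two steps together, the whole lookup might be sketched as
below.  The explicit '$end_of_table' guard is my addition; in the code
above the same case falls out implicitly, since ets:lookup/2 on
'$end_of_table' returns [].

```erlang
%% Sketch: probe the ordered_set index for Val, falling back to the
%% nearest preceding range start, then slice the 26-byte record out of
%% the in-memory binary.
lookup(Index, Binary, Val) ->
    {_Start, Offset} =
        case ets:lookup(Index, Val) of
            [Pair] -> Pair;
            [] ->
                case ets:prev(Index, Val) of
                    '$end_of_table' -> {0, -1};   %% Val sorts before all ranges
                    Prev -> hd(ets:lookup(Index, Prev))
                end
        end,
    case Offset of
        -1 -> not_found;
        _  ->
            <<_:Offset/binary, Record:26/binary, _/binary>> = Binary,
            Record
    end.
```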

Thus I only have one copy of the binary, as well as one ets table.
Memory-wise the largest part is the ets table, at about 700-800M; with
the binary at about 250M I actually see something like 1.1G or so of
memory in total.  I didn't check whether storing the 26-byte records
directly as the values in the ets table would be more memory-efficient,
but what I have now seems to work well.

-Anthony

On Fri, Aug 27, 2010 at 11:41:41PM -0700,  wrote:
> You have two issues here the initial loading of the data, and the actions
> which happen on lookup.
> 
> If you use an ETS table, you have to use integer index offsets as the
> return value, or the binary data will be copied into the ETS table on
> insert and out again on every lookup match.  The index offset lookup
> should be fast, but it forces a repeated binary match step that is
> really unnecessary.
> 
> Try the following; you will probably get better overall performance
> when other things are running and putting pressure on memory:
> 
> 1) Read in the whole file as a single binary
> 2) Never append or modify this binary so it doesn't get copied
> 3) Iterate over it using bit syntax to match 26-byte segments
>    - [ Rec || <<Rec:26/binary>> <= FileBinary ]
>    - You will get sub-binaries which are pointers into the binary from step 1
> 4) Store each sub-binary as the value for a key in a tree or dict
> 
> You will probably do better in the long run by storing the intervals in a
> tree, dictionary, tuple or array.  Lookups might be slower (although
> depending on the key you might be surprised at how little difference there
> is), but there should be no memory copying and thus no future garbage
> collection pressure.  The pre-fetched sub-binaries will never incur a
> binary-match penalty or copying, and should avoid all garbage collection.
> 
> The underlying large binary will always be present, but it will be stored
> in the shared binary heap rather than internal to your process so message
> passing the sub-binaries should be small and fast as well (all receivers
> on the same node will point to the same underlying large binary in the
> single binary heap).  You could keep the lookup tree in a process and send
> messages out on requests for lookup.  You just can't distribute it off a
> single erlang node, but you should get good benefit on a multi-core
> processor.
> 
> jay
> 
> 
> 
> ________________________________________________________________
> erlang-questions (at) erlang.org mailing list.
> See http://www.erlang.org/faq.html
> To unsubscribe; mailto:
> 
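The steps jay describes above (1-4) might be sketched as follows; the
use of gb_trees and a positional key are my own illustrative choices,
not something the post prescribes:

```erlang
%% Sketch: load the file as one binary, split it into 26-byte
%% sub-binaries (which share the parent binary's storage rather than
%% copying it), and key them in a gb_trees structure.  The binary is
%% never appended to or modified, so it is never copied.
load(Path) ->
    {ok, Bin} = file:read_file(Path),          %% 1) whole file, one binary
    Recs = [R || <<R:26/binary>> <= Bin],      %% 3) 26-byte sub-binaries
    %% 4) key each sub-binary in a tree; here the key is just its
    %%    record number, but any lookup key would do
    lists:foldl(fun({K, R}, T) -> gb_trees:insert(K, R, T) end,
                gb_trees:empty(),
                lists:zip(lists:seq(0, length(Recs) - 1), Recs)).
```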

-- 
------------------------------------------------------------------------
Anthony Molinaro                           <>


