[erlang-questions] Using ETS for large amounts of data?

Sat Aug 28 08:41:41 CEST 2010

You have two issues here: the initial loading of the data, and the actions
which happen on lookup.

If you use an ETS table, you have to use integer index offsets as the
return value, or the binary data will be copied into the ETS table on insert
and out again on every lookup match.  The index offset should be fast, but
every lookup then repeats a binary match step that is really unnecessary.
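
For reference, the ETS-with-offsets variant might look roughly like the
sketch below.  The module name, the 4-byte key at the start of each record,
and the function names are my assumptions for illustration only; adjust them
to your actual record layout.

```erlang
-module(ets_offsets).
-export([build/1, lookup/3]).

-define(REC_SIZE, 26).   % record width used in this thread
-define(KEY_SIZE, 4).    % assumption: key is the first 4 bytes of a record

%% Store {Key, Offset} pairs. Only the short key and an integer are
%% copied into ETS, never the 26-byte record itself.
build(FileBinary) ->
    Tab = ets:new(offsets, [set]),
    build(FileBinary, 0, Tab).

build(FileBinary, Offset, Tab)
  when Offset + ?REC_SIZE =< byte_size(FileBinary) ->
    Key = binary:part(FileBinary, Offset, ?KEY_SIZE),
    ets:insert(Tab, {Key, Offset}),
    build(FileBinary, Offset + ?REC_SIZE, Tab);
build(_FileBinary, _Offset, Tab) ->
    Tab.

%% Every lookup has to re-extract the record from the big binary;
%% this is the repeated step mentioned above.
lookup(Key, Tab, FileBinary) ->
    case ets:lookup(Tab, Key) of
        [{Key, Offset}] -> {ok, binary:part(FileBinary, Offset, ?REC_SIZE)};
        [] -> error
    end.
```

Note that the caller still has to hold on to FileBinary itself, since the
table only knows offsets into it.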

Try the following; you will probably get better overall performance when
other things are running and putting pressure on memory:

1) Read in the whole file as a single binary
2) Never append or modify this binary so it doesn't get copied
3) Iterate over it using bit syntax to match 26-byte segments
   - [ Rec || <<Rec:26/binary>> <= FileBinary ]
   - You will get sub-binaries which are pointers into #1
4) Store each sub-binary as the value for a key in a tree or dict
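
Put together, the steps above could be sketched as follows.  This is only
an illustration: the module name, the choice of gb_trees, and the assumption
that the key is the first 4 bytes of each record are mine, not something
from your data format.

```erlang
-module(rec_index).
-export([load/1, index/1, lookup/2]).

-define(REC_SIZE, 26).   % record width from the steps above
-define(KEY_SIZE, 4).    % assumption: key is the first 4 bytes of a record

%% Read the whole file once; the resulting binary is never modified.
load(Path) ->
    {ok, FileBinary} = file:read_file(Path),
    index(FileBinary).

%% Split into 26-byte sub-binaries and store each one under its key.
%% Trailing bytes that do not fill a whole record are ignored.
index(FileBinary) ->
    Recs = [Rec || <<Rec:?REC_SIZE/binary>> <= FileBinary],
    lists:foldl(fun(<<Key:?KEY_SIZE/binary, _/binary>> = Rec, Tree) ->
                        gb_trees:enter(Key, Rec, Tree)
                end, gb_trees:empty(), Recs).

%% Returns the stored sub-binary as is, with no further matching.
lookup(Key, Tree) ->
    case gb_trees:lookup(Key, Tree) of
        {value, Rec} -> {ok, Rec};
        none -> error
    end.
```

A dict would work the same way with dict:store/3 and dict:find/2; I picked
gb_trees here only because ordered keys tend to suit interval data.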

You will probably do better in the long run by storing the intervals in a
tree, dictionary, tuple or array.  Lookups might be slower (although
depending on the key you might be surprised at how little difference there
is), but there should be no memory copying and thus no future garbage
collection pressure. The pre-fetched sub-binaries will never incur a
binary match penalty or copying, and should avoid all garbage collection.

The underlying large binary will always be present, but it will be stored
in the shared binary heap rather than internal to your process so message
passing the sub-binaries should be small and fast as well (all receivers
on the same node will point to the same underlying large binary in the
single binary heap).  You could keep the lookup tree in a process and send
messages out on requests for lookup.  You just can't distribute it off a
single Erlang node, but you should get good benefit on a multi-core
processor.
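
A lookup process along those lines might look something like this.  It is
a bare-bones sketch with names I made up (lookup_server, the message
shapes, not_found); in a real system you would more likely wrap it in a
gen_server.  It assumes the tree is a gb_trees tree of sub-binaries.

```erlang
-module(lookup_server).
-export([start/1, lookup/2, stop/1]).

%% Tree is assumed to be a gb_trees tree mapping keys to sub-binaries.
start(Tree) ->
    spawn(fun() -> loop(Tree) end).

%% Synchronous request; the monitor keeps the caller from hanging
%% forever if the server process has died.
lookup(Pid, Key) ->
    Ref = erlang:monitor(process, Pid),
    Pid ! {lookup, self(), Ref, Key},
    receive
        {Ref, Reply} ->
            erlang:demonitor(Ref, [flush]),
            Reply;
        {'DOWN', Ref, process, Pid, Reason} ->
            {error, Reason}
    end.

stop(Pid) ->
    Pid ! stop,
    ok.

loop(Tree) ->
    receive
        {lookup, From, Ref, Key} ->
            Reply = case gb_trees:lookup(Key, Tree) of
                        {value, Rec} -> {ok, Rec};
                        none -> not_found
                    end,
            %% The reply carries a sub-binary, so only a small reference
            %% into the large binary is sent, not the record bytes.
            From ! {Ref, Reply},
            loop(Tree);
        stop ->
            ok
    end.
```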

jay