[erlang-questions] Using ETS for large amounts of data?
Anthony Molinaro
anthonym@REDACTED
Mon Aug 30 00:27:00 CEST 2010
I'm not sure if I ever posted what I did, I basically created two files
using perl, one was a binary containing ~10 million 26 byte records, the
other was a sorted list containing set of 2 integers. The first the start
of a range, the second an offset into the data file. I then had an escript
which transformed the index into an ets ordered_set.
I then have a gen_server which loads up the ets table using ets:file2tab/1
and the binary using file:read_file/1, and store them both in the gen_server
state. I then have a call which looks up the offset by
{Start, Offset} =
case ets:lookup (Index, Val) of
[] ->
case ets:lookup (Index, ets:prev (Index, Val)) of
[] -> {0, -1};
V -> hd (V)
end;
V -> hd (V)
end,
I then lookup the record in the binary via
<<_:Offset/binary, Record:26/binary, _/binary>> = Binary
Thus I only have one copy of the binary, as well as an ets table. Memory
wise the largest part is the ets table, but it's only about 700-800M.
Then with the binary at about 250M I actually see something like 1.1G or
so of memory. I didn't try to find out if just storing the 26 bytes as
the values in the ets table was more efficient memory wise, but what
I have now seems to work well.
-Anthony
On Fri, Aug 27, 2010 at 11:41:41PM -0700, jay@REDACTED wrote:
> You have two issues here the initial loading of the data, and the actions
> which happen on lookup.
>
> If you use and ETS table, you have to use integer index offsets as the
> return value or the binary data will be copied into the ETS on insert and
> out again on every lookup match. The index offset should be fast, but it
> is a repeated binary match step that is really unnecessary.
>
> Try the following, you will probably get better overall performance with
> other things running and putting pressure on memory:
>
> 1) Read in the whole file as a single binary
> 2) Never append or modify this binary so it doesn't get copied
> 3) Iterate over it using bit syntax to match 26-byte segments
> - [ Rec || <<Rec:26/binary>> <= FileBinary ]
> - You will get sub-binaries which are pointers into #1
> 3) Store each sub-binary as the value for a key in a tree or dict
>
> You will probably do better in the long run by storing the intervals in a
> tree, dictionary, tuple or array. Lookups might be slower (although
> depending on the key you might be surprised at how little difference there
> is), but there should be no memory copying and thus no future garbage
> collection pressure. The pre-fetched sub-binaries will never incur a
> binary match penalty, nor copying and should avoid all garbage collection.
>
> The underlying large binary will always be present, but it will be stored
> in the shared binary heap rather than internal to your process so message
> passing the sub-binaries should be small and fast as well (all receivers
> on the same node will point to the same underlying large binary in the
> single binary heap). You could keep the lookup tree in a process and send
> messages out on requests for lookup. You just can't distribute it off a
> single erlang node, but you should get good benefit on a multi-core
> processor.
>
> jay
>
>
>
> ________________________________________________________________
> erlang-questions (at) erlang.org mailing list.
> See http://www.erlang.org/faq.html
> To unsubscribe; mailto:erlang-questions-unsubscribe@REDACTED
>
--
------------------------------------------------------------------------
Anthony Molinaro <anthonym@REDACTED>
More information about the erlang-questions
mailing list