[erlang-questions] Using ETS for large amounts of data?

Mon Aug 30 11:56:42 CEST 2010

I have very similar experience.

1> T = ets:new(foo,[private]).
16400
2> [ets:insert(T, [{X,<<"0123456789ABCDEF">>} ||
X<-lists:seq(1000*(Y-1)+1, 1000*Y)]) || Y <-lists:seq(1,10000)],ok.
ok
3> ets:info(T).
[{memory,161554561},
 {owner,<0.32.0>},
 {heir,none},
 {name,foo},
 {size,10000000},
 {node,nonode@REDACTED},
 {named_table,false},
 {type,set},
 {keypos,1},
 {protection,private}]
4> 161554561*4/1024/1024.
616.2817420959473

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
11598 hynek     20   0 1198m 1.1g 2196 S    0 28.6   1:12.34 beam.smp

Where those 500MB (1.1G - 600M) comes from? It is 50B per one 16B
value. If I put same amount of data in perl hash it take same 1.1GB
which is well known for wasting of memory.

Anyway:

1> T = ets:new(foo,[private]).
16400
2> [ets:insert(T, [{X,[]} || X<-lists:seq(1000*(Y-1)+1, 1000*Y)]) || Y
<-lists:seq(1,10000)],ok.
ok
3> ets:info(T).
[{memory,101554561},
 {owner,<0.32.0>},
 {heir,none},
 {name,foo},
 {size,10000000},
 {node,nonode@REDACTED},
 {named_table,false},
 {type,set},
 {keypos,1},
 {protection,private}]
4> 101554561*4/1024/1024.
387.39990615844727

RES from top is 471MB so it seems better.

1> T = ets:new(foo,[private]).
16400
2> [ets:insert(T, [{X,X*10,(X+1)*10} || X<-lists:seq(1000*(Y-1)+1,
1000*Y)]) || Y <-lists:seq(1,10000)],ok.
ok
3> ets:info(T).
[{memory,111554561},
 {owner,<0.32.0>},
 {heir,none},
 {name,foo},
 {size,10000001},
 {node,nonode@REDACTED},
 {named_table,false},
 {type,set},
 {keypos,1},
 {protection,private}]
4> 111554561*4/1024/1024.
425.54687881469727

RES: 547MB

So keeping it in one big binary and store only pointers will save you
300 - 400 MB of data depending of length of item (16-26B). Anyway
using better tuned k/v storage would be better.

On Mon, Aug 30, 2010 at 12:27 AM, Anthony Molinaro
<anthonym@REDACTED> wrote:
> I'm not sure if I ever posted what I did, I basically created two files
> using perl, one was a binary containing ~10 million 26 byte records, the
> other was a sorted list containing set of 2 integers.  The first the start
> of a range, the second an offset into the data file.  I then had an escript
> which transformed the index into an ets ordered_set.
>
> I then have a gen_server which loads up the ets table using ets:file2tab/1
> and the binary using file:read_file/1, and store them both in the gen_server
> state.  I then have a call which looks up the offset by
>
> {Start, Offset} =
>  case ets:lookup (Index, Val) of
>    [] ->
>      case ets:lookup (Index, ets:prev (Index, Val)) of
>        [] -> {0, -1};
>        V -> hd (V)
>      end;
>    V -> hd (V)
>  end,
>
> I then lookup the record in the binary via
>
> <<_:Offset/binary, Record:26/binary, _/binary>> = Binary
>
> Thus I only have one copy of the binary, as well as an ets table.  Memory
> wise the largest part is the ets table, but it's only about 700-800M.
> Then with the binary at about 250M I actually see something like 1.1G or
> so of memory.  I didn't try to find out if just storing the 26 bytes as
> the values in the ets table was more efficient memory wise, but what
> I have now seems to work well.
>
> -Anthony
>
> On Fri, Aug 27, 2010 at 11:41:41PM -0700, jay@REDACTED wrote:
>> You have two issues here the initial loading of the data, and the actions
>> which happen on lookup.
>>
>> If you use and ETS table, you have to use integer index offsets as the
>> return value or the binary data will be copied into the ETS on insert and
>> out again on every lookup match.  The index offset should be fast, but it
>> is a repeated binary match step that is really unnecessary.
>>
>> Try the following, you will probably get better overall performance with
>> other things running and putting pressure on memory:
>>
>> 1) Read in the whole file as a single binary
>> 2) Never append or modify this binary so it doesn't get copied
>> 3) Iterate over it using bit syntax to match 26-byte segments
>>    - [ Rec || <<Rec:26/binary>> <= FileBinary ]
>>    - You will get sub-binaries which are pointers into #1
>> 3) Store each sub-binary as the value for a key in a tree or dict
>>
>> You will probably do better in the long run by storing the intervals in a
>> tree, dictionary, tuple or array.  Lookups might be slower (although
>> depending on the key you might be surprised at how little difference there
>> is), but there should be no memory copying and thus no future garbage
>> collection pressure. The pre-fetched sub-binaries will never incur a
>> binary match penalty, nor copying and should avoid all garbage collection.
>>
>> The underlying large binary will always be present, but it will be stored
>> in the shared binary heap rather than internal to your process so message
>> passing the sub-binaries should be small and fast as well (all receivers
>> on the same node will point to the same underlying large binary in the
>> single binary heap).  You could keep the lookup tree in a process and send
>> messages out on requests for lookup.  You just can't distribute it off a
>> single erlang node, but you should get good benefit on a multi-core
>> processor.
>>
>> jay
>>
>>
>>
>> ________________________________________________________________
>> erlang-questions (at) erlang.org mailing list.
>> See http://www.erlang.org/faq.html
>> To unsubscribe; mailto:erlang-questions-unsubscribe@REDACTED
>>
>
> --
> ------------------------------------------------------------------------
> Anthony Molinaro                           <anthonym@REDACTED>
>
>
> ________________________________________________________________
> erlang-questions (at) erlang.org mailing list.
> See http://www.erlang.org/faq.html
> To unsubscribe; mailto:erlang-questions-unsubscribe@REDACTED
>
>

-- 
--Hynek (Pichi) Vychodil

Analyze your data in minutes. Share your insights instantly. Thrill
your boss.  Be a data hero!
Try GoodData now for free: www.gooddata.com