[erlang-questions] Using ETS for large amounts of data?
Hynek Vychodil
hynek@REDACTED
Mon Aug 30 11:56:42 CEST 2010
I have a very similar experience.
1> T = ets:new(foo,[private]).
16400
2> [ets:insert(T, [{X,<<"0123456789ABCDEF">>} ||
X<-lists:seq(1000*(Y-1)+1, 1000*Y)]) || Y <-lists:seq(1,10000)],ok.
ok
3> ets:info(T).
[{memory,161554561},
{owner,<0.32.0>},
{heir,none},
{name,foo},
{size,10000000},
{node,nonode@REDACTED},
{named_table,false},
{type,set},
{keypos,1},
{protection,private}]
4> 161554561*4/1024/1024.
616.2817420959473
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
11598 hynek 20 0 1198m 1.1g 2196 S 0 28.6 1:12.34 beam.smp
Where do those 500MB (1.1G - 600M) come from? That is 50B of overhead
per 16B value. If I put the same amount of data into a Perl hash it
takes the same 1.1GB, and Perl hashes are well known for wasting memory.
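Side note: ets:info/1 reports memory in words, so multiplying by 4
assumes a 32-bit emulator. A word-size-independent version of the same
calculation:

  ets:info(T, memory) * erlang:system_info(wordsize) / 1024 / 1024.

On a 64-bit emulator every word is 8 bytes, so the same table would
take twice as many bytes.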
Anyway:
1> T = ets:new(foo,[private]).
16400
2> [ets:insert(T, [{X,[]} || X<-lists:seq(1000*(Y-1)+1, 1000*Y)]) || Y
<-lists:seq(1,10000)],ok.
ok
3> ets:info(T).
[{memory,101554561},
{owner,<0.32.0>},
{heir,none},
{name,foo},
{size,10000000},
{node,nonode@REDACTED},
{named_table,false},
{type,set},
{keypos,1},
{protection,private}]
4> 101554561*4/1024/1024.
387.39990615844727
RES from top is 471MB, so it seems better.
1> T = ets:new(foo,[private]).
16400
2> [ets:insert(T, [{X,X*10,(X+1)*10} || X<-lists:seq(1000*(Y-1)+1,
1000*Y)]) || Y <-lists:seq(1,10000)],ok.
ok
3> ets:info(T).
[{memory,111554561},
{owner,<0.32.0>},
{heir,none},
{name,foo},
{size,10000001},
{node,nonode@REDACTED},
{named_table,false},
{type,set},
{keypos,1},
{protection,private}]
4> 111554561*4/1024/1024.
425.54687881469727
RES: 547MB
So keeping the data in one big binary and storing only pointers will
save you 300-400 MB, depending on the length of the items (16-26B).
Anyway, using a better-tuned k/v store would be better.
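For example, the pointer approach could look like this (the names and
the 16-byte record size are just for illustration): keep one big binary
around, store only {Key, Offset} in ETS, and slice with bit syntax on
lookup:

  lookup(Table, Bin, Key) ->
      [{Key, Offset}] = ets:lookup(Table, Key),
      %% slice the record out of the shared binary; no copy of Bin
      <<_:Offset/binary, Value:16/binary, _/binary>> = Bin,
      Value.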
On Mon, Aug 30, 2010 at 12:27 AM, Anthony Molinaro
<anthonym@REDACTED> wrote:
> I'm not sure if I ever posted what I did. I basically created two files
> using Perl: one was a binary containing ~10 million 26-byte records, the
> other a sorted list of integer pairs, the first being the start of a
> range, the second an offset into the data file. I then had an escript
> which transformed the index into an ets ordered_set.
>
> I then have a gen_server which loads up the ets table using ets:file2tab/1
> and the binary using file:read_file/1, and stores them both in the gen_server
> state. I then have a call which looks up the offset by
>
> {Start, Offset} =
>   case ets:lookup(Index, Val) of
>     [] ->
>       case ets:lookup(Index, ets:prev(Index, Val)) of
>         [] -> {0, -1};
>         V -> hd(V)
>       end;
>     V -> hd(V)
>   end,
>
> I then look up the record in the binary via
>
> <<_:Offset/binary, Record:26/binary, _/binary>> = Binary
>
> Thus I only have one copy of the binary, as well as an ets table. Memory-wise
> the largest part is the ets table, but it's only about 700-800M.
> Then with the binary at about 250M I actually see something like 1.1G or
> so of memory. I didn't try to find out whether just storing the 26 bytes as
> the values in the ets table was more efficient memory-wise, but what
> I have now seems to work well.
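>
> The loading step is then something like this (file names and the state
> shape are illustrative):
>
>   init([IndexFile, DataFile]) ->
>       {ok, Index} = ets:file2tab(IndexFile),   %% restore the ordered_set
>       {ok, Binary} = file:read_file(DataFile), %% one copy of the data
>       {ok, {Index, Binary}}.                   %% both kept in the state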
>
> -Anthony
>
> On Fri, Aug 27, 2010 at 11:41:41PM -0700, jay@REDACTED wrote:
>> You have two issues here: the initial loading of the data, and the actions
>> which happen on lookup.
>>
>> If you use an ETS table, you have to use integer index offsets as the
>> return value, or the binary data will be copied into the ETS table on insert
>> and out again on every lookup match. The index offset should be fast, but it
>> forces a repeated binary match step that is really unnecessary.
>>
>> Try the following; you will probably get better overall performance when
>> other things are running and putting pressure on memory:
>>
>> 1) Read in the whole file as a single binary
>> 2) Never append to or modify this binary, so it doesn't get copied
>> 3) Iterate over it using bit syntax to match 26-byte segments
>>    - [ Rec || <<Rec:26/binary>> <= FileBinary ]
>>    - You will get sub-binaries which are pointers into #1
>> 4) Store each sub-binary as the value for a key in a tree or dict
>>    (see the sketch below)
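>>
>> Putting steps 1-4 together, something like this (the file name and the
>> positional key are just for illustration):
>>
>>   load(File) ->
>>       {ok, FileBinary} = file:read_file(File),
>>       %% sub-binaries pointing into FileBinary, no copying
>>       Recs = [Rec || <<Rec:26/binary>> <= FileBinary],
>>       {_, Tree} = lists:foldl(fun(R, {N, T}) ->
>>                                   {N + 1, gb_trees:insert(N, R, T)}
>>                               end, {0, gb_trees:empty()}, Recs),
>>       Tree.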
>>
>> You will probably do better in the long run by storing the intervals in a
>> tree, dictionary, tuple or array. Lookups might be slower (although,
>> depending on the key, you might be surprised at how little difference there
>> is), but there should be no memory copying and thus no future garbage
>> collection pressure. The pre-fetched sub-binaries will never incur a
>> binary match penalty or copying, and should avoid garbage collection entirely.
>>
>> The underlying large binary will always be present, but it will be stored
>> in the shared binary heap rather than internal to your process, so message
>> passing the sub-binaries should be small and fast as well (all receivers
>> on the same node will point to the same underlying large binary in the
>> single binary heap). You could keep the lookup tree in a process and send
>> messages out in response to lookup requests. You just can't distribute it
>> off a single Erlang node, but you should get good benefit on a multi-core
>> processor.
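>>
>> A bare-bones sketch of such a lookup process (the message shape is just
>> for illustration):
>>
>>   loop(Tree) ->
>>       receive
>>           {lookup, From, Key} ->
>>               %% only the small sub-binary is copied into the reply;
>>               %% the large parent binary stays on the shared binary heap
>>               From ! {reply, gb_trees:lookup(Key, Tree)},
>>               loop(Tree)
>>       end.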
>>
>> jay
>>
>
> --
> ------------------------------------------------------------------------
> Anthony Molinaro <anthonym@REDACTED>
>
--
--Hynek (Pichi) Vychodil