[erlang-questions] Needed: Great big ordered int set.
Richard Carlsson
carlsson.richard@REDACTED
Mon Jul 8 22:31:49 CEST 2013
On 2013-07-08 20:21, Jachym Holecek wrote:
> You could experiment with packing your fixed-width integers into binaries. Keep
> integers sorted so you can bsearch on given binary chunk.
There is no point in using binaries to "pack" numbers. On a 64-bit
Erlang VM, the majority of those integers will be a fixnum anyway,
represented by a single 64-bit word. The remaining ones will be encoded
as bignums, which is just as efficient as using binaries (and both
simpler and faster).
I tried building an ordered_set ETS table of {N}-tuples, and it used 7
words per entry. For 500M entries, it would need 28 GB, so if you have
32 GB, it could work for you. Advantages: simple, and also fast.
I assume that these numbers are something like hashes, so that the
distribution is completely random, and since the keyspace is 64 bits,
you can't use tricks like bitmaps. In fact, the array module can be a
pretty space-efficient representation of this kind of data, and is even
reasonably fast.
The following (which populates the array with 'true' for all N in the
list, but doesn't count the lists:seq/1 call as part of the execution
time) took about 4 seconds on my machine for 1M entries:
1> ((fun(L) -> element(1, timer:tc(fun () -> lists:foldl(fun (X, A) ->
array:set(X, true, A) end, array:new(), L) end))
end)(lists:seq(1,1000000))).
4008000
Taking the data structure size in words instead of the time:
2> ((fun(L) -> erts_debug:flat_size(timer:tc(fun () -> lists:foldl(fun
(X, A) -> array:set(X, true, A) end, array:new(), L) end))
end)(lists:seq(1,1000000))).
1233424
it uses 9 MB for 1M entries. I upped the size to 100M entries and could
comfortably populate the array in a few minutes on my home PC, resulting
in a data structure of size 123M words (940 MB).
My example of course uses a dense data set, and so it has the smallest
possible representation. The array module handles sparse arrays pretty
well, but if entries tend to be spaced 10 or more apart and are rarely
clustered close to each other, it will use about 10 times as much space
as a dense array, so it might or might not suite you; try it and see.
Uninitialized entries will simply return 'undefined', and there are
traversal functions that will skip uninitialized entries for you.
/Richard
More information about the erlang-questions
mailing list