[erlang-questions] Needed: Great big ordered int set.

Richard Carlsson carlsson.richard@REDACTED
Mon Jul 8 22:31:49 CEST 2013


On 2013-07-08 20:21, Jachym Holecek wrote:
> You could experiment with packing your fixed-width integers into binaries. Keep
> integers sorted so you can bsearch on given binary chunk.

There is no point in using binaries to "pack" numbers. On a 64-bit 
Erlang VM, the majority of those integers will be a fixnum anyway, 
represented by a single 64-bit word. The remaining ones will be encoded 
as bignums, which is just as efficient as using binaries (and both 
simpler and faster).

I tried building an ordered_set ETS table of {N}-tuples, and it used 7 
words per entry. For 500M entries, it would need 28 GB, so if you have 
32 GB, it could work for you. Advantages: simple, and also fast.

I assume that these numbers are something like hashes, so that the 
distribution is completely random, and since the keyspace is 64 bits, 
you can't use tricks like bitmaps. In fact, the array module can be a 
pretty space-efficient representation of this kind of data, and is even 
reasonably fast.

The following (which populates the array with 'true' for all N in the 
list, but doesn't count the lists:seq/1 call as part of the execution 
time) took about 4 seconds on my machine for 1M entries:

1> ((fun(L) -> element(1, timer:tc(fun () -> lists:foldl(fun (X, A) -> 
array:set(X, true, A) end, array:new(), L) end)) 
end)(lists:seq(1,1000000))).
4008000

Taking the data structure size in words instead of the time:

2> ((fun(L) -> erts_debug:flat_size(timer:tc(fun () -> lists:foldl(fun 
(X, A) -> array:set(X, true, A) end, array:new(), L) end)) 
end)(lists:seq(1,1000000))).
1233424

it uses 9 MB for 1M entries. I upped the size to 100M entries and could 
comfortably populate the array in a few minutes on my home PC, resulting 
in a data structure of size 123M words (940 MB).

My example of course uses a dense data set, and so it has the smallest 
possible representation. The array module handles sparse arrays pretty 
well, but if entries tend to be spaced 10 or more apart and are rarely 
clustered close to each other, it will use about 10 times as much space 
as a dense array, so it might or might not suite you; try it and see. 
Uninitialized entries will simply return 'undefined', and there are 
traversal functions that will skip uninitialized entries for you.

     /Richard




More information about the erlang-questions mailing list