[erlang-questions] A bug? - bloom filters - and loadable BIFs
Paulo Sérgio Almeida
psa@REDACTED
Wed Apr 29 12:43:07 CEST 2009
Hi all,
Joe Armstrong wrote:
> So he sent me a program, which crashes his machine (out of memory)
> which he thought
> should work - to my eye it looks ok - It will not be blindingly fast,
> but it looks properly
> tail-recursive so the garbage collector should not have any problems.
>
> I've appended at the end of this mail - so you can see for yourself.
> It is a GC bug
> or a user error?
I have looked at the program. It basically uses a single large binary,
which means that in each insertion K (for K hashes) new large binaries
are generated. Each one will be fully copied (right? this would be an
interesting case for aggressive optimizations in binary handling).
One easy improvement would be to use a variant of bloom filters that
uses K bitvectors instead of one single vector. But it would not be
enough. As Joe says:
> (we want to flip specific bits) - the "pure" answer I guess is a tuple
> of tuples of binaries
> (say 32x32xB) where B is a 128 byte binary for 2^20 bits.
>
> The "tuple of tuple of tuple ..." representation is commonly used to
> implement array and dicts
something like the above. But I would add one more twist. Use an array
not of binaries but of integers. I have once implemented a data type for
representing sets with bitmaps. (There I used ets instead of array, but
that is orthogonal.) First I used binaries but later I switched to
integers, exploiting the fact that erlang gives us arbitrary size
integers. It has some advantages:
- the space overhead is smaller than for binaries (specially for
smallish entries).
- for structures that may contain many zeroes, like in this case, a
single word for each 0 would be used. This means we could overdimension
the filter with less impact if it was very little used (again I am
thinking about small entries in the array, much less that the 128 bytes
example).
If I have time I will see If I can write a new version. Better, there is
a nice variant "Scalable Bloom Filters" [1] (shameless self promotion
:)) that can scale arbitrarily without forcing us dimension the filter.
It can be trivially implemented on top of a standard one (the
bidimensional variant).
Regards,
Paulo
1.
Scalable Bloom Filters
Information Processing Letters
Volume 101, Issue 6, 31 March 2007, Pages 255-261
More information about the erlang-questions
mailing list