[erlang-questions] A bug? - bloom filters - and loadable BIFs

Paulo Sérgio Almeida psa@REDACTED
Wed Apr 29 12:43:07 CEST 2009


Hi all,

Joe Armstrong wrote:

> So he sent me a program, which crashes his machine (out of memory)
> which he thought
> should work - to my eye it looks ok - It will not be blindingly fast,
> but it looks properly
> tail-recursive so the garbage collector should not have any problems.
> 
> I've appended at the end of this mail - so you can see for yourself.
> It is a GC bug
> or a user error?

I have looked at the program. It basically uses a single large binary, 
which means that in each insertion K (for K hashes) new large binaries 
are generated. Each one will be fully copied (right? this would be an 
interesting case for aggressive optimizations in binary handling).

One easy improvement would be to use a variant of bloom filters that 
uses K bitvectors instead of one single vector. But it would not be 
enough. As Joe says:

> (we want to flip specific bits) - the "pure" answer I guess is a tuple
> of tuples of binaries
> (say 32x32xB) where B is a 128 byte binary for 2^20 bits.
> 
> The "tuple of tuple of tuple ..." representation is commonly used to
> implement array and dicts

something like the above. But I would add one more twist. Use an array 
not of binaries but of integers. I have once implemented a data type for 
representing sets with bitmaps. (There I used ets instead of array, but 
that is orthogonal.) First I used binaries but later I switched to 
integers, exploiting the fact that erlang gives us arbitrary size 
integers. It has some advantages:

- the space overhead is smaller than for binaries (specially for 
smallish entries).
- for structures that may contain many zeroes, like in this case, a 
single word for each 0 would be used. This means we could overdimension 
the filter with less impact if it was very little used (again I am 
thinking about small entries in the array, much less that the 128 bytes 
example).

If I have time I will see If I can write a new version. Better, there is 
a nice variant "Scalable Bloom Filters" [1] (shameless self promotion 
:)) that can scale arbitrarily without forcing us dimension the filter. 
It can be trivially implemented on top of a standard one (the 
bidimensional variant).

Regards,
Paulo

1.
Scalable Bloom Filters
Information Processing Letters
Volume 101, Issue 6, 31 March 2007, Pages 255-261



More information about the erlang-questions mailing list