[erlang-questions] program: Bloom filter (probabilistic lookup table) --- performance figures needed

Mon Jan 21 21:17:31 CET 2008

Hello all,

I've implemented Bloom filter (http://en.wikipedia.org/wiki/ 
Bloom_filter) as a little excercise on R12B's bitstrings. I made few  
versions to see how to make them perform and how to not get warnings  
about unoptimised bitstring usage (ok, inserting key gives warning,  
but that's not that interesting, looking up keys is). My compliments  
for the new Erlang Efficiency Guide and for the bitstring syntax  
support to all involved parties!

I made some tests to see how Bloom filter performs when compared  
against ets in looking up inserted keys. I've got Mac Mini Core Duo  
so I can't see how native compiled Bloom filter would do compared to  
ets. With interpreted code Bloom filter is mostly slower with more  
than 1 hash (k > 1). Unfortunately the false positive rate is rather  
high (depending on key population and number of bits in array, of  
course) in that case.

So if native compilation would boost hash calculation (seems to be  
the bottleneck) several times BloomIER filter could be usable for  
example doing number analyser (if false positives are taken into  
account --- see wikipedia article). Changing erlang:phash2 to a more  
specific set of hash functions could also make difference. Or  
dynamically generate and compile code to match all bits at a time  
(would require sorting offsets and calculating their intervals ---  
perhaps slower).

Anyways, if somebody with platform supporting native compilation  
could report some numbers, I'd be grateful. The test reports the size  
of the bit array (M), number of keys inserted (N), number of hashes  
(bits) per key (K), false positives (FP) --- also in percentages,  
inserting keys into Bloom filter in microseconds (IB), inserting keys  
into ets table (IE), looking up keys from Bloom filter (LB) and  
finally looking up keys from ets table (LE). Times in microseconds.  
The generated test sets could be memoized for better testing  
performance but I couldn't bother.

Here's some number I got:
M=2048 N=256 K=1 FP 10537/89744 (11.74%) IB 600 IE 545 LB 134 LE 162
M=8192 N=256 K=4 FP 17/89744 (0.02%) IB 6149 IE 486 LB 386 LE 156
M=8192 N=1024 K=8 FP 2413/88976 (2.71%) IB 50451 IE 1069 LB 2636 LE 622
M=65536 N=4096 K=2 FP 1165/85904 (1.36%) IB 354567 IE 4950 LB 3460 LE  
2792
M=65536 N=8192 K=8 FP 2112/81808 (2.58%) IB 2124279 IE 9471 LB 21683  
LE 5429
M=131072 N=8192 K=2 FP 1147/81808 (1.40%) IB 1090229 IE 9164 LB 6729  
LE 5413
M=131072 N=8192 K=10 FP 75/81808 (0.09%) IB 5092386 IE 9710 LB 26914  
LE 5566

Here's the code (public domain --- do what you wish):
-------------- next part --------------
A non-text attachment was scrubbed...
Name: test_bloom_filter.erl
Type: application/octet-stream
Size: 3392 bytes
Desc: not available
URL: <http://erlang.org/pipermail/erlang-questions/attachments/20080121/d7dc584f/attachment.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: bloom_filter2.erl
Type: application/octet-stream
Size: 3853 bytes
Desc: not available
URL: <http://erlang.org/pipermail/erlang-questions/attachments/20080121/d7dc584f/attachment-0001.obj>
-------------- next part --------------

Cheers,
Jani L.