[erlang-questions] A bug? - bloom filters - and loadable BIFs

Wed Apr 29 11:25:54 CEST 2009

Joe Armstrong writes:
 > Hi,
 > 
 > I recent got mail from Steve Kirsch (Hello Steve) - asking a number
 > of questions about representation of bit vectors for writing Bloom filters.
...
 > So he sent me a program, which crashes his machine (out of memory)
...
 > new(N, E) when N > 0, is_float(E), E > 0, E =< 1 ->
 >    {M, K} = calc_least_bits(N, E),
 >    #bloom{m=M, bitmap = <<0:((M+7) div 8 * 8)>>, k=K, n=N}.

Large bitmaps are problematic in Erlang.

You can do chunked representations, which will work but will incur
high overheads on both reads and writes.

Using large binaries as the code above does is fragile and inefficient.
Large binaries are allocated outside of the GCd heaps in a global shared
area and are accessed via small heap-allocated and GCd handles. Frequent
updates will require frequent new allocations and copies in the shared
area. Old versions of these binaries are only deallocated in response to
local (per-process) GCs when their handles die.

So creating a series of large binaries in rapid succession without
intervening GCs is a fairly reliable way of running out of memory.

HiPE needs large bitmaps for its register allocator, so we added our
own bitarray BIFs since all alternatives suck. As much as _I_ like them,
I cannot recommend them for application developers:
- they're only present in HiPE-enabled systems
- they're not documented or official in any way, and may change at any time
- the effect of sending a bitarray to another process is unspecified
  (the processes may or may not share the object)
- Erlang is _supposed_ to protect application programmers from the perils
  of mutable data :-/