[erlang-questions] Warning: erlang:phash2 output

Paulo Sérgio Almeida psa@REDACTED
Wed May 6 17:51:49 CEST 2009


Hi all,

This is a warning to whoever is using erlang:phash2.

While considering the use of phash2 as a hash function (to use in a 
bloom filter) as it is fast, I was horrified at its behaviour, specially 
for atoms:

Eshell V5.7.1  (abort with ^G)
1> 1 bsl 27.
134217728
2> erlang:phash2(aaa).
26481
3> erlang:phash2(bbb).
26754
4> erlang:phash2(ccc).
27027

It always returns small numbers, not uniformly distributed in the range 
(which is 0..2^27-1 by default). It gets worse:

9> erlang:phash2(a).
97
10> erlang:phash2(b).
98
11> erlang:phash2(c).
99

simple ASCII codes?

This makes it basically unusable. (And I was thinking of partitioning 
the resulting bit pattern in two, to get two hashes ... sigh.) I never 
used phash2 before, but thought it would be more or less decent.

Or is phash2 not that bad for terms other than atoms? It looks so.

8> erlang:phash2({a}).
35332806
9> erlang:phash2({b}).
59609259
10> erlang:phash2({c}).
76435732

One could define some function, e.g.:

myhash(Term, Range) -> erlang:phash2({Term}, Range).

But it doesn't make one very confident ...

So, what are the alternatives when not many bits in the result are 
needed? crypto:sha or crypto:md5 are about 30 times slower ..

Regards,
Paulo Almeida



More information about the erlang-questions mailing list