[erlang-questions] Warning: erlang:phash2 output
Paulo Sérgio Almeida
psa@REDACTED
Wed May 6 17:51:49 CEST 2009
Hi all,
This is a warning to whoever is using erlang:phash2.
While considering the use of phash2 as a hash function (to use in a
bloom filter) as it is fast, I was horrified at its behaviour, specially
for atoms:
Eshell V5.7.1 (abort with ^G)
1> 1 bsl 27.
134217728
2> erlang:phash2(aaa).
26481
3> erlang:phash2(bbb).
26754
4> erlang:phash2(ccc).
27027
It always returns small numbers, not uniformly distributed in the range
(which is 0..2^27-1 by default). It gets worse:
9> erlang:phash2(a).
97
10> erlang:phash2(b).
98
11> erlang:phash2(c).
99
simple ASCII codes?
This makes it basically unusable. (And I was thinking of partitioning
the resulting bit pattern in two, to get two hashes ... sigh.) I never
used phash2 before, but thought it would be more or less decent.
Or is phash2 not that bad for terms other than atoms? It looks so.
8> erlang:phash2({a}).
35332806
9> erlang:phash2({b}).
59609259
10> erlang:phash2({c}).
76435732
One could define some function, e.g.:
myhash(Term, Range) -> erlang:phash2({Term}, Range).
But it doesn't make one very confident ...
So, what are the alternatives when not many bits in the result are
needed? crypto:sha or crypto:md5 are about 30 times slower ..
Regards,
Paulo Almeida
More information about the erlang-questions
mailing list