[erlang-questions] Printed list to Erlang function
Richard A. O'Keefe
Thu Mar 31 03:59:10 CEST 2016
On 31/03/16 1:45 pm, Jason Douglas wrote:
> Part of the reason I’d expect this, is that as the data set grows in
> size, the time to access memory should become the dominant factor, as
> opposed to CPU work (like hashing the key). A hash table with minimal
> collisions should always be about 1 or 2 memory accesses.
No, no, NO! A hash table has to access ALL THE MEMORY OF THE KEY.
Suppose a list is a sequence of pairs, a pair just being a head cell and
cell with no header, so that a list of N elements takes 2N words plus any
extra space for elements. Then looking up "1...N" in a map is going to need
1 memory reference to determine that the map is a map
1 memory reference to determine the size of the map
2N memory references to compute the key's hash
1 memory reference to pick up a matched key from a bucket
4N memory references to check that the key sought and the
key found are the same
1 memory reference to pick up the associated value
for a total of 4 + 6N memory references.
If the keys are small integers, replace 2N by 1, for a total of 7.
(This is why I recommended working with Unicode codepoints as
integers yesterday, instead of as lists of hex digits.)
If you use binaries <<"1...N">> instead of lists, that's going to
take something like 4 + ceiling(N/4) words on a 32-bit machine
so we get 16 + N/2 memory references.
If the average word is 6 letters, then we're looking at FORTY
memory references for a list or NINETEEN for a binary and
this is the *best* case.
NOT "1 or 2". Hash tables are *not* magic. Key hashing
and comparison take real work, not clairvoyance.
> A trie on the other hand will typically be /at least/ 1 or 2, often
Recall that a hash table has to do 3N character visits:
1N to compute the hash, and 2N to verify equality.
A trie just does 1N character visits; no equality test
at the end is needed.
It's *really* not clear which will win; in any given case I would not
presume to guess but would benchmark.
Also, tries *might* win on space by sharing prefixes.
"food", "fool", "foolish", "foot", "footfall" will all
pass through the *same* nodes as the trie steps
This can be taken further, to Directed Acyclic Word
Graphs (DAWGs; I've always loved that acronym) which
can share suffixes as well as prefixes. There are also
data structures like suffix arrays. You'll find a range
of suitable data structures described in information
retrieval textbooks, especially Baeza-Yates.
More information about the erlang-questions