[erlang-questions] Printed list to Erlang function

Richard A. O'Keefe ok@REDACTED
Thu Mar 31 03:59:10 CEST 2016


On 31/03/16 1:45 pm, Jason Douglas wrote:
> Part of the reason I’d expect this, is that as the data set grows in 
> size, the time to access memory should become the dominant factor, as 
> opposed to CPU work (like hashing the key). A hash table with minimal 
> collisions should always be about 1 or 2 memory accesses.
No, no, NO!  A hash table has to access ALL THE MEMORY OF THE KEY.
Suppose a list is a sequence of pairs, a pair just being a head cell and 
a tail
cell with no header, so that a list of N elements takes 2N words plus any
extra space for elements.  Then looking up "1...N" in a map is going to need
*at least*
    1 memory reference to determine that the map is a map
    1 memory reference to determine the size of the map
   2N memory references to compute the key's hash
    1 memory reference to pick up a matched key from a bucket
   4N memory references to check that the key sought and the
     key found are the same
    1 memory reference to pick up the associated value
for a total of 4 + 6N memory references.

If the keys are small integers, replace 2N by 1, for a total of 7.
(This is why I recommended working with Unicode codepoints as
integers yesterday, instead of as lists of hex digits.)

If you use binaries <<"1...N">> instead of lists, that's going to
take something like 4 + ceiling(N/4) words on a 32-bit machine
so we get 16 + N/2 memory references.

If the average word is 6 letters, then we're looking at FORTY
memory references for a list or NINETEEN for a binary and
this is the *best* case.

NOT "1 or 2".  Hash tables are *not* magic.  Key hashing
and comparison take real work, not clairvoyance.

 > A trie on the other hand will typically be /at least/ 1 or 2, often 
many more.

Recall that a hash table has to do 3N character visits:
1N to compute the hash, and 2N to verify equality.
A trie just does 1N character visits; no equality test
at the end is needed.

It's *really* not clear which will win; in any given case I would not
presume to guess but would benchmark.

Also, tries *might* win on space by sharing prefixes.
"food", "fool", "foolish", "foot", "footfall" will all
pass through the *same* nodes as the trie steps
through "foo".

This can be taken further, to Directed Acyclic Word
Graphs (DAWGs; I've always loved that acronym) which
can share suffixes as well as prefixes.  There are also
data structures like suffix arrays.  You'll find a range
of suitable data structures described in information
retrieval textbooks, especially Baeza-Yates.





More information about the erlang-questions mailing list