[erlang-questions] very large key lookup

Tue Sep 26 03:29:05 CEST 2006

On 9/25/06, Richard A. O'Keefe <ok@REDACTED> wrote:
> ke han <ke.han@REDACTED> wrote:
>         We will have hundreds of millions of these 32 byte keys in the the
>         server's "DB" and each key's associated 128 byte key.
>         I am looking for advice on the format for these keys along with which
>         libraries to use for balance between memory  and lookup performance.
>         The current format is 0-9 + a-z character strings.
>
> I note that there are 36 characters in the 0-9 + a-z character set.
> This means that one can pack 3 of these characters into 2 bytes.
> So 32 of these characters can be packed easily in 22 bytes, not 32.
> To put it another way, a key space of /[0-9a-z]{32}/ is equivalent
> to a key space of about 166 bits.
>
> The internal form of these keys could well be Erlang binaries;
> there doesn't seem to be any reason why the external form should change.
>
> I do wonder, though, why
> 63,340,286,662,973,277,706,162,286,946,811,886,609,896,461,828,096
> primary keys are needed; enough to give every human being now alive
> 10**40 different keys.  Even if we consider every human being who has
> ever lived, and give each of them 3 million computers, that's still
> 10**32 each.  This *looks* like far more extra bits than are needed
> for reliability (error detecting codes) and security.  What am I missing?

1> math:log(math:pow(36, 32)) / math:log(2).
165.438

Sounds like a cryptographic hash function. SHA-1 fits in 160 bits. MD5
plus a 32-bit timestamp fits in 160 bits.

I'm guessing the key space is going to be pretty sparse rather than
serial. Probably easier to insert data into a table like that over a
cluster since you can pretty much bet against collisions ever
happening and you don't have to encode a node id into the key.. which
lets the data can move around pretty freely as the cluster grows.

-bob