[erlang-questions] very large key lookup

Tue Sep 26 03:57:54 CEST 2006

On Sep 26, 2006, at 9:29 AM, Bob Ippolito wrote:

> On 9/25/06, Richard A. O'Keefe <ok@REDACTED> wrote:
>> ke han <ke.han@REDACTED> wrote:
>>         We will have hundreds of millions of these 32 byte keys in  
>> the the
>>         server's "DB" and each key's associated 128 byte key.
>>         I am looking for advice on the format for these keys along  
>> with which
>>         libraries to use for balance between memory  and lookup  
>> performance.
>>         The current format is 0-9 + a-z character strings.
>>
>> I note that there are 36 characters in the 0-9 + a-z character set.
>> This means that one can pack 3 of these characters into 2 bytes.
>> So 32 of these characters can be packed easily in 22 bytes, not 32.
>> To put it another way, a key space of /[0-9a-z]{32}/ is equivalent
>> to a key space of about 166 bits.
>>
>> The internal form of these keys could well be Erlang binaries;
>> there doesn't seem to be any reason why the external form should  
>> change.
>>
>> I do wonder, though, why
>> 63,340,286,662,973,277,706,162,286,946,811,886,609,896,461,828,096
>> primary keys are needed; enough to give every human being now alive
>> 10**40 different keys.  Even if we consider every human being who has
>> ever lived, and give each of them 3 million computers, that's still
>> 10**32 each.  This *looks* like far more extra bits than are needed
>> for reliability (error detecting codes) and security.  What am I  
>> missing?
>
> 1> math:log(math:pow(36, 32)) / math:log(2).
> 165.438

thanks for the erlang math tutorial !!  I've yet to dig into the math  
libraries.

>
> Sounds like a cryptographic hash function. SHA-1 fits in 160 bits. MD5
> plus a 32-bit timestamp fits in 160 bits.
>
> I'm guessing the key space is going to be pretty sparse rather than
> serial. Probably easier to insert data into a table like that over a
> cluster since you can pretty much bet against collisions ever
> happening and you don't have to encode a node id into the key.. which
> lets the data can move around pretty freely as the cluster grows.

yes,. these keys are not for your typical sequential numbering of  
records.  They are generated by a new algorithm which I can't say  
much about at the moment.  There is only one piece of data associated  
with each 32 byte key which is a 128 byte value which is used to  
filter out false positives.  My need is for very scalable key lookup;  
I don't need atomic transactions for insert, update, etc...

thanks, ke han

>
> -bob
> _______________________________________________
> erlang-questions mailing list
> erlang-questions@REDACTED
> http://www.erlang.org/mailman/listinfo/erlang-questions