Long string to short ID

Sat Aug 14 21:32:56 CEST 2021

Hi Hugo,

It’s interesting how what looks like a detail from a distance turns out to be fraught with deep implications. When dealing with book titles we have to think about everything you’ve suggested, as well as how we’ll represent them as file names.

I’m preserving your post in my lab notes file.

You and others have given me much to consider and I’m vastly grateful.

I’m almost finished with the stop-word filter we discussed a while back. Happy to share the code with anyone interested and, certainly, to learn from critiques.

All the best,

LRP

> On Aug 14, 2021, at 2:38 PM, Hugo Mills <hugo@REDACTED> wrote:
> 
>    I think there's a question of the use-case here.
> Machine-processable IDs (your "ygjre3*)7x?") are generally designed to
> be *unique* in some way, often *unguessable*, and sometimes
> *decentralised*. All three of these constraints tend to increase the
> amount of information the ID needs to contain (i.e. the number of bits
> it encodes). If you're only worrying about a single author, and don't
> need unguessable or decentralised, then 8 bits (two hex characters) is
> sufficient for all except the likes of Isaac Asimov and Barbara
> Cartland, who would need another couple of bits each. It's fairly easy
> to come up with schemes that encode those few bits into something
> readable (say, pick 16 consonants (4 bits) and your favourite 4 vowels
> (2 bits), and you can produce a generally pronounceable four-letter
> word with 12 bits of information in it by alternating consonants and
> vowels).
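
A minimal sketch of the consonant/vowel scheme described above, assuming one particular choice of 16 consonants and 4 vowels (the alphabets and the function names encode12/decode12 are illustrative, not from the thread). Each consonant carries 4 bits and each vowel 2 bits, so a consonant-vowel-consonant-vowel word holds 12 bits:

```python
# Hypothetical alphabets: any 16 distinct consonants and 4 distinct vowels work.
CONSONANTS = "bdfgklmnprstvwyz"  # 16 symbols -> 4 bits each
VOWELS = "aeio"                  # 4 symbols  -> 2 bits each


def encode12(n: int) -> str:
    """Encode an integer in [0, 4096) as a consonant-vowel-consonant-vowel word."""
    if not 0 <= n < (1 << 12):
        raise ValueError("expected a 12-bit value")
    c1 = CONSONANTS[(n >> 8) & 0xF]  # top 4 bits
    v1 = VOWELS[(n >> 6) & 0x3]      # next 2 bits
    c2 = CONSONANTS[(n >> 2) & 0xF]  # next 4 bits
    v2 = VOWELS[n & 0x3]             # low 2 bits
    return c1 + v1 + c2 + v2


def decode12(word: str) -> int:
    """Invert encode12; raises ValueError on letters outside the alphabets."""
    c1, v1, c2, v2 = word
    return (
        (CONSONANTS.index(c1) << 8)
        | (VOWELS.index(v1) << 6)
        | (CONSONANTS.index(c2) << 2)
        | VOWELS.index(v2)
    )
```

With these alphabets, book number 0 becomes "baba" and book number 4095 becomes "zozo"; every value in between maps to a distinct four-letter word.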
> 
>   When your scope grows, and your constraints come in, then the IDs
> need to have more bits, and we tend to throw away readability in
> favour of compactness, often by things like base-64 encoding, which is
> 6 bits per character. This generally doesn't matter, because by that
> point, the IDs are only meant for machines, and the machine can put a
> nice label on the thing being identified in the user interface. So you
> get a list of books with their authors, and a button by each book that
> takes you to the details of that book. Only the machine needs to know
> that that book is "o34rdh2493fh8" and that clicking the button takes
> you to the information for "o34rdh2493fh8". All the user knows is that
> they're seeing "How to Make Long Book Titles and Annoy People" by Joe
> Q. Author.
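
To make the compactness trade-off concrete, here is a small sketch comparing hex (4 bits per character) with base-64 (6 bits per character). The 128-bit size is only an assumption chosen for illustration, and chars_needed is a hypothetical helper:

```python
import base64
import math
import secrets


def chars_needed(bits: int, bits_per_char: int) -> int:
    """Characters required to carry `bits` at a given encoding density."""
    return math.ceil(bits / bits_per_char)


# 128 random bits, rendered two ways.
raw = secrets.token_bytes(16)
as_hex = raw.hex()
# URL-safe base-64 with the "=" padding stripped.
as_b64 = base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")

print(len(as_hex), as_hex)
print(len(as_b64), as_b64)
```

The hex form takes 32 characters and the base-64 form 22, which matches chars_needed(128, 4) and chars_needed(128, 6). Neither is something a person would want to read, which is the point being made above.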
> 
>   The translation layer between the two isn't generally going to be a
> problem -- if it's a small one-user system, then that lookup table is
> basically lost in the noise. If it's for the whole history of an
> entire publishing house, then it goes in a database (and is probably
> still lost in the noise for their whole data storage).
> 
>   Personally, I'd mint a (nearly) guaranteed unique ID like a UUIDv4
> from the outset, and use that as a key in all parts of the system to
> link any information to a book (like, say, its text, or the database
> of rejection letters). Then never expose that directly to the user in
> the UI -- although it may leak in the form of web links, say, but
> those aren't meant to be human readable anyway.
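
A rough sketch of that approach, assuming in-memory dicts stand in for whatever storage the real system uses (the names books, rejections, add_book and label are hypothetical). The UUIDv4 is minted once and used as the key everywhere; the user only ever sees the label:

```python
import uuid

books = {}       # book_id -> metadata shown to the user
rejections = {}  # book_id -> related records, keyed by the same ID


def add_book(title: str, author: str) -> str:
    """Mint a random (version 4) UUID and register the book under it."""
    book_id = str(uuid.uuid4())
    books[book_id] = {"title": title, "author": author}
    rejections[book_id] = []
    return book_id


def label(book_id: str) -> str:
    """The translation layer: machine ID in, human-readable label out."""
    book = books[book_id]
    return f"{book['title']} by {book['author']}"


book_id = add_book("How to Make Long Book Titles and Annoy People", "Joe Q. Author")
rejections[book_id].append("Not for us, thanks.")
print(label(book_id))
```

Because the key does not depend on the title, two books with the same title get different IDs, and retitling a book does not disturb anything linked to it.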
> 
>   That opinion holds regardless of the size of the system. It's
> possible to get away with funky things like IDs based on the title,
> provided you've got a small system and you want the users to be
> reading them, but (a) there's always a bigger user than you expected,
> and (b) the *users* don't want to be reading your IDs. :)
> 
>   Hugo.
> 
>> On Sat, Aug 14, 2021 at 12:05:27PM -0400, Lloyd R. Prentice wrote:
>> Hi Michael,
>> 
>> You hint at an interesting line of thought:
>> 
>> The question is, why does “My Long and Fascinating Book Title” feel like such an awkward and inefficient ID in a computer application focusing on books? After all, it is perfectly fine as a signifier of the physical or electronic object in human discourse.
>> 
>> So why is it more awkward and inefficient than “ygjre3*)7x?” when used as an ID in computer code? After all, it would in some sense make the code more readable to humans.
>> 
>> Unless I’m missing something, it seems that it’s only more inefficient in the sense that it consumes more memory in RAM and persistent storage and, arguably, processing of the string itself.
>> 
>> So, we program some kind of translation layer that associates “My Long and Fascinating Book Title” with “ygjre3*)7x?” — a proplist or some such. Now humans get to consume the literal title and the machine gets the weird, presumably more efficient, string.
>> 
>> The question then becomes, how much memory does the translation layer consume and how much latency is involved in the translation process? In other words, how many book titles have to be entered into the system before the costs of the translation layer are amortized?
>> 
>> At this point my head hurts, but it seems that there is some application-specific number N where fewer than N books justifies using the book title itself as the ID.
>> 
>> Am I missing something?
>> 
>> All the best,
>> 
>> LRP
>> 
>>> On Aug 14, 2021, at 10:12 AM, Michael P. <empro2@REDACTED> wrote:
>>> On Fri, 13 Aug 2021 15:44:29 -0400
>>> "Lloyd R. Prentice" <lloyd@REDACTED> wrote:
>>> 
>>>> What might be a nifty way to turn a long book title with spaces into a short human-readable ID?
>>> 
>>> Two observations:
>>> 
>>> Anything too nifty will, sooner or later, put a hole in one's foot.
>>> 
>>> Keeping the beginning helps evoke a context in one's mind in which
>>> a following, nifty brixngnaxl may be meaningfully interpreted.
>>> 
>>> Examples:
>>> 
>>>   $ ls
>>>   verse.tex    verseses.tex
>>> 
>>> I do not remember what I meant "verseses" to mean. (sounds Gollumic ...)
>>> Here I have obviously niftied myself in the foot.
>>> 
>>>   $ ls fertig
>>>   aquestionofmust.tex    hinterdg.tex
>>> 
>>> Simple omission of space (and no capitals).
>>> + kept head and acronymic tail: hinterdg -> Hinter den Grenzen
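
The two naming habits shown in that listing can be sketched as follows. This is only a guess at the rules from the two examples; the function names squash and head_and_acronym are made up for illustration:

```python
def _words(title: str) -> list[str]:
    """Lowercase the title and keep only the alphanumeric part of each word."""
    cleaned = ("".join(ch for ch in word if ch.isalnum())
               for word in title.lower().split())
    return [word for word in cleaned if word]


def squash(title: str) -> str:
    """Simple omission of spaces, no capitals."""
    return "".join(_words(title))


def head_and_acronym(title: str) -> str:
    """Keep the first word whole, then append the initials of the rest."""
    words = _words(title)
    if not words:
        return ""
    return words[0] + "".join(word[0] for word in words[1:])


print(squash("A Question of Must"))            # aquestionofmust
print(head_and_acronym("Hinter den Grenzen"))  # hinterdg
```

Neither function guarantees uniqueness: two titles that share a first word and the same initials collapse to the same name, so a real system would still need a collision check.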
>>> 
>>> But it all depends on what ID means here
>>> and what is considered "human-readable".
>>> And why the title is no human-readable ID,
>>> and why a human needs to read any other ID,
>>> why the machine cannot map any kind of ID
>>> to the title for the human.
>>> 
>>> 
>>>> focus on two most significant words in the title
>>> 
>>> Significance depends on context even in a single human being;
>>> and context depends on time and all the rest of the "situation".
>>> See the foot-holing example above.
>>> 
>>> Automating "significance" might require one to wait until
>>> androids do not dream of electric sheep anymore ...
>>> 
>>> 
>>> ~M
>>> 
> 
> -- 
> Hugo Mills             | Have found Lost City of Atlantis. High Priest is
> hugo@REDACTED carfax.org.uk | winning at quoits.
> http://carfax.org.uk/  |
> PGP: E2AB1DE4          |                                       Terry Pratchett