Long string to short ID

Sat Aug 14 20:37:28 CEST 2021

   I think there's a question of the use-case here.
Machine-processable IDs (your "ygjre3*)7x?") are generally designed to
be *unique* in some way, often *unguessable*, and sometimes
*decentralised*. All three of these constraints tend to increase the
amount of information the ID needs to contain (i.e. the number of bits
it encodes). If you're only worrying about a single author, and don't
need unguessable or decentralised, then 8 bits (two hex characters,
so 256 distinct IDs) is sufficient for all except the likes of Isaac
Asimov and Barbara Cartland, who would need another couple of bits
each. It's fairly easy
to come up with schemes that encode those few bits into something
readable (say, pick 16 consonants (4 bits) and your favourite 4 vowels
(2 bits), and you can produce a generally pronounceable four-letter
word with 12 bits of information in it by alternating consonants and
vowels).
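
   As a rough sketch of that consonant/vowel scheme (in Python, purely
for illustration): the particular 16 consonants and 4 vowels below, and
the names encode_id and decode_id, are arbitrary choices made for this
example, not a standard of any kind.

```python
CONSONANTS = "bdfghjklmnprstvz"  # 16 letters -> 4 bits each (arbitrary pick)
VOWELS = "aeio"                  # 4 letters  -> 2 bits each (arbitrary pick)


def encode_id(n: int) -> str:
    """Encode a 12-bit number as consonant-vowel-consonant-vowel."""
    if not 0 <= n < 4096:
        raise ValueError("only 12 bits (0..4095) fit in four letters")
    return (
        CONSONANTS[(n >> 8) & 0xF]    # top 4 bits
        + VOWELS[(n >> 6) & 0x3]      # next 2 bits
        + CONSONANTS[(n >> 2) & 0xF]  # next 4 bits
        + VOWELS[n & 0x3]             # bottom 2 bits
    )


def decode_id(word: str) -> int:
    """Invert encode_id; raises ValueError on anything it did not produce."""
    if len(word) != 4:
        raise ValueError("expected a four-letter word")
    c1 = CONSONANTS.index(word[0])
    v1 = VOWELS.index(word[1])
    c2 = CONSONANTS.index(word[2])
    v2 = VOWELS.index(word[3])
    return (c1 << 8) | (v1 << 6) | (c2 << 2) | v2
```

   With these alphabets, ID 0 comes out as "baba" and ID 4095 as
"zozo". Nothing here filters out words you might not want to show to a
user, which a real scheme would probably need to think about.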

   When your scope grows, and your constraints come in, then the IDs
need to have more bits, and we tend to throw away readability in
favour of compactness, often by things like base-64 encoding, which is
6 bits per character. This generally doesn't matter, because by that
point, the IDs are only meant for machines, and the machine can put a
nice label on the thing being identified in the user interface. So you
get a list of books with their authors, and a button by each book that
takes you to the details of that book. Only the machine needs to know
that that book is "o34rdh2493fh8" and that clicking the button takes
you to the information for "o34rdh2493fh8". All the user knows is that
they're seeing "How to Make Long Book Titles and Annoy People" by Joe
Q. Author.
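
   To put numbers on the compactness point, here is a small sketch of
how many characters a given number of bits needs at 4 bits per
character (hex) versus 6 (base-64). The helper name chars_needed is
made up for this example; secrets.token_urlsafe is from the Python
standard library and returns unpadded URL-safe base-64 text.

```python
import math
import secrets


def chars_needed(bits: int, bits_per_char: int) -> int:
    """Characters required to carry `bits` bits of information."""
    return math.ceil(bits / bits_per_char)


hex_len = chars_needed(128, 4)  # 128 bits written as hex
b64_len = chars_needed(128, 6)  # the same 128 bits written as base-64

# An opaque, machine-only ID: 16 random bytes (128 bits) as base-64 text.
opaque_id = secrets.token_urlsafe(16)
```

   Under these assumptions, 128 bits take 32 hex characters but only
22 base-64 characters, and neither is something a user should have to
read.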

   The translation layer between the two isn't generally going to be a
problem -- if it's a small one-user system, then that lookup table is
basically lost in the noise. If it's for the whole history of an
entire publishing house, then it goes in a database (and is probably
still lost in the noise for their whole data storage).
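
   A back-of-envelope estimate of that lookup table, as a sketch only:
the 16-byte ID and the 60-byte average title are assumed figures picked
for illustration, and the calculation ignores whatever overhead the
container or database adds on top.

```python
def lookup_table_bytes(n_books: int,
                       id_bytes: int = 16,
                       avg_title_bytes: int = 60) -> int:
    """Raw payload size of an ID -> title table (assumed sizes, no overhead)."""
    return n_books * (id_bytes + avg_title_bytes)


one_author = lookup_table_bytes(1_000)            # a very prolific author
publishing_house = lookup_table_bytes(1_000_000)  # a large back catalogue
```

   With those assumed sizes, a thousand books come to about 76 kB and a
million books to about 76 MB of raw ID and title data.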

   Personally, I'd mint a (nearly) guaranteed unique ID like a UUIDv4
from the outset, and use that as a key in all parts of the system to
link any information to a book (like, say, its text, or the database
of rejection letters). Then never expose that directly to the user in
the UI -- although it may leak in the form of web links, say, but
those aren't meant to be human readable anyway.
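
   A minimal sketch of that approach, using Python's standard uuid
module. The in-memory dicts and the names add_book and display_label
are hypothetical stand-ins for whatever storage and UI layer the real
system has.

```python
import uuid

# Every table is keyed by the same opaque book ID.
books = {}              # book_id -> title
rejection_letters = {}  # book_id -> list of letters


def add_book(title: str) -> str:
    """Mint a UUIDv4 for a new book and register it under that key."""
    book_id = str(uuid.uuid4())
    books[book_id] = title
    rejection_letters[book_id] = []
    return book_id


def display_label(book_id: str) -> str:
    """What the user sees: the title, never the ID itself."""
    return books[book_id]


book_id = add_book("How to Make Long Book Titles and Annoy People")
rejection_letters[book_id].append("Dear Joe, thank you for your submission")
```

   Two books that happen to share a title still get different IDs,
which is one of the things a title-based ID cannot give you.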

   That opinion holds regardless of the size of the system. It's
possible to get away with funky things like IDs based on the title,
provided you've got a small system and you want the users to be
reading them, but (a) there's always a bigger user than you expected,
and (b) the *users* don't want to be reading your IDs. :)

   Hugo.

On Sat, Aug 14, 2021 at 12:05:27PM -0400, Lloyd R. Prentice wrote:
> Hi Michael,
> 
> You hint at an interesting line of thought:
> 
> The question is, why does “My Long and Fascinating Book Title” feel like such an awkward and inefficient ID in a computer application focusing on books? After all, it is perfectly fine as a signifier of the physical or electronic object in human discourse.
> 
> So why is it more awkward and inefficient than “ygjre3*)7x?” when used as an ID in computer code? After all, it would in some sense make the code more readable to humans.
> 
> Unless I’m missing something, it seems that it’s only more inefficient in the sense that it consumes more memory in RAM and persistent storage and, arguably, processing of the string itself.
> 
> So, we program some kind of translation layer that associates “My Long and Fascinating Book Title” with “ygjre3*)7x?” — a proplist or some such. Now humans get to consume the literal title and the machine gets the weird, presumably more efficient, string.
> 
> The question then becomes, how much memory does the translation layer consume and how much latency is involved in the translation process? In other words, how many book titles have to be entered into the system before the costs of the translation layer are amortized?
> 
> At this point my head hurts, but it seems that there is some application specific number N where less than N books justifies using the book title itself as the ID.
> 
> Am I missing something?
> 
> All the best,
> 
> LRP
> 
> > On Aug 14, 2021, at 10:12 AM, Michael P. <empro2@REDACTED> wrote:
> > On Fri, 13 Aug 2021 15:44:29 -0400
> > "Lloyd R. Prentice" <lloyd@REDACTED> wrote:
> > 
> >> What might be a nifty way to turn a long book title with spaces into a short human-readable ID?
> > 
> > Two observations:
> > 
> > Anything too nifty will, sooner or later, put a hole in one's foot.
> > 
> > Keeping the beginning helps evoke a context in one's mind in which
> > a following, nifty brixngnaxl may be meaningfully interpreted.
> > 
> > Examples:
> > 
> >    $ ls
> >    verse.tex    verseses.tex
> > 
> > I do not remember what I meant "verseses" to mean. (sounds Gollumic ...)
> > Here I have obviously niftied myself in the foot.
> > 
> >    $ ls fertig
> >    aquestionofmust.tex    hinterdg.tex
> > 
> > Simple omission of space (and no capitals).
> > + kept head and acronymic tail: hinterdg -> Hinter den Grenzen
> > 
> > But it all depends on what ID means here
> > and what is considered "human-readable".
> > And why the title is no human-readable ID,
> > and why a human needs to read any other ID,
> > why the machine cannot map any kind of ID
> > to the title for the human.
> > 
> > 
> >> focus on two most significant words in the title
> > 
> > Significance depends on context even in a single human being;
> > and context depends on time and all the rest of the "situation".
> > See the foot-holing example above.
> > 
> > Automating "significance" might require one to wait until
> > androids do not dream of electric sheep anymore ...
> > 
> > 
> > ~M
> > 

-- 
Hugo Mills             | Have found Lost City of Atlantis. High Priest is
hugo@REDACTED carfax.org.uk | winning at quoits.
http://carfax.org.uk/  |
PGP: E2AB1DE4          |                                       Terry Pratchett