Long string to short ID

Lloyd R. Prentice lloyd@REDACTED
Fri Aug 13 22:34:07 CEST 2021


Thanks all,

Hugo, I like your third idea. I've been thinking about programming a stop-word filtering function anyway. Plus, in my use case all of the books are owned by the author, so uniqueness is unlikely to be a problem.
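A rough sketch of how the stop-word suggestion quoted below (drop stop words, vowels and spaces, then truncate) might look in Erlang. The module name short_id, the function squeeze/2, the stop-word list and the lowercasing step are all assumptions made for illustration, not something settled in this thread.

```erlang
-module(short_id).
-export([squeeze/2]).

%% Hypothetical stop-word list; extend it to suit the titles at hand.
-define(STOP_WORDS, ["the", "a", "an", "in", "on", "of", "and", "to"]).

%% squeeze(Title, MaxLen): lowercase the title, drop stop words,
%% strip vowels from the remaining words, concatenate them, and
%% keep at most MaxLen characters.
squeeze(Title, MaxLen) ->
    Words = string:lexemes(string:lowercase(Title), " "),
    Kept = [W || W <- Words, not lists:member(W, ?STOP_WORDS)],
    NoVowels = [C || W <- Kept, C <- W, not lists:member(C, "aeiou")],
    lists:sublist(NoVowels, MaxLen).
```

With these assumptions, short_id:squeeze("The Wind in the Willows", 6) gives "wndwll". As noted in the quoted message, nothing here guarantees uniqueness.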

I can't use ISBNs, since the IDs are for books under development. But I will definitely use them in other parts of my application.

I did program one idea:

%% Join the First-th and Second-th words of the title (1-based) with "_".
make_id(String, First, Second) ->
   List = string:tokens(String, " "),
   F = lists:nth(First, List),
   S = lists:nth(Second, List),
   F ++ "_" ++ S.

%% Use only the First-th word of the title.
make_id(String, First) ->
   List = string:tokens(String, " "),
   lists:nth(First, List).

It nicely fulfills the short and readable criteria and lets me focus on the two most significant words in the title, but I can't see a way to automate the assignment of values to First and Second. So I played with just selecting the first word or the first two words in the title, but that makes me uncomfortable.

%% Use the first two words of the title, or the only word if there is
%% just one. Crashes on a title with no words at all.
make_id(String) ->
   case string:tokens(String, " ") of
      [F, S | _] -> F ++ "_" ++ S;
      [F] -> F
   end.
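One possible way to automate the choice of First and Second, offered only as a sketch: filter out stop words first, then take the first two words that remain. The module name title_id, the stop-word list and the lowercasing are assumptions for illustration; string:lexemes/2 is used here as the newer counterpart of string:tokens/2.

```erlang
-module(title_id).
-export([make_id/1]).

%% Hypothetical stop-word list, not taken from the thread.
stop_words() ->
    ["the", "a", "an", "in", "on", "of", "and", "to", "for"].

%% make_id(Title): lowercase the title, drop stop words, and join the
%% first two remaining words with "_". A title made up only of stop
%% words falls back to its first two words unfiltered.
make_id(Title) ->
    Words = string:lexemes(string:lowercase(Title), " "),
    StopWords = stop_words(),
    case [W || W <- Words, not lists:member(W, StopWords)] of
        [F, S | _] -> F ++ "_" ++ S;
        [F]        -> F;
        []         -> lists:flatten(lists:join("_", lists:sublist(Words, 2)))
    end.
```

Under these assumptions, title_id:make_id("The Wind in the Willows") returns "wind_willows" and title_id:make_id("Dune") returns "dune". Whether the first two non-stop words really are the most significant ones still depends on the title.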

Best wishes. I much appreciate the help.

LRP

On Fri, Aug 13, 2021, at 4:19 PM, Hugo Mills wrote:
> On Fri, Aug 13, 2021 at 03:44:29PM -0400, Lloyd R. Prentice wrote:
> > Hello,
> > 
> > What might be a nifty way to turn a long book title with spaces into a short human-readable ID?
> 
>    Depends rather on what purpose you want to put this ID to.
> 
>    One solution would be to hash it (with, say, sha256). If the hash is
> too long for "short", truncate it. Note that this is not a
> globally-unique value, as there are lots of books with identical
> titles.
> 
>    If you want a globally unique identifier for printed books, then
> ISBN is a reasonable one to use -- it's not precisely unique (there
> have been errors assigning the same ISBN to two different books, for
> example), but it's pretty good for most purposes.
> 
>    If you want an actual globally unique identifier, then some form of
> UUID would do the job (UUIDv4 is the easiest). Alternatively, you
> could register a DOI prefix and assign numbers inside your own
> numberspace within the DOI system.
> 
>    If you want something vaguely human-readable, try dropping all the
> stop-words (the, a, an, in, on, ...), all the vowels and all the
> spaces. Truncate at whatever your idea of "short" is. Like the hashing
> approach, it's not unique in the slightest.
> 
>    It all depends on your use-case.
> 
>    Hugo.
> 
> -- 
> Hugo Mills             | Great films about cricket: Interview with the Umpire
> hugo@REDACTED carfax.org.uk |
> http://carfax.org.uk/  |
> PGP: E2AB1DE4          |
> 