Long string to short ID
Lloyd R. Prentice
lloyd@REDACTED
Fri Aug 13 22:34:07 CEST 2021
Thanks all,
Hugo, I like your third idea. I've been thinking about programming a stop word filtering function anyway. Plus, in my use case all of the books are owned by the author so uniqueness is unlikely to be a problem.
I can't use ISBNs, since the IDs are for books under development. But I will definitely use them in other parts of my application.
I did program one idea:
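%% Build an ID from the words at positions First and Second (1-based).
%% lists:nth/2 fails if a position is past the end of the title.
make_id(String, First, Second) ->
    List = string:tokens(String, " "),
    F = lists:nth(First, List),
    S = lists:nth(Second, List),
    F ++ "_" ++ S.

%% Build an ID from the single word at position First (1-based).
make_id(String, First) ->
    List = string:tokens(String, " "),
    lists:nth(First, List).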
It nicely fulfills the short and readable criteria and lets me focus on the two most significant words in the title, but I can't see a way to automate the assignment of values to First and Second. So I played with just selecting the first word, or the first two words, in the title. But that makes me uncomfortable.
%% Use the first two words of the title, or the only word if there is just one.
%% An empty title still fails, as before.
make_id(String) ->
    case string:tokens(String, " ") of
        [F, S | _] -> F ++ "_" ++ S;
        [F] -> F
    end.
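One possible way to automate the choice of words is to filter out stop words first and then take the first two words that remain. What follows is only a minimal sketch under that assumption: the module name short_id, the helper significant_words/1, and the very short stop-word list are illustrative and not part of the code above. It uses string:lexemes/2 and string:lowercase/1, which need OTP 20 or later.

```erlang
-module(short_id).
-export([make_id/1, significant_words/1]).

%% Hypothetical, deliberately tiny stop-word list; extend it to suit the titles.
stop_words() ->
    ["a", "an", "and", "in", "of", "on", "the", "to"].

%% Split a title on spaces and drop the stop words (case-insensitive match).
significant_words(Title) ->
    Words = string:lexemes(Title, " "),
    [W || W <- Words, not lists:member(string:lowercase(W), stop_words())].

%% Join the first two significant words with "_". Fall back to a single word,
%% or to the empty string when the title contains nothing but stop words.
make_id(Title) ->
    case significant_words(Title) of
        [F, S | _] -> F ++ "_" ++ S;
        [F]        -> F;
        []         -> ""
    end.
```

With this sketch, short_id:make_id("The Wind in the Willows") gives "Wind_Willows". Like the versions above, it does nothing to guarantee uniqueness, and the first two significant words are not necessarily the most meaningful ones.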
Best wishes. Much appreciate the help.
LRP
On Fri, Aug 13, 2021, at 4:19 PM, Hugo Mills wrote:
> On Fri, Aug 13, 2021 at 03:44:29PM -0400, Lloyd R. Prentice wrote:
> > Hello,
> >
> > What might be a nifty way to turn a long book title with spaces into a short human-readable ID?
>
> Depends rather on what purpose you want to put this ID to.
>
> One solution would be to hash it (with, say, sha256). If the hash is
> too long for "short", truncate it. Note that this is not a
> globally-unique value, as there are lots of books with identical
> titles.
>
> If you want a globally unique identifier for printed books, then
> ISBN is a reasonable one to use -- it's not precisely unique (there
> have been errors assigning the same ISBN to two different books, for
> example), but it's pretty good for most purposes.
>
> If you want an actual globally unique identifier, then some form of
> UUID would do the job (UUIDv4 is the easiest). Alternatively, you
> could register a DOI prefix and assign numbers inside your own
> numberspace within the DOI system.
>
> If you want something vaguely human-readable, try dropping all the
> stop-words (the, a, an, in, on, ...), all the vowels and all the
> spaces. Truncate at whatever your idea of "short" is. Like the hashing
> approach, it's not unique in the slightest.
>
> It all depends on your use-case.
>
> Hugo.
>
> --
> Hugo Mills | Great films about cricket: Interview with the Umpire
> hugo@REDACTED carfax.org.uk |
> http://carfax.org.uk/ |
> PGP: E2AB1DE4 |
>