[erlang-questions] string:lexeme/s2 - an old man's rant

Wed May 8 16:18:17 CEST 2019

On Wednesday, May 8, 2019 10:53:25 JST Richard O'Keefe wrote:
> For what it's worth, in Unicode, Line Separator and Paragraph
> Separator are the recommended characters, with CR, LF, CR+LF,
> and arguably NEL (U+0085) being "legacy".
> 
> Again for what it's worth, Unicode defines an algorithm for
> breaking text into word( token)s.

I don't really mind the term "lexeme", but I've wondered why the
existing tokens/2 function wasn't simply updated to work the way
lexemes/2 works.
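
For anyone who hasn't compared the two side by side, here is a rough
sketch of the overlap and the difference. The module name lexeme_demo
is made up for illustration, and the behavior assumed is that of the
stdlib string module in OTP 20 or later; treat it as a sketch, not a
spec.

```erlang
-module(lexeme_demo).
-export([run/0]).

%% Hypothetical demo module; every match below crashes if the
%% assumption it encodes turns out to be wrong.
run() ->
    Str = "abc defxxghix jkl",
    Seps = "x ",
    %% Old API: documented for flat character lists only.
    Tokens = string:tokens(Str, Seps),
    ["abc", "def", "ghi", "jkl"] = Tokens,
    %% New API: same result on plain ASCII input.
    Tokens = string:lexemes(Str, Seps),
    %% lexemes/2 also takes binaries and returns binaries.
    [<<"abc">>, <<"def">>] = string:lexemes(<<"abc def">>, " "),
    %% A separator can be a multi-codepoint grapheme cluster (CR+LF).
    ["a", "b"] = string:lexemes("a\r\nb", [[$\r, $\n]]),
    ok.
```

On plain ASCII lists the two agree, which is why updating tokens/2 in
place looks tempting; the visible differences are in the accepted
input types and in separators being grapheme clusters.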

If we needed a new function, it seems the name "tokenize/2" might
have been an easier mental adjustment.

But anyway, naming things is hard and... meh. For me the Unicode
enhancements are a big enough deal that I could *almost* care less
what they are called.

That said, who isn't going to open a new language's string lib and
expect to find things called "split", "tokenize"/"tokens", "clean",
"right", "left", "pad", etc.?

-Craig