[erlang-questions] string:lexeme/s2 - an old man's rant

Lloyd R. Prentice lloyd@REDACTED
Wed May 8 00:02:08 CEST 2019


Hi Michael,

Your point about users (documentation consumers, for example) who know too little vs. users who know too much is well taken. Count me in the first bunch.

> Both names force meaning onto mere substrings plucked from
> some string argument chopped to pieces at some separator
> characters (with `tokens`) or substrings (`lexemes`).

If I had a god-like wand I’d survey every instance, in every computer language, where a programmer intends to split natural language text into a list at the indices that mark the beginning of some predefined sub-segment of the text, where that sub-segment may recur zero to n times.

Yuk! Even trying to describe the problem abstractly gets ugly fast since, as Richard astutely points out, we don’t have terminology we can agree on. Is a “string” a passage of natural language, an array of bytes, an arrangement of bits irrespective of byte boundaries, an Erlang list, or some other entity?

Seems to me that Unicode valiantly wrestles with the problem of mapping analog representations of “meaning” into digital representation. Problem is that there are countless ways of making analog marks on paper, stone, or what have you. Even the universe of marks used by the 7,000+ living languages is a formidable number. When we try to map these into the digital realm we either shamefully waste memory resources by giving them all equal length, or we’re forced to deal with the nasty problem of determining where one mark ends and the next begins in our digital space.
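
To put rough numbers on that trade-off, here is a minimal sketch, assuming OTP 20 or later. The module and function names (mark_widths, byte_sizes/1, mark_counts/1) are made up for illustration; only the unicode and string calls come from the standard library.

```erlang
-module(mark_widths).
-export([byte_sizes/1, mark_counts/1]).

%% Bytes needed for a list of code points in a variable-width encoding
%% (UTF-8) and in a fixed-width one (UTF-32). Hypothetical helper.
byte_sizes(CodePoints) ->
    Utf8  = unicode:characters_to_binary(CodePoints, unicode, utf8),
    Utf32 = unicode:characters_to_binary(CodePoints, unicode, utf32),
    {byte_size(Utf8), byte_size(Utf32)}.

%% Number of code points versus number of user-perceived marks
%% (grapheme clusters, which is what string:length/1 counts).
mark_counts(CodePoints) ->
    {length(CodePoints), string:length(CodePoints)}.
```

With this, mark_widths:byte_sizes("abc") gives {3,12} and mark_widths:byte_sizes([16#20AC]) (the euro sign) gives {3,4}: equal length wastes space on plain ASCII, while variable length makes the boundaries harder to find. mark_widths:mark_counts([$e, 16#030A]) gives {2,1}, since an "e" followed by a combining ring is two code points but one mark.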

ASCII solves this problem quite elegantly at the price of excluding much of the world’s population. Unicode is far more inclusive at the price of greater code complexity and muddled discourse re naming of parts. You pay your money and you take your choice.

I’m arguing for choice. Keep the simple ASCII-based string functions in the Erlang string library and either create a separate Unicode library or provide Unicode string functions with more suggestive/evocative names.
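
For anyone following along, a small sketch of how the two functions under discussion behave, as far as I understand them on OTP 20 or later. The module name split_demo and its two functions are hypothetical; string:tokens/2 and string:lexemes/2 are the real stdlib calls.

```erlang
-module(split_demo).
-export([plain/0, crlf/0]).

%% On plain ASCII input with single-character separators the older
%% and the newer function give the same answer.
plain() ->
    Input = "alpha, beta,,gamma",
    {string:tokens(Input, ", "), string:lexemes(Input, ", ")}.

%% tokens/2 treats every character in its second argument as a
%% separator. lexemes/2 takes a list of grapheme clusters, so the
%% two-code-point cluster "\r\n" can be given as one separator.
crlf() ->
    Input = "one\r\ntwo\nthree",
    {string:tokens(Input, "\r\n"),
     string:lexemes(Input, [[$\r, $\n]])}.
```

split_demo:plain() returns {["alpha","beta","gamma"],["alpha","beta","gamma"]}, while split_demo:crlf() returns {["one","two","three"],["one","two\nthree"]}. The second result is where the ASCII view and the Unicode view part ways: the lone "\n" is not the "\r\n" cluster, so lexemes/2 leaves it in place.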

All the best,

Lloyd

P.S. Michael, I’m all for clearer documentation with illustrative examples.

> On May 7, 2019, at 4:17 PM, <empro2@REDACTED> <empro2@REDACTED> wrote:
> 
> Both names force meaning onto mere substrings plucked from
> some string argument chopped to pieces at some separator
> characters (with `tokens`) or substrings (`lexemes`).



