[erlang-questions] string:lexemes/2 - an old man's rant

Hugo Mills hugo@REDACTED
Tue May 7 09:56:01 CEST 2019

On Mon, May 06, 2019 at 05:40:48PM -0400, lloyd@REDACTED wrote:
> Hi,
> This has come up before with various work-arounds suggested. Apologies for this old-man's rant, but every time I run across the impending death of string:tokens/2 to the glory of string:lexemes/2 my blood pressure rises.
> I HATE IT. I HATE IT. I HATE IT, not least because the terms lexeme and grapheme are ugly inside-baseball words. Reading the docs, I have to do a Google search to understand what these obscure terms are referring to -- precious time wasted. And with my waning years, I don't have time to waste.
> Even my spell-checker doesn't recognize them.
> I get the desirability of welcoming Unicode into Erlang. But can't we come up with friendlier nomenclature or, at least, revise the docs so they don't sound like a copy-and-paste out of an academic linguistics journal?

   Most of the other words you might want to use are already in use
for other things. Modern (computer) representation of writing systems
is complicated, and there aren't enough words to go round the existing
concepts, particularly words without well-known and either misleading
or overly-narrow definitions -- see my comment on "letters", below.

   For the two particular words you're complaining of here, I think of
them thus:

   graphemes, like graphology(*), are to do with the way that
   something's written on the page -- the shape and composition of the
   symbols. A grapheme is essentially a letter plus all of its
   diacritics (but it's not defined as such, because some graphemes
   are ligatures of two or more letters, and in some languages each
   grapheme is a word in its own right).

   lexemes, like a lexicon, are to do with words, and are therefore
   groups of (certain kinds of) graphemes.
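
   As a rough illustration of how the two terms show up in the string
module, here is a minimal sketch, assuming OTP 20 or later. The module
name grapheme_demo and the function demo/0 are made up for this
example; the string functions are the ones from the OTP documentation.
Note that string:lexemes/2 only splits on the separators you give it,
so its "lexemes" are words only in a loose sense.

```erlang
-module(grapheme_demo).
-export([demo/0]).

%% Hypothetical demo function: every line is a pattern match that
%% fails with badmatch if the stated result does not hold.
demo() ->
    %% "e" followed by U+030A COMBINING RING ABOVE:
    %% two code points that are written as one symbol.
    S = [$e, 16#030A],
    2 = length(S),                          % list length: code points
    1 = string:length(S),                   % string length: graphemes
    [[$e, 16#030A]] = string:to_graphemes(S),

    %% Lexemes: the pieces left after splitting on separator
    %% graphemes. Adjacent separators produce no empty pieces.
    ["foo", "bar", "baz"] = string:lexemes("foo bar,,baz", " ,"),

    %% The older string:tokens/2 gives the same answer for this plain
    %% ASCII input (newer compilers may flag it as obsolete).
    ["foo", "bar", "baz"] = string:tokens("foo bar,,baz", " ,"),
    ok.
```

   Calling grapheme_demo:demo() from the shell should return ok if all
of the matches hold.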


(*) For all that it's unsubstantiated in its psychometric claims.

