[erlang-questions] string:lexemes/2 - an old man's rant

Tue May 7 15:45:17 CEST 2019

Words ending with the morpheme "-eme" generally come from linguistics.
In particular, "grapheme" can only be defined with respect to a
particular writing system.  "In linguistics
<https://en.wikipedia.org/wiki/Linguistics>, a *grapheme* is the smallest
unit of a writing system <https://en.wikipedia.org/wiki/Writing_system> of
any given language."  For example, in English,
"ë" is two graphemes, an "e" grapheme, and a "pronounce this vowel
separately" grapheme.  In other European languages, "e" and "ë" are
quite separate letters.

What Hugo Mills described is not a grapheme but a grapheme *cluster*.
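
To make the distinction concrete, here is a minimal sketch in Python (used
only for illustration; nothing here is Erlang's API).  It counts the same
"ë" three ways.  The helper `rough_clusters` is a hypothetical name, and it
only approximates grapheme clusters by attaching combining marks to the
preceding base character; the real segmentation rules in Unicode cover many
more cases.

```python
import unicodedata

def rough_clusters(text):
    """Group each base character with the combining marks that follow it.

    Only an approximation of grapheme clusters: it ignores Hangul jamo,
    emoji sequences, and the other Unicode segmentation rules.
    """
    clusters = []
    for ch in text:
        if clusters and unicodedata.combining(ch) != 0:
            clusters[-1] += ch      # attach the mark to the previous base
        else:
            clusters.append(ch)     # start a new cluster
    return clusters

precomposed = "\u00eb"                                   # "ë" as one code point
decomposed = unicodedata.normalize("NFD", precomposed)   # "e" + U+0308

# (precomposed, decomposed) counts at three different levels
code_points = (len(precomposed), len(decomposed))
utf8_units = (len(precomposed.encode("utf-8")),
              len(decomposed.encode("utf-8")))
clusters = (len(rough_clusters(precomposed)),
            len(rough_clusters(decomposed)))
```

Both spellings are one cluster to a reader, yet they differ in code points
and in UTF-8 code units, which is exactly why the terms stop being
interchangeable.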

We have code unit, code point, glyph, character, grapheme, grapheme
cluster, and a bunch of other terms that are pretty much interchangeable
in ASCII or ISO 8859.  But when you make a serious attempt to encode
all the scripts anyone wants to use on a computer, things get
horribly complicated.  And they get complicated in *language-specific*
ways.  (Like case conversion.  You can't really do case conversion in
Unicode without knowing what language you are concerned with.)  This
always *was* complicated in the real world, but people in Western
Europe and the Americas were mostly able to ignore it.  (Things got
somewhat complicated in NZ where the indigenous language uses a
Latin-based script with macrons and where wh and ng count as single
letters.)
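
A small Python sketch of why the language matters for case conversion
(Python only for illustration).  Python's built-in `str.upper` and
`str.lower` apply Unicode's default, language-neutral mappings and take no
locale argument.  `lower_turkish` is a hypothetical helper that handles only
the dotted/dotless i pair; it is nowhere near a complete Turkish locale.

```python
def lower_turkish(text):
    """Lowercase with the Turkish dotted/dotless i rule (sketch only)."""
    # Map the two capital i's by hand, then let the default mapping do the rest.
    return text.replace("\u0130", "i").replace("I", "\u0131").lower()

# Default mapping: "ß" uppercases to two letters, so length changes.
german = "straße".upper()

# Default mapping turns "I" into dotted "i"; Turkish wants dotless "ı".
default_lower = "DIŞ".lower()
turkish_lower = lower_turkish("DIŞ")

# Default mapping of "İ" (U+0130) yields "i" plus a combining dot above.
dotted_lower = "\u0130".lower()
```

The default and the Turkish-aware results are different strings for the same
input, so a case-conversion function cannot be right without knowing which
language it is serving.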

Curiously, in the Unicode 12 standard, "grapheme" is not in the index,
but "grapheme base", "grapheme cluster", and "grapheme extender", for
example, are.

I suspect that the word "grapheme", precisely because it is a
language-dependent technical term with some surprising twists,
may not be a good word to use here.

"Lexeme" is, if anything, worse.  "A *lexeme* is a unit of lexical
<https://en.wikipedia.org/wiki/Lexical_semantics> meaning that
underlies a set of words that are related through inflection. It is a basic
abstract unit of meaning, a unit <https://en.wikipedia.org/wiki/Emic_unit>
of morphological <https://en.wikipedia.org/wiki/Morphology_(linguistics)>
analysis <https://en.wikipedia.org/wiki/Semantic_analysis_(linguistics)> in
linguistics <https://en.wikipedia.org/wiki/Linguistics> that
roughly corresponds to a set of forms taken by a single root word
<https://en.wikipedia.org/wiki/Word>."  That is
NOT what it means here.  In computing, it basically means "token".  But what
*does* it mean?  In "Now we see it, now we don't." are there two lexemes
spelled "we" or is there one "lexeme" with two occurrences?  (If you ever
meet two linguists in a bar who don't know each other, try asking them what
a "word" is.  There are at least four different meanings.)

"token" has the merit of coming from one half of the type/token distinction.
In fact, that's *WHY* they are called tokens.  In "Now we see it, now we
don't" there is ONE word type "we" which has TWO tokens.
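
The type/token count for that sentence can be sketched in a few lines of
Python (illustration only; the regular expression is a simplifying
assumption about what counts as a word, and case is folded so that "Now"
and "now" share a type).

```python
import re
from collections import Counter

sentence = "Now we see it, now we don't."

# Every occurrence is a token; apostrophes are kept so "don't" stays whole.
tokens = re.findall(r"[a-z']+", sentence.lower())

# Each distinct spelling is a type; the Counter maps type -> number of tokens.
types = Counter(tokens)
```

Here `tokens` has seven entries, `types` has five keys, and `types["we"]`
is 2: one type, two tokens.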

So seriously, as someone who has been reading academic linguistics for
several decades and has spent more time trying to understand Unicode than
is compatible with sanity, I think the OP's objection carries weight.
(I said I've been *reading* the stuff.  That's not always the same as
*understanding* it, and I certainly couldn't *write* like a linguist.)

On Tue, 7 May 2019 at 19:56, Hugo Mills <hugo@REDACTED> wrote:

> On Mon, May 06, 2019 at 05:40:48PM -0400, lloyd@REDACTED wrote:
> > Hi,
> >
> > This has come up before with various work-arounds suggested. Apologies
> > for this old-man's rant, but every time I run across the impending death of
> > string:tokens/2 to the glory of string:lexemes/2 my blood pressure rises.
> >
> > I HATE IT. I HATE IT. I HATE IT, not least because the terms lexeme and
> > grapheme are ugly inside-baseball words. Reading the docs, I have to do a
> > Google search to understand what these obscure terms are referring to--
> > precious time wasted. And with my waning years, I don't have time to waste.
> >
> > Even my spell-checker doesn't recognize them.
> >
> > I get the desirability of welcoming unicode into Erlang. But can't we
> > come up with friendlier nomenclature or, at least, revise the docs so they
> > don't sound like copy-and-paste out of an academic linguistics journal?
>
>
>    Most of the other words you might want to use are already in use
> for other things. Modern (computer) representation of writing systems
> is complicated, and there's not enough words to go round the existing
> concepts. Particularly words without well-known and either misleading
> or overly-narrow definitions -- see my comment on "letters", below.
>
>    For the two particular words you're complaining of here, I think of
> them thus:
>
>    graphemes, like graphology(*), are to do with the way that
>    something's written on the page -- the shape and composition of the
>    symbols. It's essentially a letter plus all of its diacritics (but
>    it's not defined as such, because there are some graphemes that are
>    ligatures of two or more letters, and some languages where each
>    grapheme is a word in its own right).
>
>    lexemes, like a lexicon, are to do with words, and are therefore
>    groups of (certain kinds of) graphemes.
>
>    Hugo.
>
> (*) For all that it's unsubstantiated in its psychometric claims.
>
> --
> Hugo Mills             | I can't foretell the future, I just work there.
> hugo@REDACTED carfax.org.uk |
> http://carfax.org.uk/  |
> PGP: E2AB1DE4          |                                    The Doctor