[erlang-questions] string:lexemes/2 - an old man's rant

Tue May 7 18:04:46 CEST 2019

Hi Folks,

I’m certainly not smart enough to resolve this issue. But it seems somewhat like the tension between writing software that solves every imaginable problem in the domain and software that solves the immediate problem at hand.  We know in the first case that mounting complexity quickly gets out of hand.

If I expect users from every one of the 7,000-some-odd living languages to use my web app, then no doubt Unicode beats ASCII hands down. But how should I handle prompts? A 7,000-option case statement maybe? And where do I find the translators?

Aside from that, I have no quarrel with Unicode. I’m grateful that it enables my programs to respect many language conventions. But I would much prefer keeping string:tokens/2 in the Erlang string library and renaming string:lexemes/2 to something like string:unicode_tokens/2. If nothing else, this would take a considerable burden off the documentation.

But what do I know?

All the best,

Lloyd

> On May 7, 2019, at 9:45 AM, Richard O'Keefe <raoknz@REDACTED> wrote:
> 
> Words ending with the morpheme "-eme" generally come from linguistics.
> In particular, "grapheme" can only be defined with respect to a
> particular writing system.  "In linguistics, a grapheme is the smallest unit of a writing system of any given language."  For example, in English,
> "ë" is two graphemes, an "e" grapheme, and a "pronounce this vowel
> separately" grapheme.  In other European languages, "e" and "ë" are
> quite separate letters.
> 
> What Hugo Mills described is not a grapheme but a grapheme *cluster*.
> 
> We have code unit, code point, glyph, character, grapheme, grapheme
> cluster, and a bunch of other terms that are pretty much identical
> in ASCII or ISO 8859, but when you make a serious attempt to encode
> all the scripts anyone wants to use on a computer, things get
> horribly complicated.  And they get complicated in *language-specific*
> ways.  (Like case conversion.  You can't really do case conversion in
> Unicode without knowing what language you are concerned with.)  This
> always *was* complicated in the real world, but people in Western
> Europe and the Americas were mostly able to ignore it.  (Things got
> somewhat complicated in NZ where the indigenous language uses a
> Latin-based script with macrons and where wh and ng count as single
> letters.)
> 
> Curiously, in the Unicode 12 standard, "grapheme" is not in the index,
> but "grapheme base", "grapheme cluster", and "grapheme extender", for
> example, are.
> 
> I suspect that the word "grapheme", precisely because it is a
> language-dependent technical term with some surprising twists,
> may not be a good word to use here.
> 
> "Lexeme" is, if anything, worse. "A lexeme is a unit of lexical meaning that
> underlies a set of words that are related through inflection. It is a basic
> abstract unit of meaning, a unit of morphological analysis in linguistics that
> roughly corresponds to a set of forms taken by a single root word."  That is
> NOT what it means here.  In computing, it basically means "token".  But what
> *does* it mean?  In "Now we see it, now we don't." are there two lexemes
> spelled "we" or is there one "lexeme" with two occurrences?  (If you ever
> meet two linguists in a bar who don't know each other, try asking them what
> a "word" is.  There are at least four different meanings.)
> 
> "token" has the merit of coming from one half of the type/token distinction.
> In fact, that's *WHY* they are called tokens.  In "Now we see it, now we
> don't" there is ONE word type "we" which has TWO tokens.
> 
> So seriously, as someone who has been reading academic linguistics for
> several decades and has spent more time trying to understand Unicode than
> is compatible with sanity, I think the OP's objection carries weight.
> (I said I've been *reading* the stuff.  That's not always the same as
> *understanding* it, and I certainly couldn't *write* like a linguist.)
> 
> 
>> On Tue, 7 May 2019 at 19:56, Hugo Mills <hugo@REDACTED> wrote:
>> On Mon, May 06, 2019 at 05:40:48PM -0400, lloyd@REDACTED wrote:
>> > Hi,
>> > 
>> > This has come up before with various work-arounds suggested. Apologies for this old-man's rant, but every time I run across the impending death of string:tokens/2 to the glory of string:lexemes/2 my blood pressure rises.
>> > 
>> > I HATE IT. I HATE IT. I HATE IT, not least because the terms lexeme and grapheme are ugly inside-baseball words. Reading the docs, I have to do a Google search to understand what these obscure terms are referring to-- precious time wasted. And with my waning years, I don't have time to waste.
>> > 
>> > Even my spell-checker doesn't recognize them.
>> > 
>> > I get the desirability of welcoming Unicode into Erlang. But can't we come up with friendlier nomenclature or, at least, revise the docs so they don't sound like a copy-and-paste out of an academic linguistics journal?
>> 
>> 
>>    Most of the other words you might want to use are already in use
>> for other things. Modern (computer) representation of writing systems
>> is complicated, and there aren't enough words to go round the existing
>> concepts. Particularly words without well-known and either misleading
>> or overly-narrow definitions -- see my comment on "letters", below.
>> 
>>    For the two particular words you're complaining of here, I think of
>> them thus:
>> 
>>    graphemes, like graphology(*), are to do with the way that
>>    something's written on the page -- the shape and composition of the
>>    symbols. It's essentially a letter plus all of its diacritics (but
>>    it's not defined as such, because there are some graphemes that are
>>    ligatures of two or more letters, and some languages where each
>>    grapheme is a word in its own right).
>> 
>>    lexemes, like a lexicon, are to do with words, and are therefore
>>    groups of (certain kinds of) graphemes.
>> 
>>    Hugo.
>> 
>> (*) For all that it's unsubstantiated in its psychometric claims.
>> 
>> -- 
>> Hugo Mills             | I can't foretell the future, I just work there.
>> hugo@REDACTED carfax.org.uk |
>> http://carfax.org.uk/  |
>> PGP: E2AB1DE4          |                                            The Doctor