<html><head><meta http-equiv="content-type" content="text/html; charset=utf-8"></head><body dir="auto">Hi Folks,<div><br></div><div>I’m certainly not smart enough to resolve this issue. But it seems somewhat like the tension between writing software that solves every imaginable problem in the domain and software that solves the immediate problem at hand.  We know in the first case that mounting complexity quickly gets out of hand.</div><div><br></div><div>If I expect users from every one of the 7,000-some-odd living languages to use my web app, then no doubt Unicode beats ASCII hands down. But how should I handle prompts? A 7,000-option case statement maybe? And where do I find the translators?</div><div><br></div><div>Aside from that, I have no quarrel with Unicode. I’m grateful that it enables my programs to respect many language conventions. But I would much prefer keeping string:tokens/2 in the Erlang string library and renaming string:lexemes/2 to something like string:unicode_tokens/2. If nothing else, this would take a considerable burden off the documentation.</div><div><br></div><div>But what do I know?</div><div><br></div><div>All the best,</div><div><br></div><div>Lloyd</div><div><br><div id="AppleMailSignature" dir="ltr">Sent from my iPad</div><div dir="ltr"><br>On May 7, 2019, at 9:45 AM, Richard O'Keefe <<a href="mailto:raoknz@gmail.com">raoknz@gmail.com</a>> wrote:<br><br></div><blockquote type="cite"><div dir="ltr"><div dir="ltr"><div dir="ltr"><div dir="ltr"><div class="gmail_default" style="font-family:monospace,monospace">Words ending with the morpheme "-eme" generally come from linguistics.</div><div class="gmail_default" style="font-family:monospace,monospace">In particular, "grapheme" can only be defined with respect to a</div><div class="gmail_default" style="font-family:monospace,monospace">particular writing system.  
"In <a href="https://en.wikipedia.org/wiki/Linguistics" title="Linguistics" target="_blank">linguistics</a>, a <b>grapheme</b> is the smallest unit of a <a href="https://en.wikipedia.org/wiki/Writing_system" title="Writing system" target="_blank">writing system</a> of any given language."  For example, in English,</div><div class="gmail_default" style="font-family:monospace,monospace">"ë" is two graphemes, an "e" grapheme, and a "pronounce this vowel</div><div class="gmail_default" style="font-family:monospace,monospace">separately" grapheme.  In other European languages, "e" and "ë" are</div><div class="gmail_default" style="font-family:monospace,monospace">quite separate letters.</div><div class="gmail_default" style="font-family:monospace,monospace"><br></div><div class="gmail_default" style="font-family:monospace,monospace">What Hugo Mills described is not a grapheme but a grapheme *cluster*.</div><div class="gmail_default" style="font-family:monospace,monospace"><br></div><div class="gmail_default" style="font-family:monospace,monospace">We have code unit, code point, glyph, character, grapheme, grapheme</div><div class="gmail_default" style="font-family:monospace,monospace">cluster, and a bunch of other terms that are pretty much identical</div><div class="gmail_default" style="font-family:monospace,monospace">in ASCII or ISO 8859 but when you make a serious attempt to encode</div><div class="gmail_default" style="font-family:monospace,monospace">all the scripts anyone wants to use on a computer, things get</div><div class="gmail_default" style="font-family:monospace,monospace">horribly complicated.  And they get complicated in *language-specific*</div><div class="gmail_default" style="font-family:monospace,monospace">ways.  (Like case conversion.  You can't really do case conversion in</div><div class="gmail_default" style="font-family:monospace,monospace">Unicode without knowing what language you are concerned with.)  
This</div><div class="gmail_default" style="font-family:monospace,monospace">always *was* complicated in the real world, but people in Western</div><div class="gmail_default" style="font-family:monospace,monospace">Europe and the Americas were mostly able to ignore it.  (Things got</div><div class="gmail_default" style="font-family:monospace,monospace">somewhat complicated in NZ where the indigenous language uses a</div><div class="gmail_default" style="font-family:monospace,monospace">Latin-based script with macrons and where wh and ng count as single</div><div class="gmail_default" style="font-family:monospace,monospace">letters.)</div><div class="gmail_default" style="font-family:monospace,monospace"><br></div><div class="gmail_default" style="font-family:monospace,monospace">Curiously, in the Unicode 12 standard, "grapheme" is not in the index,</div><div class="gmail_default" style="font-family:monospace,monospace">but "grapheme base", "grapheme cluster", and "grapheme extender", for</div><div class="gmail_default" style="font-family:monospace,monospace">example, are.</div><div class="gmail_default" style="font-family:monospace,monospace"><br></div><div class="gmail_default" style="font-family:monospace,monospace">I suspect that the word "grapheme", precisely because it is a</div><div class="gmail_default" style="font-family:monospace,monospace">language-dependent technical term with some surprising twists,</div><div class="gmail_default" style="font-family:monospace,monospace">may not be a good word to use here.</div><div class="gmail_default" style="font-family:monospace,monospace"><br></div><div class="gmail_default" style="font-family:monospace,monospace">"Lexeme" is, if anything, worse. 
"A <b>lexeme</b>  is a unit of <a href="https://en.wikipedia.org/wiki/Lexical_semantics" title="Lexical semantics" target="_blank">lexical</a> meaning that</div><div class="gmail_default" style="font-family:monospace,monospace"> underlies a set of words that are related through inflection. It is a basic</div><div class="gmail_default" style="font-family:monospace,monospace"> abstract unit of meaning, a <a href="https://en.wikipedia.org/wiki/Emic_unit" title="Emic unit" target="_blank">unit</a> of <a href="https://en.wikipedia.org/wiki/Morphology_(linguistics)" title="Morphology (linguistics)" target="_blank">morphological</a> <a href="https://en.wikipedia.org/wiki/Semantic_analysis_(linguistics)" title="Semantic analysis (linguistics)" target="_blank">analysis</a> in <a href="https://en.wikipedia.org/wiki/Linguistics" title="Linguistics" target="_blank">linguistics</a> that</div><div class="gmail_default" style="font-family:monospace,monospace">roughly corresponds to a set of forms taken by a single root <a href="https://en.wikipedia.org/wiki/Word" title="Word" target="_blank">word</a>."  That is</div><div class="gmail_default" style="font-family:monospace,monospace">NOT what it means here.  In computing, it basically means "token".  But what</div><div class="gmail_default" style="font-family:monospace,monospace">*does* it mean?  In "Now we see it, now we don't." are there two lexemes</div><div class="gmail_default" style="font-family:monospace,monospace">spelled "we" or is there one "lexeme" with two occurrences?  (If you ever</div><div class="gmail_default" style="font-family:monospace,monospace">meet two linguists in a bar who don't know each other, try asking them what</div><div class="gmail_default" style="font-family:monospace,monospace">a "word" is.  
There are at least four different meanings.)</div><div class="gmail_default" style="font-family:monospace,monospace"><br></div><div class="gmail_default" style="font-family:monospace,monospace">"token" has the merit of coming from one half of the type/token distinction.</div><div class="gmail_default" style="font-family:monospace,monospace">In fact, that's *WHY* they are called tokens.  In "Now we see it, now we</div><div class="gmail_default" style="font-family:monospace,monospace">don't" there is ONE word type "we" which has TWO tokens.</div><div class="gmail_default" style="font-family:monospace,monospace"><br></div><div class="gmail_default" style="font-family:monospace,monospace">So seriously, as someone who has been reading academic linguistics for</div><div class="gmail_default" style="font-family:monospace,monospace">several decades and has spent more time trying to understand Unicode than</div><div class="gmail_default" style="font-family:monospace,monospace">is compatible with sanity, I think the OP's objection carries weight.</div><div class="gmail_default" style="font-family:monospace,monospace">(I said I've been *reading* the stuff.  That's not always the same as</div><div class="gmail_default" style="font-family:monospace,monospace">*understanding* it, and I certainly couldn't *write* like a linguist.)</div><div class="gmail_default" style="font-family:monospace,monospace"><br></div></div></div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Tue, 7 May 2019 at 19:56, Hugo Mills <<a href="mailto:hugo@carfax.org.uk" target="_blank">hugo@carfax.org.uk</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">On Mon, May 06, 2019 at 05:40:48PM -0400, <a href="mailto:lloyd@writersglen.com" target="_blank">lloyd@writersglen.com</a> wrote:<br>

> Hi,<br>

> <br>

> This has come up before, with various workarounds suggested. Apologies for this old man's rant, but every time I run across the impending death of string:tokens/2 to the glory of string:lexemes/2, my blood pressure rises.<br>

> <br>

> I HATE IT. I HATE IT. I HATE IT, not least because the terms lexeme and grapheme are ugly inside-baseball words. Reading the docs, I have to do a Google search to understand what these obscure terms refer to, and that is precious time wasted. In my waning years, I don't have time to waste.<br>

> <br>

> Even my spell-checker doesn't recognize them.<br>

> <br>

> I get the desirability of welcoming Unicode into Erlang. But can't we come up with friendlier nomenclature or, at least, revise the docs so they don't sound like a copy-and-paste out of an academic linguistics journal? <br>

<br>

<br>

   Most of the other words you might want to use are already in use<br>

for other things. Modern (computer) representation of writing systems<br>

is complicated, and there aren't enough words to go round the existing<br>

concepts. Particularly words without well-known and either misleading<br>

or overly-narrow definitions -- see my comment on "letters", below.<br>

<br>

   For the two particular words you're complaining of here, I think of<br>

them thus:<br>

<br>

   graphemes, like graphology(*), are to do with the way that<br>

   something's written on the page -- the shape and composition of the<br>

   symbols. It's essentially a letter plus all of its diacritics (but<br>

   it's not defined as such, because there are some graphemes that are<br>

   ligatures of two or more letters, and some languages where each<br>

   grapheme is a word in its own right).<br>

<br>

   lexemes, like a lexicon, are to do with words, and are therefore<br>

   groups of (certain kinds of) graphemes.<br>

<br>

   Hugo.<br>

<br>

(*) For all that it's unsubstantiated in its psychometric claims.<br>

<br>

-- <br>

Hugo Mills             | I can't foretell the future, I just work there.<br>

hugo@... <a href="http://carfax.org.uk" rel="noreferrer" target="_blank">carfax.org.uk</a> |<br>

<a href="http://carfax.org.uk/" rel="noreferrer" target="_blank">http://carfax.org.uk/</a>  |<br>

PGP: E2AB1DE4          |                                            The Doctor<br>

_______________________________________________<br>

erlang-questions mailing list<br>

<a href="mailto:erlang-questions@erlang.org" target="_blank">erlang-questions@erlang.org</a><br>

<a href="http://erlang.org/mailman/listinfo/erlang-questions" rel="noreferrer" target="_blank">http://erlang.org/mailman/listinfo/erlang-questions</a><br>

</blockquote></div>

</div></blockquote></div></body></html>