[erlang-questions] Strings - deprecated functions

Thu Nov 23 03:36:35 CET 2017

lloyd@REDACTED wrote:
> I read further in the Erlang docs and I see "grapheme cluster."  WHAT THE
> ... IS GRAPHEME CLUSTER?
>
> I look up "grapheme" in Merriam-Webster. Oh it is now all so clear: "a
> unit of a writing system."

"Grapheme" and "grapheme cluster" are technical terms in Unicode.
The best place to look is probably UAX #29, Unicode Text Segmentation:
http://unicode.org/reports/tr29/
Section 3 begins with this paragraph, which should help:
  It is important to recognize that what the user thinks of
  as a “character”—a basic unit of a writing system for a
  language—may not be just a single Unicode code point.
  Instead, that basic unit may be made up of multiple Unicode
  code points.  To avoid ambiguity with the computer use of the
  term character, this is called a user-perceived character.
  For example, “G” + acute-accent is a user-perceived character:
  users think of it as a single character, yet is actually
  represented by two Unicode code points.  These user-perceived
  characters are approximated by what is called a grapheme cluster,
  which can be determined programmatically.

> It sounds like someone took a linguistics class and is trying to show off.

It would be pretty horrifying if many of the people defining Unicode
hadn't taken a linguistics class or three...  It's actually a very
obvious practical problem:  suppose you are in your favorite editor
and press the "move forward 1 character" key.  The distinction between
Unicode and UTF-8 makes it sufficiently clear that this doesn't mean
"move forward one byte" (C), and the distinction between Unicode and
UTF-16 makes it sufficiently clear that it doesn't mean "move forward
one 16-bit char" (Java).  But it doesn't mean "move forward one Unicode
code point" either.  There is no limit in principle to the number of
code points in a user-perceived character.  Figuring out just how many
code points make up a "character" (= grapheme cluster) is sufficiently
tricky that you do not want to do it yourself.
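
To make the four units concrete, here is a minimal sketch that counts
the same text in bytes, UTF-16 code units, code points, and grapheme
clusters, and then takes one "move forward 1 character" step.  It
assumes OTP 20 or later (where the string module works on grapheme
clusters); the module name grapheme_demo and the function names units/1
and forward/1 are made up for illustration.

```erlang
-module(grapheme_demo).
-export([units/1, forward/1]).

%% Count the same text in four different units.
%% Chars is assumed to be valid Unicode chardata (list or UTF-8 binary).
units(Chars) ->
    Utf8  = unicode:characters_to_binary(Chars),
    %% utf16 here means big-endian UTF-16 with no byte order mark.
    Utf16 = unicode:characters_to_binary(Chars, unicode, utf16),
    #{bytes             => byte_size(Utf8),
      utf16_units       => byte_size(Utf16) div 2,
      code_points       => length(unicode:characters_to_list(Utf8)),
      grapheme_clusters => string:length(Utf8)}.

%% "Move forward one character": split off the first grapheme cluster
%% and return it together with the remaining text.
forward(Chars) ->
    case string:next_grapheme(Chars) of
        [Cluster | Rest] -> {Cluster, Rest};
        []               -> done;
        {error, _} = Err -> Err   % invalid encoding in the input
    end.
```

With the UAX #29 example, "G" followed by U+0301 (combining acute
accent), grapheme_demo:units([$G, 16#301]) should report 3 bytes,
2 UTF-16 code units, 2 code points, and 1 grapheme cluster, and
grapheme_demo:forward([$G, 16#301, $x]) should return {[71,769],"x"}:
one step forward skips both code points.  All of the segmentation
work is left to string:next_grapheme/1 and string:length/1.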

The text *I* generate is almost exclusively Latin-1, but it is less
and less common for me to *get* data in that form.  I too would like
full retention of Latin-1 support >>for data I am fully in control of<<.