[erlang-questions] byte() vs. char() use in documentation

Richard O'Keefe ok@REDACTED
Fri May 6 07:34:22 CEST 2011


On 5/05/2011, at 7:16 PM, Lionel Cons wrote:

> AFAIK, the confusion comes from two different uses of the term "character".
> 
> The "individual character" is at the heart of Unicode. Each individual
> character maps to a unique code point. For instance, a lowercase alpha is
> the character named "GREEK SMALL LETTER ALPHA" and maps to code point
> U+03B1. The Unicode code points are between 0 and 0x10FFFF.
> 
> The "logical character" is what human beings usually have in mind. In the
> real world, a text is a sequence of logical characters. An example of such
> a character is the lowercase letter "e" with an acute accent.

One major problem here is that there is no such thing as a universally
agreed "logical character".  In the context of my sister-in-law's name
(Chéri), "é" is a single accented letter.  In the context of a word like
"belovéd", perhaps in an English poem, "é" is two conceptually *separate*
characters, a letter "e" and a prosodic marker.  Now "é" can be encoded
in Unicode either as a single code point or as two code points, one of
them a floating diacritical, so we actually have four combinations:

	one human character, one code point
	one human character, two code points
	two human characters, one code point
	two human characters, two code points.
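
The two encodings in that list can be made concrete with a small sketch.
It is in Python rather than Erlang, only because Python's standard
`unicodedata` module makes it easy to run; the variable names are
illustrative assumptions, not anything from this thread.  It shows that
the same "é" can be one code point or two, and that nothing in either
sequence says whether a human reader would count one character or two.

```python
import unicodedata

# Precomposed form: a single code point, U+00E9.
composed = "\u00e9"
# Decomposed form: "e" (U+0065) followed by COMBINING ACUTE ACCENT (U+0301).
decomposed = "e\u0301"

# The code point sequences behind each spelling.
composed_points = [ord(c) for c in composed]
decomposed_points = [ord(c) for c in decomposed]

# As plain code point sequences the two spellings are different strings.
same_as_strings = composed == decomposed

# They are canonically equivalent: normalization maps one onto the other.
same_after_nfc = unicodedata.normalize("NFC", decomposed) == composed
same_after_nfd = unicodedata.normalize("NFD", composed) == decomposed
```

Normalization settles which code point sequence is used; it does not
settle whether the accent is part of the letter or a separate prosodic
mark, which is the distinction the list above turns on.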

This is why I am totally unimpressed by suggestions that we represent
"characters" as binaries containing encoded sequences of code points;
without a fair chunk of natural language processing, not to mention
AI, WE DON'T KNOW WHICH SEQUENCES OF CODE POINTS COUNT AS SINGLE
HUMAN CHARACTERS.

That is why I say that everyday Unicode interfaces should be in terms
of strings and *only* strings.
> 
> To come back to the point, we have to define what we mean by the Erlang
> char() type:
> - if it's an individual character then it can naturally be represented as
>   a single integer for its code point
> - if it's a logical character then it has to be a list of integers

Since we cannot know what a logical character is, and since we need *some*
representation of Unicode code points, I recommend that char()=code point.
> 
> In any case, the language must provide specific functions to work on strings
> and characters. For instance, a logical character comparison must take into
> account the Unicode equivalence.

What do you mean "THE" equivalence?
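
For context, the Unicode standard defines more than one equivalence:
canonical equivalence and compatibility equivalence.  The sketch below
(Python's standard `unicodedata` module, with hypothetical helper names
chosen for illustration) shows a pair of strings that is equal under one
and not under the other, so a "character comparison" has to say which
one it means.

```python
import unicodedata

def canonically_equal(a, b):
    """Compare two strings under canonical equivalence (via NFD)."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)

def compatibility_equal(a, b):
    """Compare two strings under compatibility equivalence (via NFKD)."""
    return unicodedata.normalize("NFKD", a) == unicodedata.normalize("NFKD", b)

ligature = "\ufb01"   # LATIN SMALL LIGATURE FI, one code point
two_letters = "fi"    # "f" followed by "i", two code points

# The ligature is only compatibility-equivalent to the two letters.
ligature_canonical = canonically_equal(ligature, two_letters)
ligature_compat = compatibility_equal(ligature, two_letters)

# Precomposed and decomposed "é" are equivalent under both relations.
accent_canonical = canonically_equal("\u00e9", "e\u0301")
accent_compat = compatibility_equal("\u00e9", "e\u0301")
```

Here the ligature and the two plain letters compare unequal canonically
but equal under compatibility equivalence, while the two spellings of
"é" compare equal under both.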



