[erlang-questions] byte() vs. char() use in documentation
Richard O'Keefe
ok@REDACTED
Fri May 6 07:34:22 CEST 2011
On 5/05/2011, at 7:16 PM, Lionel Cons wrote:
> AFAIK, the confusion comes from two different uses of the term "character".
>
> The "individual character" is at the heart of Unicode. Each individual
> character maps to a unique code point. For instance, a lowercase alpha is
> the character named "GREEK SMALL LETTER ALPHA" and maps to code point
> U+03B1. The Unicode code points are between 0 and 0x10FFFF.
>
> The "logical character" is what human beings usually have in mind. In the
> real world, a text is a sequence of logical characters. An example of such
> a character is the lowercase letter "e" with an acute accent.
One major problem here is that there is no such thing as a universally
agreed "logical character". In the context of my sister-in-law's name
(Chéri), "é" is a single accented letter. In a word such as
"belovéd", perhaps in an English poem, "é" is two conceptually *separate*
characters, a letter "e" and a prosodic marker. Now "é" can be encoded
in Unicode either as a single code point or as two code points, one of
them a floating diacritical, so we actually have four combinations:
one human character, one code point
one human character, two code points
two human characters, one code point
two human characters, two code points.
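The two encodings can be seen directly. The following is a minimal sketch in Python, chosen only for illustration; the helper name code_points is hypothetical. It shows that the precomposed and decomposed forms of "é" are different code point sequences, and that nothing in either sequence says how many human characters were meant.

```python
import unicodedata

composed = "\u00e9"     # LATIN SMALL LETTER E WITH ACUTE, one code point
decomposed = "e\u0301"  # "e" followed by COMBINING ACUTE ACCENT, two code points

def code_points(s):
    """Return the code points of s as a list of integers (hypothetical helper)."""
    return [ord(c) for c in s]

print(code_points(composed))    # [233]
print(code_points(decomposed))  # [101, 769]

# The two strings render alike but do not compare equal as code point sequences.
print(composed == decomposed)   # False

# Normalization maps one encoding to the other; it says nothing about
# whether a reader regards the result as one character or two.
print(unicodedata.normalize("NFC", decomposed) == composed)  # True
print(unicodedata.normalize("NFD", composed) == decomposed)  # True
```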
This is why I am totally unimpressed by suggestions that we represent
"characters" as binaries containing encoded sequences of code points;
without a fair chunk of natural language processing, not to mention
AI, WE DON'T KNOW WHICH SEQUENCES OF CODE POINTS COUNT AS SINGLE
HUMAN CHARACTERS.
That is why I say that everyday Unicode interfaces should be in terms
of strings and *only* strings.
>
> To come back to the point, we have to define what we mean by the Erlang
> char() type:
> - if it's an individual character then it can naturally be represented as
> a single integer for its code point
> - if it's a logical character then it has to be a list of integers
Since we cannot know what a logical character is, and since we need *some*
representation of Unicode code points, I recommend that char()=code point.
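As a rough sketch of what "char() = code point" amounts to, assuming only the range 0 to 0x10FFFF quoted above: a character is an integer in that range, and a string is a list of such integers. The names is_char and to_code_points are hypothetical, and Python is used purely for illustration.

```python
MAX_CODE_POINT = 0x10FFFF  # upper bound of the Unicode code space

def is_char(x):
    """True if x is an integer in the code point range (hypothetical check).

    This deliberately tests the range only; it does not exclude surrogates
    or unassigned code points.
    """
    return isinstance(x, int) and not isinstance(x, bool) and 0 <= x <= MAX_CODE_POINT

def to_code_points(text):
    """Represent a string as a list of code points."""
    return [ord(c) for c in text]

name = to_code_points("Ch\u00e9ri")
print(name)                            # [67, 104, 233, 114, 105]
print(all(is_char(c) for c in name))   # True
print(is_char(0x110000))               # False
```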
>
> In any case, the language must provide specific functions to work on strings
> and characters. For instance, a logical character comparison must take into
> account the Unicode equivalence.
What do you mean "THE" equivalence?
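Unicode defines more than one: canonical equivalence and compatibility equivalence give different answers for the same pair of strings. A minimal sketch in Python, for illustration only, with hypothetical helper names:

```python
import unicodedata

def canonically_equivalent(a, b):
    """Compare under canonical decomposition (NFD)."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)

def compatibility_equivalent(a, b):
    """Compare under compatibility decomposition (NFKD)."""
    return unicodedata.normalize("NFKD", a) == unicodedata.normalize("NFKD", b)

ligature = "\ufb01"  # LATIN SMALL LIGATURE FI

# Precomposed and decomposed "é" are equivalent under both relations.
print(canonically_equivalent("\u00e9", "e\u0301"))    # True
print(compatibility_equivalent("\u00e9", "e\u0301"))  # True

# The "fi" ligature matches the two letters "fi" only under compatibility.
print(canonically_equivalent(ligature, "fi"))    # False
print(compatibility_equivalent(ligature, "fi"))  # True
```

Which of these a "character comparison" should use depends on the application, and neither one settles what a reader counts as a single character.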