[erlang-questions] byte() vs. char() use in documentation

Thu May 5 09:16:39 CEST 2011

AFAIK, the confusion comes from two different uses of the term "character".

The "individual character" is at the heart of Unicode. Each individual
character maps to a unique code point. For instance, a lowercase alpha is
the character named "GREEK SMALL LETTER ALPHA" and maps to code point
U+03B1. Unicode code points range from 0 to 0x10FFFF.
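
As a quick illustration of the character-to-code-point mapping, here is a
minimal sketch in Python (chosen only because its standard library exposes
code points and character names directly; the variable names are mine):

```python
import unicodedata

alpha = "\u03b1"  # the lowercase alpha mentioned above

code_point = ord(alpha)          # integer code point of the character
name = unicodedata.name(alpha)   # official Unicode character name

print(f"U+{code_point:04X} {name}")  # prints "U+03B1 GREEK SMALL LETTER ALPHA"

# The code space ends at 0x10FFFF: chr() accepts that value and
# rejects anything above it.
last = chr(0x10FFFF)
try:
    chr(0x110000)
    beyond_range_ok = True
except ValueError:
    beyond_range_ok = False
```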

The "logical character" is what human beings usually have in mind. In the
real world, a text is a sequence of logical characters. An example of such
a character is the lowercase letter "e" with an acute accent.

Some logical characters do not map directly to individual characters and
must be represented as a combination of several individual characters (I
think this is called an "extended grapheme cluster").

Some logical characters do map to individual characters and can therefore
have two different representations in Unicode:
 - with an individual character
 - with a combination of several individual characters

For instance, our "e" with an acute accent can be represented as:
 - the individual character "LATIN SMALL LETTER E WITH ACUTE" (U+00E9)
or
 - the combination "LATIN SMALL LETTER E" (U+0065) plus "COMBINING ACUTE
   ACCENT" (U+0301)

To cope with this, Unicode defines the notions of canonical and compatibility
equivalence (see http://en.wikipedia.org/wiki/Unicode_equivalence).
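
To make the two notions a bit more concrete, here is a small sketch using
Python's unicodedata module. The "fi" ligature (U+FB01) is my own example,
not something from the discussion so far; it is there only to show where
compatibility equivalence goes further than canonical equivalence.

```python
import unicodedata

precomposed = "\u00e9"   # LATIN SMALL LETTER E WITH ACUTE
combined = "e\u0301"     # LATIN SMALL LETTER E + COMBINING ACUTE ACCENT

# As raw code point sequences the two spellings are different strings.
raw_equal = precomposed == combined

# Canonical equivalence: after normalization (NFC here) both spellings
# become the same sequence.
nfc_equal = (unicodedata.normalize("NFC", precomposed)
             == unicodedata.normalize("NFC", combined))

# Compatibility equivalence is looser. The ligature U+FB01 is left alone
# by canonical normalization but is mapped to "f" + "i" by NFKC.
ligature = "\ufb01"
canonical_form = unicodedata.normalize("NFC", ligature)
compat_form = unicodedata.normalize("NFKC", ligature)
```

So the "e" with acute case is settled by canonical equivalence alone, while
treating the ligature as "fi" requires the compatibility forms.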

To come back to the point, we have to define what we mean by the Erlang
char() type:
 - if it's an individual character then it can naturally be represented as
   a single integer for its code point
 - if it's a logical character then it has to be a list of integers

In any case, the language must provide specific functions to work on strings
and characters. For instance, a logical character comparison must take into
account the Unicode equivalence.
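
A comparison of that kind could look roughly like this. It is a sketch in
Python, not a proposal for an actual Erlang API; same_logical_character and
its compatibility flag are hypothetical names, and logical characters are
taken to be lists of integer code points as in the second option above.

```python
import unicodedata

def same_logical_character(a, b, compatibility=False):
    """Compare two logical characters given as lists of code points.

    Both sides are decomposed first (NFD for canonical equivalence,
    NFKD for compatibility equivalence), then compared."""
    form = "NFKD" if compatibility else "NFD"

    def normalized(code_points):
        text = "".join(chr(cp) for cp in code_points)
        return unicodedata.normalize(form, text)

    return normalized(a) == normalized(b)

precomposed = [0x00E9]         # one individual character
combined = [0x0065, 0x0301]    # base letter plus combining accent

plain_equal = precomposed == combined                       # False
logical_equal = same_logical_character(precomposed, combined)  # True
```

A plain list comparison tells the two spellings apart, whereas the
equivalence-aware comparison treats them as the same logical character.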

Cheers,
__________________________________________________________
Lionel Cons        http://cern.ch/lionel.cons
CERN               http://cern.ch