[erlang-questions] byte() vs. char() use in documentation
Thu May 5 09:16:39 CEST 2011
AFAIK, the confusion comes from two different uses of the term "character".
The "individual character" is at the heart of Unicode. Each individual
character maps to a unique code point. For instance, a lowercase alpha is
the character named "GREEK SMALL LETTER ALPHA" and maps to code point
U+03B1. The Unicode code points are between 0 and 0x10FFFF.
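As a small illustration of the character-to-code-point mapping, here is a sketch in Python (chosen only because its standard library exposes the Unicode character database; the variable names are made up for the example):

```python
import unicodedata

alpha = "\u03b1"          # the individual character for lowercase alpha
code_point = ord(alpha)   # its code point as a plain integer

print(hex(code_point))                 # the code point in hexadecimal
print(unicodedata.name(alpha))         # the official Unicode character name
print(0 <= code_point <= 0x10FFFF)     # code points stay within this range
```

In this view a character is nothing more than an integer in the range 0 to 0x10FFFF.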
The "logical character" is what human beings usually have in mind. In the
real world, a text is a sequence of logical characters. An example of such
a character is the lowercase letter "e" with an acute accent.
Some logical characters do not map directly to individual characters and
must be represented as a combination of several individual characters (this
is called, I think, an "extended grapheme cluster").
Some logical characters do map to individual characters and can therefore
have two different representations in Unicode:
- with an individual character
- with a combination of several individual characters
For instance, our "e" with an acute accent can be represented as:
- the individual character "LATIN SMALL LETTER E WITH ACUTE" (U+00E9)
- the combination "LATIN SMALL LETTER E" (U+0065) plus "COMBINING ACUTE
ACCENT" (U+0301)
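The two representations can be written out as code point sequences. This is only a sketch in Python with illustrative variable names; the combining mark used is COMBINING ACUTE ACCENT (U+0301):

```python
precomposed = "\u00e9"   # LATIN SMALL LETTER E WITH ACUTE
combined = "e\u0301"     # LATIN SMALL LETTER E + COMBINING ACUTE ACCENT

# Turn each string into its list of integer code points.
precomposed_points = [ord(c) for c in precomposed]
combined_points = [ord(c) for c in combined]

print([hex(p) for p in precomposed_points])   # one code point
print([hex(p) for p in combined_points])      # two code points
print(precomposed == combined)                # a naive comparison sees two different strings
```

Both sequences display as the same logical character, yet they differ in length and content when compared code point by code point.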
To cope with this, Unicode defines the notions of canonical and compatibility
equivalence (see http://en.wikipedia.org/wiki/Unicode_equivalence).
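A rough sketch of both kinds of equivalence, using the normalization forms from Python's standard `unicodedata` module. The ligature U+FB01 is my own choice of example for the compatibility case, not something taken from the discussion:

```python
import unicodedata

precomposed = "\u00e9"   # single code point
combined = "e\u0301"     # base letter followed by a combining accent

# Canonical equivalence: NFC composes, NFD decomposes.
nfc_equal = unicodedata.normalize("NFC", combined) == precomposed
nfd_equal = unicodedata.normalize("NFD", precomposed) == combined

# Compatibility equivalence is looser. U+FB01 LATIN SMALL LIGATURE FI is
# left alone by NFC but folded to the two letters "fi" by NFKC.
ligature = "\ufb01"
nfc_ligature = unicodedata.normalize("NFC", ligature)
nfkc_ligature = unicodedata.normalize("NFKC", ligature)

print(nfc_equal, nfd_equal)
print(nfc_ligature == ligature, nfkc_ligature == "fi")
```

Two strings are canonically equivalent when they normalize to the same NFC (or NFD) form, and compatibility equivalent when they normalize to the same NFKC (or NFKD) form.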
To come back to the point, we have to define what we mean by an Erlang character:
- if it's an individual character then it can naturally be represented as
a single integer for its code point
- if it's a logical character then it has to be a list of integers
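To make the two options concrete, here is a sketch in Python. The helper `logical_characters` is hypothetical, and it uses a deliberately simplified rule (a base code point plus any combining marks that follow it), which is only an approximation of real extended grapheme cluster segmentation:

```python
import unicodedata

def logical_characters(text):
    """Group code points into lists: a base code point plus trailing combining marks."""
    clusters = []
    for ch in text:
        if clusters and unicodedata.combining(ch) != 0:
            # A combining mark attaches to the preceding logical character.
            clusters[-1].append(ord(ch))
        else:
            clusters.append([ord(ch)])
    return clusters

text = "e\u0301t\u00e9"   # decomposed e-acute, "t", precomposed e-acute

individual = [ord(c) for c in text]   # one integer per individual character
logical = logical_characters(text)    # one list of integers per logical character

print(individual)
print(logical)
```

With the first definition the string has four characters; with the second it has three, the first of which is a two-element list.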
In any case, the language must provide specific functions to work on strings
and characters. For instance, a logical character comparison must take into
account the Unicode equivalence.
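A minimal sketch of such a comparison, again in Python. It assumes logical characters are passed as lists of code points; `same_logical_character` is a hypothetical name, and only canonical equivalence is considered:

```python
import unicodedata

def same_logical_character(a, b):
    """Compare two logical characters (lists of code points) under canonical equivalence."""
    def canonical(points):
        # Decompose fully so that both representations end up identical.
        return unicodedata.normalize("NFD", "".join(chr(p) for p in points))
    return canonical(a) == canonical(b)

precomposed = [0x00E9]        # e with acute as one code point
combined = [0x0065, 0x0301]   # e followed by a combining acute accent

print(precomposed == combined)                         # plain list comparison
print(same_logical_character(precomposed, combined))   # equivalence-aware comparison
print(same_logical_character(precomposed, [0x0065]))   # a plain "e" is a different character
```

The plain list comparison treats the two representations as different, which is why a dedicated function is needed.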
Lionel Cons http://cern.ch/lionel.cons