[erlang-questions] byte() vs. char() use in documentation
Raimo Niskanen
raimo+erlang-questions@REDACTED
Thu May 5 09:49:21 CEST 2011
On Thu, May 05, 2011 at 09:16:39AM +0200, Lionel Cons wrote:
> AFAIK, the confusion comes from two different uses of the term "character".
>
> The "individual character" is at the heart of Unicode. Each individual
> character maps to a unique code point. For instance, a lowercase alpha is
> the character named "GREEK SMALL LETTER ALPHA" and maps to code point
> U+03B1. The Unicode code points are between 0 and 0x10FFFF.
>
> The "logical character" is what human beings usually have in mind. In the
> real world, a text is a sequence of logical characters. An example of such
> a character is the lowercase letter "e" with an acute accent.
>
> Some logical characters do not map directly to individual characters and
> must be represented as a combination of several individual characters (this
> is called, I think, an "extended grapheme cluster").
>
> Some logical characters do map to individual characters and can therefore
> have two different representations in Unicode:
> - with an individual character
> - with a combination of several individual characters
>
> For instance, our "e" with an acute accent can be represented as:
> - the individual character "LATIN SMALL LETTER E WITH ACUTE" (U+00E9)
> or
> - the combination "LATIN SMALL LETTER E" (U+0065) plus "COMBINING ACUTE
> ACCENT" (U+0301)
>
> To cope with this, Unicode defines the notions of canonical and compatibility
> equivalence (see http://en.wikipedia.org/wiki/Unicode_equivalence).
>
> To come back to the point, we have to define what we mean by the Erlang
> char() type:
> - if it's an individual character then it can naturally be represented as
> a single integer for its code point
> - if it's a logical character then it has to be a list of integers
The Erlang char() type today must then, according to your excellent
clarification, be defined as a Unicode individual character, range 0 up to
0x10FFFF (there are invalid values in that range, right?).
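
To make that range concrete, here is a rough sketch in Python (not Erlang). It assumes the "invalid values" are the UTF-16 surrogate code points U+D800..U+DFFF, which are code points in the range but cannot stand for a character on their own. The function name is_unicode_scalar is made up for this illustration.

```python
def is_unicode_scalar(cp: int) -> bool:
    """Return True if cp is in 0..0x10FFFF and is not a surrogate.

    Assumption: "invalid" means the surrogate range only. Other
    definitions (for example, also excluding noncharacters such as
    U+FFFE) are possible and are not handled here.
    """
    if not 0 <= cp <= 0x10FFFF:
        return False
    # U+D800..U+DFFF are reserved for UTF-16 surrogate pairs.
    return not 0xD800 <= cp <= 0xDFFF


print(is_unicode_scalar(0x03B1))    # GREEK SMALL LETTER ALPHA
print(is_unicode_scalar(0xD800))    # a surrogate
print(is_unicode_scalar(0x110000))  # past the end of the range
```

Under this definition an individual character always fits in a single integer, which matches the first reading of char() above.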
>
> In any case, the language must provide specific functions to work on strings
> and characters. For instance, a logical character comparison must take into
> account the Unicode equivalence.
That is, as far as I know, unimplemented functionality. Some of it may fit into
the unicode module, and some might be left to a text processing application
to implement. Just implementing Unicode equivalence sounds complicated and
like a moving target, or something best implemented by OS libraries.
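
For illustration only, this is roughly what an equivalence-aware comparison looks like in a language whose standard library ships Unicode normalization. The sketch uses Python's unicodedata module and the "e" with acute accent example quoted above. The helper name logically_equal is hypothetical, and comparing NFC forms covers canonical equivalence only, not compatibility equivalence.

```python
import unicodedata


def logically_equal(a: str, b: str) -> bool:
    """Compare two strings under canonical equivalence.

    Both inputs are normalized to NFC (canonical composition) before
    comparison, so precomposed and decomposed spellings compare equal.
    """
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


precomposed = "\u00e9"   # LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"   # LATIN SMALL LETTER E + COMBINING ACUTE ACCENT

print(len(precomposed), len(decomposed))         # 1 2
print(precomposed == decomposed)                 # False
print(logically_equal(precomposed, decomposed))  # True
```

The plain comparison fails because the two spellings are different sequences of code points. Only the normalized comparison treats them as the same logical character.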
>
> Cheers,
> __________________________________________________________
> Lionel Cons http://cern.ch/lionel.cons
> CERN http://cern.ch
> _______________________________________________
> erlang-questions mailing list
> erlang-questions@REDACTED
> http://erlang.org/mailman/listinfo/erlang-questions
--
/ Raimo Niskanen, Erlang/OTP, Ericsson AB