[erlang-questions] byte() vs. char() use in documentation

Raimo Niskanen raimo+erlang-questions@REDACTED
Thu May 5 09:49:21 CEST 2011


On Thu, May 05, 2011 at 09:16:39AM +0200, Lionel Cons wrote:
> AFAIK, the confusion comes from two different uses of the term "character".
> 
> The "individual character" is at the heart of Unicode. Each individual
> character maps to a unique code point. For instance, a lowercase alpha is
> the character named "GREEK SMALL LETTER ALPHA" and maps to code point
> U+03B1. The Unicode code points are between 0 and 0x10FFFF.
> 
> The "logical character" is what human beings usually have in mind. In the
> real world, a text is a sequence of logical characters. An example of such
> a character is the lowercase letter "e" with an acute accent.
> 
> Some logical characters do not map directly to individual characters and
> must be represented as a combination of several individual characters (this
> is called, I think, an "extended grapheme cluster").
> 
> Some logical characters do map to individual characters and can therefore
> have two different representations in Unicode:
>  - with an individual character
>  - with a combination of several individual characters
> 
> For instance, our "e" with an acute accent can be represented as:
>  - the individual character "LATIN SMALL LETTER E WITH ACUTE" (U+00E9)
> or
>  - the combination "LATIN SMALL LETTER E" (U+0065) plus "COMBINING ACUTE
>    ACCENT" (U+0301)
> 
> To cope with this, Unicode defines the notions of canonical and compatibility
> equivalence (see http://en.wikipedia.org/wiki/Unicode_equivalence).
> 
> To come back to the point, we have to define what we mean by the Erlang
> char() type:
>  - if it's an individual character then it can naturally be represented as
>    a single integer for its code point
>  - if it's a logical character then it has to be a list of integers

According to your excellent clarification, the Erlang char() type today must
then be defined as a Unicode individual character, in the range 0 up to
0x10FFFF (there are invalid values, right?).
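
To make that definition concrete, here is a rough sketch (in Python, purely
for illustration; the helper name is_unicode_scalar is made up and is not an
existing Erlang/OTP function). It assumes that the "invalid values" include at
least the UTF-16 surrogate range 0xD800..0xDFFF, and it shows that a logical
character then has to be a list of such integers:

```python
# Hypothetical helper, for illustration only: checks whether an integer
# could be a char() in the "individual character" sense.
def is_unicode_scalar(cp):
    # Code points run from 0 to 0x10FFFF. The surrogate range
    # 0xD800..0xDFFF is reserved for UTF-16 and is assumed here to be
    # (part of) the "invalid values".
    return 0 <= cp <= 0x10FFFF and not (0xD800 <= cp <= 0xDFFF)

# One individual character is one integer ...
alpha = 0x03B1                       # GREEK SMALL LETTER ALPHA

# ... while a logical character may need a list of integers.
e_acute_combined = [0x0065, 0x0301]  # "e" + COMBINING ACUTE ACCENT
```

Whether other reserved values (for example the noncharacters) should also be
rejected is a separate decision that this sketch does not make.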

> 
> In any case, the language must provide specific functions to work on strings
> and characters. For instance, a logical character comparison must take into
> account the Unicode equivalence.

That is, as far as I know, unimplemented functionality. Some of it may fit
into the unicode module, and some might be left to a text processing
application to implement. Just implementing Unicode equivalence sounds
complicated and like a moving target, or something best implemented by OS
libraries.
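
As a rough illustration of what such a comparison involves, here is a sketch
in Python, chosen only because its standard unicodedata module already exposes
normalization; nothing equivalent is assumed to exist in Erlang/OTP. The
function name canonically_equal is hypothetical. It treats two strings as
equal when their NFC normal forms are identical:

```python
import unicodedata

def canonically_equal(a, b):
    # Hypothetical helper: two strings are canonically equivalent if
    # their NFC normal forms are identical.
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

precomposed = "\u00e9"   # LATIN SMALL LETTER E WITH ACUTE
combined = "e\u0301"     # LATIN SMALL LETTER E + COMBINING ACUTE ACCENT

# A plain comparison sees two different code point sequences.
plain_result = (precomposed == combined)

# The equivalence-aware comparison normalizes both sides first.
equivalent_result = canonically_equal(precomposed, combined)
```

Here plain_result is False and equivalent_result is True. The normalization
data behind such a function follows the Unicode version that the library
ships with, which is the "moving target" part.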

> 
> Cheers,
> __________________________________________________________
> Lionel Cons        http://cern.ch/lionel.cons
> CERN               http://cern.ch
> _______________________________________________
> erlang-questions mailing list
> erlang-questions@REDACTED
> http://erlang.org/mailman/listinfo/erlang-questions

-- 

/ Raimo Niskanen, Erlang/OTP, Ericsson AB


