[erlang-questions] byte() vs. char() use in documentation

Thu May 5 08:03:32 CEST 2011

On 2011-05-05, at 02:03 , Richard O'Keefe wrote:
> On 4/05/2011, at 7:33 AM, Anthony Shipman wrote:
> 
>> On Tue, 3 May 2011 07:45:49 pm Raimo Niskanen wrote:
>>> The programmer should regard strings as a sequence of unicode code points.
>>> As such they are just that and there is no encoding to bother about.
>>> The code point number uniquely defines which unicode character it is.
>> 
>> As I recall, a Unicode character can be composed of up to 7 code points.
> 
> I find this rather confusing.
> Here are some official definitions from the Unicode standard:
> 
>  Character.
>    (1) The smallest component of written language that has semantic value;
>        refers to the abstract meaning and/or shape, rather than a specific
>        shape (see also glyph), though in code tables some form of visual
>        representation is essential for the reader’s understanding.
>    (2) Synonym for abstract character.
>    (3) The basic unit of encoding for the Unicode character encoding.
>    (4) The English name for the ideographic written elements of Chinese origin.
>        [See ideograph (2).]
> 
>  Coded Character Set.
>    A character set in which each character is assigned a numeric code point.
>    Frequently abbreviated as character set, charset, or code set;
>    the acronym CCS is also used.
> 
>  Code Point.
>    (1) Any value in the Unicode codespace; that is, the range of integers
>        from 0 to 10FFFF(base 16).
>        (See definition D10 in Section 3.4, Characters and Encoding.)
>    (2) A value, or position, for a character, in any coded character set.
> 
>  Code Unit.
>    The minimal bit combination that can represent a unit of encoded text
>    for processing or interchange.  The Unicode Standard uses 8-bit code units
>    in the UTF-8 encoding form, 16-bit code units in the UTF-16 encoding form,
>    and 32-bit code units in the UTF-32 encoding form.
>    (See definition D77 in Section 3.9, Unicode Encoding Forms.)
> 
> Each Unicode character has *BY DEFINITION* precisely *ONE* code point.
> A code point is a number in the range 0 to 1,114,111.
> 
> The largest legal Unicode code point (hex 10FFFF) requires precisely
> FOUR code units:
> 
> 11110100 10001111 10111111 10111111      
> -----  3 --     6 --     6 --     6
> 
> The 11110 prefix on the leading byte says "here are four bytes";
> the "10" prefixes on the remaining bytes say "here are 6 more bits".
The original UTF-8 design makes allowance for full 31-bit code points (6 code
units when encoded), though, which may trip people up.
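
To make the bit layout concrete, here is a minimal Python sketch (Python rather
than Erlang, purely for illustration). The helper name `encode_utf8_original`
is hypothetical. It hand-encodes one code point with the original layout that
extends to 6 code units, and it does not reject surrogates or apply the
U+10FFFF ceiling that current UTF-8 (RFC 3629) imposes.

```python
def encode_utf8_original(cp: int) -> bytes:
    """Encode one code point with the original (up to 6-byte) UTF-8 layout.

    Hypothetical helper for illustration only: current UTF-8 (RFC 3629)
    stops at U+10FFFF, and therefore at 4 bytes.
    """
    if cp < 0 or cp > 0x7FFFFFFF:
        raise ValueError("code point outside the 31-bit range")
    if cp < 0x80:
        return bytes([cp])  # single byte: 0xxxxxxx

    # (exclusive upper bound, total number of bytes, leading-byte prefix)
    for limit, length, prefix in (
        (0x800, 2, 0xC0),       # 110xxxxx
        (0x10000, 3, 0xE0),     # 1110xxxx
        (0x200000, 4, 0xF0),    # 11110xxx
        (0x4000000, 5, 0xF8),   # 111110xx
        (0x80000000, 6, 0xFC),  # 1111110x
    ):
        if cp < limit:
            break

    out = []
    for _ in range(length - 1):
        out.append(0x80 | (cp & 0x3F))  # continuation byte: 10xxxxxx
        cp >>= 6
    out.append(prefix | cp)  # leading byte carries the remaining high bits
    return bytes(reversed(out))


four = encode_utf8_original(0x10FFFF)
print(" ".join(f"{b:08b}" for b in four))   # 11110100 10001111 10111111 10111111
print(four == "\U0010FFFF".encode("utf-8"))  # True
print(len(encode_utf8_original(0x7FFFFFFF)))  # 6
```

The first line printed is the same four-byte pattern quoted above. The last
one shows where the "6 code units" figure comes from: it only appears for
values well beyond the current Unicode codespace.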

> No Unicode code point requires more than four code units.
For now.

> And _that_ is talking about two other issues:
I strongly disagree. I believe this is the *core* of the whole issue, and
*this* is the reason why people are confused: a complete mastery of the Unicode
lingo (which the standard's definitions do not even provide: as you mentioned
in your comment, "character" has 4 different definitions, most of which cannot
be understood by the layman) and a very good capacity to differentiate common
speech from Unicode lingo are necessary to navigate Unicode discussions correctly.

The vast majority of developers do *not* possess these (not necessarily for lack
of trying), and the different senses of (mostly) the word "character" (which
can be hard to tell apart from context) lead to a minefield of
misunderstanding and frustration.
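
A small Python sketch of how the same text gets a different "length" depending
on which sense of "character" is meant (Python only for illustration; the
helper name `unit_counts` is hypothetical). It assumes Python 3, where `len()`
of a `str` counts code points.

```python
import unicodedata

# "e" followed by U+0301 COMBINING ACUTE ACCENT: one user-perceived
# character on screen, but two code points.
decomposed = "e\u0301"
# NFC normalization composes the pair into the single code point U+00E9.
composed = unicodedata.normalize("NFC", decomposed)


def unit_counts(s: str) -> dict:
    """Count one string in code points and in UTF-8/16/32 code units."""
    return {
        "code_points": len(s),
        "utf8_code_units": len(s.encode("utf-8")),
        "utf16_code_units": len(s.encode("utf-16-le")) // 2,
        "utf32_code_units": len(s.encode("utf-32-le")) // 4,
    }


print(unit_counts(decomposed))    # 2 code points, 3 / 2 / 2 code units
print(unit_counts(composed))      # 1 code point,  2 / 1 / 1 code units
print(unit_counts("\U0010FFFF"))  # 1 code point,  4 / 2 / 1 code units
```

Both strings display as the same accented letter, yet none of the four counts
agrees between them, which is exactly the kind of thing that gets lost when
everything is called a "character".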

I strongly believe it was a mistake for the Unicode Consortium to use this word.