[erlang-questions] byte() vs. char() use in documentation
Richard O'Keefe
ok@REDACTED
Thu May 5 02:03:28 CEST 2011
On 4/05/2011, at 7:33 AM, Anthony Shipman wrote:
> On Tue, 3 May 2011 07:45:49 pm Raimo Niskanen wrote:
>> The programmer should regard strings as a sequence of unicode code points.
>> As such they are just that and there is no encoding to bother about.
>> The code point number uniquely defines which unicode character it is.
>
> As I recall, a Unicode character can be composed of up to 7 code points.
I find this rather confusing.
Here are some official definitions from the Unicode standard:
Character.
(1) The smallest component of written language that has semantic value;
refers to the abstract meaning and/or shape, rather than a specific
shape (see also glyph), though in code tables some form of visual
representation is essential for the reader’s understanding.
(2) Synonym for abstract character.
(3) The basic unit of encoding for the Unicode character encoding.
(4) The English name for the ideographic written elements of Chinese origin.
[See ideograph (2).]
Coded Character Set.
A character set in which each character is assigned a numeric code point.
Frequently abbreviated as character set, charset, or code set;
the acronym CCS is also used.
Code Point.
(1) Any value in the Unicode codespace; that is, the range of integers
from 0 to 10FFFF (base 16).
(See definition D10 in Section 3.4, Characters and Encoding.)
(2) A value, or position, for a character, in any coded character set.
Code Unit.
The minimal bit combination that can represent a unit of encoded text
for processing or interchange. The Unicode Standard uses 8-bit code units
in the UTF-8 encoding form, 16-bit code units in the UTF-16 encoding form,
and 32-bit code units in the UTF-32 encoding form.
(See definition D77 in Section 3.9, Unicode Encoding Forms.)
Each Unicode character has *BY DEFINITION* precisely *ONE* code point.
A code point is a number in the range 0 to 1,114,111.
The largest legal Unicode code point (hex 10FFFF) requires precisely
FOUR UTF-8 code units:
    11110100 10001111 10111111 10111111
    -----3   --6      --6      --6
The 11110 prefix on the leading byte says "here are four bytes";
the "10" prefixes on the remaining bytes each say "here are 6 more bits".
That is 3 + 6 + 6 + 6 = 21 payload bits, enough for any code point.
No Unicode code point requires more than four UTF-8 code units.
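To see it concretely, here is what an Erlang shell session might look
like (a minimal sketch using the standard unicode module; the shell
prompts are illustrative):

    1> unicode:characters_to_binary([16#10FFFF]).
    <<244,143,191,191>>
    2> <<2#11110100, 2#10001111, 2#10111111, 2#10111111>>.
    <<244,143,191,191>>

Those four bytes (244, 143, 191, 191) are exactly the four code units
shown above, written in decimal.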
> To quote a text book I'm looking at now:
> -------------
> The trick is, again, to disabuse yourself of the idea that a one-to-one
> correspondence exists between "characters" as the user is used to thinking of
> them and code points (or code units) in the backing store.
Sorry, it looks as though you need a better textbook.
Code points and code units are NOT the same thing (at least for UTF-8 and
UTF-16).
There IS, by definition, a one-to-one correspondence between Unicode
characters and code points (though not every code point has been
assigned a character yet).
> Unicode uses the
> term "character" to mean more or less "the entity that's represented by a
> single Unicode code point," but this concept doesn't always match the user's
> definition of "character".
And _that_ is talking about two other issues:
(1) Unicode classifies code points as Graphic, Format, Control, Private-Use,
Surrogate, Noncharacter, or Reserved. Only the Graphic characters are
ones that users are likely to think of as characters.
(2) Things that the user thinks of as a character (like é) may be represented
by sequences of code points, called Grapheme Clusters, consisting of a
base character and some nonspacing marks. This has nothing to do with
encodings.
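For example, here is a small Erlang sketch (the code point values are
U+00E9 LATIN SMALL LETTER E WITH ACUTE and U+0301 COMBINING ACUTE
ACCENT):

    %% "é" as a single precomposed code point:
    Precomposed = [16#00E9],
    %% "é" as a grapheme cluster: base 'e' plus a combining accent:
    Cluster = [$e, 16#0301],
    %% They display the same, but are different code point sequences:
    false = (Precomposed =:= Cluster).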
> I think a more complete design would represent a character as a binary that is
> a UTF8 encoding of its code points. A string would then be a deep list of
> these binaries.
Once again, a Unicode character has *by definition* one code point;
and from a storage point of view, it's pretty silly to use a big thing like
a binary to represent a 21-bit integer.
The main principle to understand about Unicode is to *always* think in terms of
strings, not of characters.
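In Erlang terms, that looks something like this (a minimal sketch using
the standard unicode module):

    %% A string is just a list of Unicode code points:
    S = [$h, $e, 16#0301, 16#10FFFF],
    %% Encode the whole string to a UTF-8 binary in one step:
    B = unicode:characters_to_binary(S),
    %% ... and decode it back to the same list of code points:
    S = unicode:characters_to_list(B).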