[erlang-questions] byte() vs. char() use in documentation
Masklinn
masklinn@REDACTED
Thu May 5 08:03:32 CEST 2011
On 2011-05-05, at 02:03 , Richard O'Keefe wrote:
> On 4/05/2011, at 7:33 AM, Anthony Shipman wrote:
>
>> On Tue, 3 May 2011 07:45:49 pm Raimo Niskanen wrote:
>>> The programmer should regard strings as a sequence of unicode code points.
>>> As such they are just that and there is no encoding to bother about.
>>> The code point number uniquely defines which unicode character it is.
>>
>> As I recall, a Unicode character can be composed of up to 7 code points.
>
> I find this rather confusing.
> Here are some official definitions from the Unicode standard:
>
> Character.
> (1) The smallest component of written language that has semantic value;
> refers to the abstract meaning and/or shape, rather than a specific
> shape (see also glyph), though in code tables some form of visual
> representation is essential for the reader’s understanding.
> (2) Synonym for abstract character.
> (3) The basic unit of encoding for the Unicode character encoding.
> (4) The English name for the ideographic written elements of Chinese origin.
> [See ideograph (2).]
>
> Coded Character Set.
> A character set in which each character is assigned a numeric code point.
> Frequently abbreviated as character set, charset, or code set;
> the acronym CCS is also used.
>
> Code Point.
> (1) Any value in the Unicode codespace; that is, the range of integers
> from 0 to 10FFFF(base 16).
> (See definition D10 in Section 3.4, Characters and Encoding.)
> (2) A value, or position, for a character, in any coded character set.
>
> Code Unit.
> The minimal bit combination that can represent a unit of encoded text
> for processing or interchange. The Unicode Standard uses 8-bit code units
> in the UTF-8 encoding form, 16-bit code units in the UTF-16 encoding form,
> and 32-bit code units in the UTF-32 encoding form.
> (See definition D77 in Section 3.9, Unicode Encoding Forms.)
>
> Each Unicode character has *BY DEFINITION* precisely *ONE* code point.
> A code point is a number in the range 0 to 1,114,111.
>
> The largest legal Unicode code point (hex 10FFFF) requires precisely
> FOUR code units:
>
> 11110100 10001111 10111111 10111111
> -----  3 --     6 --     6 --     6
>
> The 11110 prefix on the leading byte says "here are four bytes";
> the "10" prefixes on the remaining bytes say "here are 6 more bits".
UTF-8 as originally specified made allowance for full 31-bit code points
(6 code units when encoded), though, which may trip people up.
> No Unicode code point requires more than four code units.
For now.
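
To make the code-unit counts concrete, here is a minimal sketch in Python (chosen only for illustration; it is not from the thread, and the variable names are mine). It encodes U+10FFFF in the three encoding forms quoted above and counts the code units each one needs:

```python
# Encode the largest code point, U+10FFFF, in the three Unicode encoding
# forms and count the code units each form needs.
largest = chr(0x10FFFF)

utf8 = largest.encode("utf-8")
utf16 = largest.encode("utf-16-be")   # big-endian, so no byte order mark is added
utf32 = largest.encode("utf-32-be")

utf8_units = len(utf8)          # 8-bit code units
utf16_units = len(utf16) // 2   # 16-bit code units
utf32_units = len(utf32) // 4   # 32-bit code units

# Show the UTF-8 code units as bit patterns, as in the quoted diagram.
utf8_bits = " ".join(f"{b:08b}" for b in utf8)
print(utf8_bits)
print(utf8_units, utf16_units, utf32_units)
```

The bit patterns printed should match the four bytes in the quoted diagram. Current Python rejects anything above 0x10FFFF in `chr()`, so the longer 5- and 6-byte forms cannot be produced this way.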
> And _that_ is talking about two other issues:
I strongly disagree. I believe this is the *core* of the whole issue, and
*this* is the reason why people are confused: a complete mastery of the Unicode
lingo (which the standard's definitions do not even provide, as you mentioned
in your comment: "character" has 4 different definitions, most of which cannot
be understood by the layman) and a very good capacity to differentiate common
speech from Unicode lingo are necessary to navigate Unicode discussions correctly.
The vast majority of developers do *not* possess these (not necessarily for lack
of trying), and the differences in the status of (mostly) the word "character"
(which can be hard to understand from context) lead to a minefield of
misunderstanding and frustration.
I strongly believe it was a mistake for the Unicode consortium to use this word.
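
As a small illustration of the two senses of "character" colliding, here is a hedged sketch in Python (my own example, not from the thread). What a reader sees as one character, "é", can be a sequence of two code points, and normalization to NFC turns it into one:

```python
import unicodedata

# 'e' followed by U+0301 COMBINING ACUTE ACCENT: one "character" to a reader,
# two code points (two "characters" in the encoding sense) to Unicode.
decomposed = "e\u0301"

# NFC normalization composes the pair into the single code point U+00E9.
composed = unicodedata.normalize("NFC", decomposed)

print([f"U+{ord(c):04X}" for c in decomposed])
print(len(decomposed), len(composed))
```

Both strings render the same, yet they differ in length and do not compare equal unless normalized first. That is exactly the kind of trap the ambiguous word sets up.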