[erlang-questions] byte() vs. char() use in documentation

Wed May 4 10:40:41 CEST 2011

On 2011-05-04, at 09:57 , Raimo Niskanen wrote:

> On Wed, May 04, 2011 at 05:33:58AM +1000, Anthony Shipman wrote:
>> On Tue, 3 May 2011 07:45:49 pm Raimo Niskanen wrote:
>>> The programmer should regard strings as a sequence of unicode code points.
>>> As such they are just that and there is no encoding to bother about.
>>> The code point number uniquely defines which unicode character it is.
>> 
>> As I recall, a Unicode character can be composed of up to 7 code points.
>> To quote a text book I'm looking at now:
>> -------------
>> The trick is, again, to disabuse yourself of the idea that a one-to-one 
>> correspondence exists between "characters" as the user is used to thinking of 
>> them and code points (or code units) in the backing store. Unicode uses the 
>> term "character" to mean more or less "the entity that's represented by a 
>> single Unicode code point," but this concept doesn't always match the user's 
>> definition of "character".
>> -------------
> 
> There seems to be a terminology clash here that I will remember for the future.
> When I talked about "Unicode code points" I meant the character number
> in the Unicode system. I did not think it was allowed to talk about "code points"
> when talking about byte-encoded data.
Well, code points are abstract numbers, but UTF-32 (as far as I know) encodes each
code point as itself, so many people take the shortcut. Furthermore, most people
aren't really interested in understanding Unicode (and I can understand that,
it's a drag), so they mix Unicode lingo with "normal" speech, leading to
less-than-sensical results.
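
To make the UTF-32 remark concrete, here is a minimal sketch using the standard
unicode module. The module and function names (utf32_sketch, utf32_bytes/1) are
made up for illustration; big-endian byte order is an assumption chosen so the
bytes read in the same order as the code point number.

```erlang
-module(utf32_sketch).
-export([utf32_bytes/1]).

%% Encode a list of code points (integers) as big-endian UTF-32.
%% Every code point becomes one 32-bit unit holding its own number.
utf32_bytes(CodePoints) ->
    unicode:characters_to_binary(CodePoints, unicode, {utf32, big}).
```

For example, utf32_sketch:utf32_bytes([16#12F]) gives <<0,0,1,47>>, which is
just 16#0000012F written out as four bytes.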

I believe the issue Anthony mentions here is the difference between glyphs and code
points (combining marks) rather than the difference between code points and
on-disk bytes (resulting from Unicode encoding): a "visible character" (e.g. į̇́)
can be composed of multiple code points, namely one "base" code point and a number
of combining-mark code points (diacritics being the main offender). NB: the glyph
"į̇́" is, in fact, composed of three code points: U+012F, U+0307 and U+0301.

What most users think of as a character is what Unicode calls a grapheme cluster
(loosely, the glyph mentioned above): a group of combined code points that is
displayed as one unit (that group may be unary). Whereas in Unicode, a character
is the abstract entity represented by a single code point. As a result, a "user"
character may be composed of a number of "Unicode" characters.
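
The three levels discussed in this thread (user-visible characters, code points,
encoded bytes) can be counted side by side. This is only a sketch: the module and
function names (char_counts, counts/1) are hypothetical, and it assumes OTP 20 or
later, where string:length/1 counts grapheme clusters rather than list elements.

```erlang
-module(char_counts).
-export([counts/1]).

%% Count one string at three levels. The argument is a list of code points.
%% Assumes OTP 20+, where string:length/1 counts grapheme clusters.
counts(CodePoints) ->
    Utf8 = unicode:characters_to_binary(CodePoints),
    #{grapheme_clusters => string:length(CodePoints),  % "user" characters
      code_points       => length(CodePoints),         % "Unicode" characters
      utf8_bytes        => byte_size(Utf8)}.           % bytes once UTF-8 encoded
```

Calling char_counts:counts([16#12F, 16#307, 16#301]) on the example above reports
1 grapheme cluster, 3 code points and 6 UTF-8 bytes.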