[erlang-questions] byte() vs. char() use in documentation

Richard O'Keefe ok@REDACTED
Tue May 10 04:14:31 CEST 2011


Let's start by quoting the Unicode standard itself.
Chapter 5:

	Data types for “characters” generally hold just a single
	Unicode code point value for low-level processing and
	lookup of character property values.
	When a primitive data type is used for single-code point
	values, a signed integer type can be useful;
	negative values can hold “sentinel” values like
	end-of-string or end-of-file, which can be easily
	distinguished from Unicode code point values.
	However, in most APIs, string types should be used to
	accommodate user-perceived characters,
	which may require sequences of code points.

If I'm reading this correctly, the recommendation is to have TWO
ways of representing textual information in the same language and
even in the same program:
 BOTH a one 'character' = one code point = one integer version
  AND a one 'string' = one 'user-perceived character' version.
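
In Erlang terms that would mean keeping both levels around.  A
sketch only (nothing here is mandated by the standard):

	%% Level 1: one "character" = one code point = one integer.
	%% This is exactly Erlang's native string representation.
	Pro1 = [$p, $r, 16#00F3],           % "pró", precomposed ó
	%% Level 2: one list element = one user-perceived character,
	%% each of which may need a sequence of code points.
	Pro2 = [[$p], [$r], [$o, 16#0301]]. % "pró", o + combining acute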

It's truly unfortunate that determining what some particular user
perceives as a character is so hard.

On 9/05/2011, at 6:15 PM, Masklinn wrote:
>>> Why pick code points rather than grapheme cluster?
>> 
>> For so many many reasons I haven't the patience to list them all.
>> A.  Because it is the simplest thing that could possibly work.
> Sure. But so would machine integers be in Erlang, yet the core team
> made a different choice.
> 
> It is also, for most users, the most confusing choice by far.

The word "user" is hopelessly confusing in itself.
For end-users, the representation used in the programming language
should be a matter of complete indifference.
For *programmers*, I don't see "one thing defined in Unicode =
one thing manipulated in the program" as particularly confusing;
anything else (such as grapheme clusters) would confuse *me* a
great deal more.  Such confusion as exists is because Unicode is
confusing.  At least if people deal with the things that are 
defined by Unicode they can appeal to the Unicode standard itself
for help.

Note that
	Extended Combining Character Sequence (in ch3, s D56a)
	Combining Character Sequence (in ch3, s D56)
	Grapheme Cluster (in ch3, ss D57-D60)
	Extended Grapheme Cluster (in ch3, s D61)
are not identical.  Amongst other things, the grapheme base (D58)
of a grapheme cluster (D60) may be a block of syllables, which
would have counted as separate CCSs.  We are warned that grapheme
clusters "do not have linguistic significance" and I for one am
really worried about the licence given in D61 for "tailoring",
so that the question of which code point sequences count as a
grapheme cluster is NOT application-independent.

Then there is the fact that a text in Unicode will usually contain
many code points that do not form part of any grapheme cluster,
which means that "String = sequence of grapheme clusters" is NOT
one of the schemes 'that could possibly work'.
> 
>> B.  Because grapheme clusters aren't any better a fit to the user's
>>   perception of a "character" than code points.
> My experience runs counter to that. Do you have examples of code point
> sequences in which individual code points are a better fit for a
> "user character" than the corresponding grapheme cluster?

I have already provided such an example: é.  In *SOME* cases that
counts as a single "character" for the end-user; in *SOME* cases
it counts as two "characters".  When I write
  "the noun is 'prótest', the verb 'protést'"
I perceive this as writing two conceptually distinct and
physically separate characters in the same column, a vowel
and a stress accent.  In handwriting, I might even write the
word and then go back and write the accent.  If I were searching
for "protest" I would be happy for the program to find either
of the marked occurrences.  
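
Concretely, in Erlang (the code point values come straight from the
standard; the stripping trick is a deliberately crude illustration,
not a real matching algorithm):

	E1 = [16#00E9],       % U+00E9 LATIN SMALL LETTER E WITH ACUTE
	E2 = [$e, 16#0301],   % U+0065 + U+0301 COMBINING ACUTE ACCENT
	%% A search that wants "protest" to match "protést" can just drop
	%% the combining marks from the decomposed form.  (Crude: the
	%% precomposed E1 would need canonical decomposition first.)
	Strip = fun(S) -> [C || C <- S, C < 16#0300 orelse C > 16#036F] end,
	true = Strip("prot" ++ E2 ++ "st") =:= "protest".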

But let me turn this around.

There are code points in Unicode that don't correspond to
user-perceptible characters AT ALL, and therefore cannot
meaningfully be considered as part of "grapheme clusters".

Take the language tag code points in plane 14, for example.
Take them far away and burn them, as a matter of fact.
But there are plenty of other invisible code points with
semantic effects in Chapter 16 of Unicode 6.0.

Note, for example, that if a control or format character
is followed by one or more combining characters, the
combining characters form a "defective combining character
sequence" (chapter 3, D57), which does NOT include that
format or control character.

> 
>> C.  Because Unicode properties are defined for characters, not
>>   grapheme clusters.
> In most cases where this is of interest, either the cluster and 
> the codepoint have a 1:1 mapping or you're only interested in
> the properties of the base character.

How did you measure this?  

> Either way, this should
> not be much of an issue in 99.9% of cases (number pulled out of
> nowhere),

Ah.  "Number pulled out of nowhere."
Then I don't have to take it any more seriously than you did.

> and the base abstraction being grapheme clusters does
> not mean access to code points is prohibited to those who need
> such access.

Conversely: the base abstraction being code points does
not mean that access to grapheme clusters is prohibited to those
who need such access.
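
A grapheme-cluster view layers naturally on top of code points.
Here is a much-simplified sketch (a hypothetical module; real
UAX #29 segmentation has many more rules, and the mark test below
covers only the U+0300..U+036F combining diacritics):

	-module(graphemes).
	-export([clusters/1]).

	%% Treat the basic combining diacritics as cluster extenders.
	is_mark(CP) -> CP >= 16#0300 andalso CP =< 16#036F.

	%% Attach each run of marks to the preceding base character.
	clusters([]) -> [];
	clusters([Base | Rest]) ->
	    {Marks, Tail} = lists:splitwith(fun is_mark/1, Rest),
	    [[Base | Marks] | clusters(Tail)].

so that clusters([$e, 16#0301, $x]) gives [[101,769], [120]].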

> 
>> D.  Because there is a finite and not *hopelessly* large set of
>>   code points, but the set of grapheme clusters is unbounded
> The set of grapheme clusters is no more unbounded than the set of
> code points. Larger, maybe (probably), but code points are not
> arbitrarily combined into clusters.

Where did you get that idea?  The Unicode standard does not place
any limit on the number of combining characters that may be
attached to a base character (or, in the case of a defective
CCS, may be floating around without any base character).
In fact the book explicitly denies the existence of such a bound:

	This rendering behavior for nonspacing marks can be
	generalized to SEQUENCES OF ANY LENGTH, although practical
	considerations usually limit such sequences to no more
	than two or three marks above and/or below a grapheme base.

The practical considerations here refer to rendering; for text
that is not going to be rendered there are no such practical
considerations.  In any case, Unicode 4.0 had about 800
non-spacing marks, and 3 above + 3 below gives 2.6e17 variants
for *each* base character.  I agree that long combining
sequences may be *rare* (mostly occurring as jokes or test cases)
but they are *allowed*.
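
The arithmetic is just ordered choices: ~800 marks in each of six
positions (three above, three below) gives 800^6 sequences:

	1> round(math:pow(800, 6)).
	262144000000000000

i.e. roughly 2.6e17 distinct mark sequences per base character,
before we even allow longer ones.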

>> Yes, but there are *several* notions of equivalence in the Unicode
>> standard, which was my point.  Which of them is "THE" equivalence?
>> (The one with arguably the strongest claim does NOT deal with
>> compatibility mappings.)
> There are *two* equivalences, and one is a subset of the other.

With more than two normal forms, there are more than two notions of
equivalence.

Then too, each revision of Unicode subtly changes both the canonical
and the compatibility equivalences.
> 
> And I covered which equivalence would be most useful for a user non-versed
> in the details of Unicode (and — one should assume — not interested in them):
> compatibility equivalence,

However, that is NOT the notion of equivalence that is preferred
by the standard itself, which is canonical equivalence.
"All processes .. are required to abide by [the] conformance clause"
which defines replacing a character by a compatibility equivalent as
a modification to the interpretation of the text.

This means that a typical Erlang program, shipping data around the
network, MUST NOT do compatibility mapping by default.  That should
ONLY be done where a higher level protocol explicitly allows it.

As "Unicode in XML and other Markup Languages" (revision 8) says,
"It is never advisable to apply compatibility mappings indiscriminately."

A user non-versed in the details of Unicode should not be programming
Unicode applications in Erlang.  We agree 100% that components and
algorithms should be available to let Erlang programmers manipulate
text in ways appropriate for the application at hand.  Where we
disagree is just how much superhuman intelligence needs to be built
into the core.

> so that equivalence of ligatures is handled
> intuitively (as ligatures are typographical details which leaked into the
> standard, I think you would agree that, for most non-typographers, the
> aforementioned "ﬃ" and "ffi" are "the same thing", even though they are
> not canonically equivalent).

Indeed.  But that's really not going to work well enough.
I personally perceive the letter ash (Ææ) as a distinct letter,
not as a ligature.  Unicode calls it "LATIN (CAPITAL|SMALL) LETTER AE",
not a ligature, unlike "LATIN (CAPITAL|SMALL) LIGATURE OE".
In the English-speaking world, I suspect that most people would
perceive ash as an ae ligature.  If you want to handle ligatures
intuitively, you should regard Æ and AE as equivalent, even though
they are NOT compatibility equivalent.  This means that using the
somewhat risky compatibility equivalence is *NOT* sufficient to
handle ligatures intuitively.

Even with characters whose names include "LIGATURE", some (like
IJ) do have compatibility mappings, and some (like OE) do not.
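
The relevant raw data, recast as a toy Erlang function (the
mappings are copied from UnicodeData.txt; the function itself is
purely illustrative):

	compat_decompose(16#0132) -> "IJ";       % CAPITAL LIGATURE IJ: <compat> mapping
	compat_decompose(16#FB01) -> "fi";       % SMALL LIGATURE FI:   <compat> mapping
	compat_decompose(16#00C6) -> [16#00C6];  % CAPITAL LETTER AE:   no mapping at all
	compat_decompose(16#0152) -> [16#0152];  % CAPITAL LIGATURE OE: no mapping at all
	compat_decompose(CP)      -> [CP].       % everything else: unchanged here

So NFKC folds ﬁ to "fi" but leaves both Æ and Œ strictly alone.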

In short, dealing with Unicode is flaming hard,
and using compatibility-equivalent grapheme clusters is NOT going
to make many (if indeed any) of our problems go away.
