String representation in erlang
Richard A. O'Keefe
Thu Sep 15 02:28:15 CEST 2005
David Hopwood <david.nospam.hopwood@REDACTED>
appears to have taken exception to my statement that
> Java claims to support Unicode, but there are already
> quite a lot of Unicode characters which don't fit in 16 bits, so
> many Unicode characters require two Java "chars". Needless to say,
> this stuffs up Java string indexing something wonderful.
Let's clarify things a bit.
First, Unicode is seriously weird, AND they keep fiddling with the
rules, so things like "which characters should I treat as spaces"
have been known to change from one version of Unicode (2.0) to another
(3.0), and I am not just talking about adding new characters, but about
the status of existing characters.
Second, Unicode is complex.
That's because even if we restrict our attention to alphabetic writing
systems used by major languages or language groups, there is plenty of
weirdness about, and Unicode tries to handle them all. There are
scripts where one phoneme is represented by two graphemes which are
thought of as separate graphemes, like "th", "sh", and "ph" in English.
There are scripts where there are symbols for phoneme x and phoneme y,
but although y is pronounced after x, the symbol for y may go before,
above, below, or even inside the symbol for x. There are scripts where
the same logical symbol has a different shape depending on where it is
in the word (English used to do this with "s", using "long s" -- which
looks like an "f" without a cross-bar -- in some contexts and the modern
s elsewhere). And there are scripts where an accented letter is thought
of as a letter plus an accent (like stress accent and diaeresis in
English) and others where visibly the same combination is treated as a
different letter. I have struggled for some time to come up with an
unambiguous way to say "what is a character in Unicode", but I cannot.
However, we CAN say these things about Unicode.
(1) Each version of Unicode defines a fixed set of Unicode characters;
Unicode 4.0.0 lists 96,513 of them (excluding surrogates, private
use codes, and reserved = not-yet-defined codes).
=> There is a well defined (but time dependent) notion of
"a Unicode character." Unicode character numbers fit into 21
bits. There are programming languages and Web API definitions
that call for string indexing to be in terms of Unicode characters.
(2) Some of these are combining characters (also known as "floating
diacriticals") that are meant to be attached to "base characters".
There is no bound in principle on the number of floating
diacriticals that could be attached to a single base character.
This means that the set of Unicode characters, sensu lato, is
infinite. The Unicode book points out sorrowfully that there isn't
any general way to refer to a "user-perceived character", but does
say that there is 'a core concept of "characters that should be
kept together" ... that can be defined in a language-independent
way' and calls this a "grapheme cluster".
=> There is a well defined (but time dependent) notion of
"a Unicode grapheme cluster" containing one or more Unicode characters
that are supposed to be kept together. The only "standard" I am aware
of that calls for some way of accessing or stepping through a string
in units of grapheme clusters is a proposal I wrote for Prolog, but
there may be others.
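As an illustration (my addition, not part of the original post), Java's java.text.BreakIterator -- whose "character" boundaries approximate Unicode grapheme clusters -- shows the difference between counting code points and counting user-perceived characters when a base letter carries a combining mark:

```java
import java.text.BreakIterator;

public class Graphemes {
    public static void main(String[] args) {
        // "e" followed by U+0301 COMBINING ACUTE ACCENT: two Unicode
        // characters, but one user-perceived character (an accented e).
        String s = "e\u0301";
        System.out.println(s.length());   // 2 code points (and 2 UTF-16 units)

        // BreakIterator's "character" boundaries approximate Unicode
        // grapheme clusters: base + combining marks stay together.
        BreakIterator it = BreakIterator.getCharacterInstance();
        it.setText(s);
        int clusters = 0;
        while (it.next() != BreakIterator.DONE) {
            clusters++;
        }
        System.out.println(clusters);     // 1 grapheme cluster
    }
}
```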
(3) There are many encoding forms for Unicode. Amongst others,
UTF-16 replaces each Unicode character outside the 16-bit range
by a pair of special codes called "surrogates", each of which
carries 10 bits of the whole code.
=> There are programming languages (including Java) and Web API
definitions that call for string indexing to be in terms of
UTF-16 encoding units.
=> Since Java's 'char' type represents a *single* UTF-16 encoding
unit, there are something like 34 000 currently defined Unicode
characters which cannot be held in a Java 'char'.
Since Java's String type represents a sequence of UTF-16 encoding
units, there is no way to refer to the ith *Unicode character*
without parsing the string at least up to that point.
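A small Java sketch (my addition) makes the encoding-unit point concrete; it uses the code-point methods that J2SE 5.0 added to java.lang.String. U+1D11E MUSICAL SYMBOL G CLEF is outside the 16-bit range, so UTF-16 stores it as a surrogate pair:

```java
public class Utf16Units {
    public static void main(String[] args) {
        // U+1D11E MUSICAL SYMBOL G CLEF is above U+FFFF, so in UTF-16
        // it becomes the surrogate pair D834 DD1E.
        String s = "\uD834\uDD1E";
        System.out.println(s.length());                       // 2 encoding units
        System.out.println(s.codePointCount(0, s.length()));  // 1 Unicode character
        System.out.println(Character.isHighSurrogate(s.charAt(0))); // true
    }
}
```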
(4) The fact that there are Web standards which call for indexing in
terms of *Unicode characters* and other Web standards which call
for indexing in terms of *UTF-16 encoding units* is clearly
regrettable. If memory serves me, XPath and the DOM are examples
of standards that disagree in this way.
The terminology I am going to use here is anything but standard,
but I hope it will make things clear.
David Hopwood <david.nospam.hopwood@REDACTED> wrote:
> Indexing into UTF-16 works fine (as does indexing into UTF-8).
Indexing into UTF-16 works fine. I never said it didn't.
The problem is that there IS a well defined notion of a Unicode character,
and there IS a well defined notion of indexing into a Unicode string,
BUT indexing into UTF-16 doesn't do that.
I don't see the mismatch between UTF-8 and Unicode as much of a problem.
UTF-8 codes range from 1 to 4 bytes, and not only that, the number
of bytes per character varies within the dear old Latin 1 range (so
ASCII is 1 byte, accented letters are 2 bytes). Programmers learn very
quickly that indexing into byte strings is not at all the same as
indexing into Unicode strings.
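Those byte counts are easy to check (my addition); String.getBytes with the "UTF-8" charset name returns the UTF-8 encoding of a Java string:

```java
import java.io.UnsupportedEncodingException;

public class Utf8Lengths {
    public static void main(String[] args) throws UnsupportedEncodingException {
        // ASCII: 1 byte; Latin-1 accented letters: 2 bytes;
        // most other BMP characters: 3 bytes; beyond the BMP: 4 bytes.
        System.out.println("A".getBytes("UTF-8").length);            // 1
        System.out.println("\u00E9".getBytes("UTF-8").length);       // 2 (e-acute)
        System.out.println("\u20AC".getBytes("UTF-8").length);       // 3 (euro sign)
        System.out.println("\uD834\uDD1E".getBytes("UTF-8").length); // 4 (G clef)
    }
}
```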
However, if you WANT "give me Unicode characters from index s to index f"
but you are GIVEN "UTF-16 encoding units from index s to index f", that
is going to seem to work. In fact there are too many Java textbooks
telling their victims that these are the same thing. Because the new
characters are not widely used yet, it's pretty much certain to be
one's customers who run into weird off-by-a-few problems rather than
one's testers.
> There's only a problem if you assume that it's necessary to
> index by code point instead of by code unit.
In some web standards, it *is*.
> Note that in Unicode, abstract characters are represented by *sequences*
> of code points (because of combining marks).
Section 2.10, near the foot of page 46 in the Unicode 4.0 book,
defines "grapheme cluster".
Section 3.4, definition D3 says that an abstract character
- has no concrete form
- should not be confused with a glyph
- does not necessarily correspond to what a user calls a character
- should not be confused with a grapheme
- may be encoded in Unicode and is then called a Unicode abstract character
- might be representable using combining characters; such abstract
characters do not have any special name
- some abstract characters are not representable at all
There are no operations on nor properties listed for any abstract characters
other than the Unicode abstract characters.
> So a "position" in a Unicode string, regardless of whether it is
> a code point index or a code unit index, does not in general
> correspond to a count of abstract characters from the start of the
> string.
Nobody ever said it did. For one thing, when looking at the same visible
marks, two different people may disagree about what the abstract characters
*are*, let alone how many of them there are. Indexing in terms of
abstract characters is of no practical use to anyone, because there is
no general cross-language cross-script agreement about what they are.
(Example: is Dz one letter or two? To me, two. To my grandfather
Ivan, may he rest in peace, one. In fact there _is_ a Unicode character
Dz=U+01F2, also a dz=U+01F3 and a DZ=U+01F1. Try as I might, I can't
help thinking of them as two characters each, even though I know why
they exist and have tried to learn the language they're used in.)
Yes, it would be nice to be able to index in terms of abstract
characters. But since we can't agree about what that means, there's
no use crying for the moon.
We *could* ask for indexing or string slicing in terms of Unicode
grapheme clusters, because we DO have a well defined notion of what
that is, and in my Prolog library specification I *have* asked for it.
But the result has to be a string.
> And for most purposes, this makes no difference; all you need is
> a position, not a count.
You mistake the issue.
Yes, all we need is a position.
But that position must have SOME definite value.
Let's take XPath 1.0 as a specific example, where XPath 1.0 section 4.2 says:
Function: string substring(string, number, number?)
The substring function returns the substring of the first argument
starting at the position specified in the second argument with
length specified in the third argument. For example,
substring("12345", 2, 3)
returns "234". If the third argument is not specified, it returns
the substring starting at the position specified in the second
argument and continuing to the end of the string. For example,
substring("12345", 2)
returns "2345".
More precisely, each character in the string (see [3.6 Strings])
is considered to have a numeric position: the position of the first
character is 1, the position of the second character is 2, and so on.
Function: number string-length(string)
The string-length function returns the number of characters in the
string (see [3.6 Strings]). ...
So what is a character? XPath 1.0 section 3.6 tells us
Strings consist of a sequence of zero or more characters, where a
character is defined in the XML Recommendation. A single character in
XPath thus corresponds to a single Unicode abstract character
-- note, this means an abstract character which is represented DIRECTLY
-- in the Unicode standard as a SINGLE character, not a longer sequence
-- of characters!
with a single corresponding scalar value; this is not the same thing
as a 16-bit Unicode code value: the Unicode coded character representation
-- note, they really mean UTF-16, what they next say is not true of UCS4.
for an abstract character with Unicode scalar value greater that [sic.]
U+FFFF is a pair of 16-bit Unicode code values (a surrogate pair).
In many programming languages, a string is represented by a sequence
of 16-bit Unicode values; implementations of XPath in such languages
must take care to ensure that a surrogate pair is correctly treated as
a single XPath character.
*THIS* is what I was referring to when I said that Java string indexing
was stuffed up. If you want to interpret
substring(., 27, 42)
in XPath, simply doing something like s.substring(26, 26+42) in Java
will give you the right answer often enough to trick you into expecting it
to work all the time, but it WON'T. In fact there is NO method in the
java.lang.String class that does this job. (The Java API is so huge these
days that doubtless there is a suitable method *somewhere*, but the on-line
docs don't point you to it.)
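For what it's worth, J2SE 5.0 did add code-point methods to java.lang.String (codePointCount, offsetByCodePoints), and with them an XPath-1.0-style substring can be put together by hand. The sketch below is mine, not the original poster's, and it deliberately ignores the rounding and NaN rules of the real XPath function:

```java
public class XPathSubstring {
    // Substring with XPath 1.0 semantics: 1-based positions counted
    // in Unicode characters (code points), not UTF-16 units.  The
    // rounding and NaN rules of the real XPath function are omitted.
    static String xpathSubstring(String s, int start, int length) {
        int n = s.codePointCount(0, s.length());
        int from = Math.max(start, 1);            // clamp to the string
        int to = Math.min(start + length, n + 1);
        if (to <= from) {
            return "";
        }
        // Translate code point positions to UTF-16 unit offsets.
        int a = s.offsetByCodePoints(0, from - 1);
        int b = s.offsetByCodePoints(0, to - 1);
        return s.substring(a, b);
    }

    public static void main(String[] args) {
        // "a", then U+1D11E (a surrogate pair), then "b", "c".
        String s = "a\uD834\uDD1Ebc";
        // Characters 2..3 are the G clef and "b".
        System.out.println(xpathSubstring(s, 2, 2).equals("\uD834\uDD1Eb")); // true
        // Naive s.substring(1, 3) would split the surrogate pair instead.
    }
}
```

Note that the position arithmetic is O(n) in the string length, exactly as one would expect when UTF-16 units and Unicode characters disagree.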
> [and furthermore, non-surrogate code points are only used for
> characters that aren't likely to have any special interpretation in a
I'm not sure what the relevance of this is.
My point is this. People are using Erlang to process XML.
We have Xmerl, for example, which has some level of "xpath" support.
For XML, we have XPath 1.0, in which string lengths and substrings
are specified in terms of single Unicode characters, explicitly
NOT 16-bit units and NOT grapheme clusters or any other kind of
"abstract character" than ones directly encoded as single Unicode characters.
We also have XPointer (TR/xptr-xpointer), an extension of XPath, which
defines lengths and ranges the same way. We also have XSLT, which again
builds on XPath, and defines string lengths and substrings the same way.
The DOM specifies 16-bit units, but the DOM does not specify stuff you
might actually find in an XML document. Rather, the DOM is a
rather clumsy kind of interface for imperative programs; the model
relies on side effects to mutable data structures, making it unsuitable
for Erlang.
So the Web standards that *could* fit into an Erlang way of doing things
call for strings to be treated AS IF they were sequences of single Unicode
characters. For this purpose, a list of 21-bit integers is quite
convenient, while a packed byte representation is less convenient.
Binaries could be *made* convenient for processing Unicode strings in
an XPath-compatible way (I don't expect indexing into Unicode data to
be O(1); I said "convenient", not "efficient") and this would require
extra library support.
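As a rough analogue (my addition) of the "list of 21-bit integers" representation recommended above for Erlang, here is how a UTF-16 string decodes into a list of code points in Java:

```java
import java.util.ArrayList;
import java.util.List;

public class CodePointList {
    // Decode a UTF-16 string into the "list of 21-bit integers"
    // view: one int per Unicode character.
    static List<Integer> toCodePoints(String s) {
        List<Integer> out = new ArrayList<Integer>();
        int i = 0;
        while (i < s.length()) {
            int cp = s.codePointAt(i);      // joins surrogate pairs
            out.add(cp);
            i += Character.charCount(cp);   // 1 or 2 UTF-16 units
        }
        return out;
    }

    public static void main(String[] args) {
        // "a", U+1D11E MUSICAL SYMBOL G CLEF, "b"
        System.out.println(toCodePoints("a\uD834\uDD1Eb")); // [97, 119070, 98]
    }
}
```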