Strings (was: Re: are Mnesia tables immutable?)

Tue Jun 27 18:19:12 CEST 2006

Richard A. O'Keefe wrote:
> Ah, Unicode.
> I've struggled to provide Unicode support in one language and
> written a draft library proposal for another.
> It's quite frighteningly hard.  (Interesting point:  the C99 standard
> does _not_ provide enough information about the current locale to
> implement POSIX regular expressions, and POSIX itself doesn't provide
> access to the information you need either.  Guess how come I found that
> out?)  I'm not even sure what "support" for Unicode language tags (the
> absence of which was one of the core design features originally, so it
> defies belief that they added them) would begin to look like.

Political compromise. They have essentially been deprecated from birth --
just ignore them. (There are a lot of things in Unicode that implementors
should just ignore; unfortunately this is not always apparent from reading
the spec.)

> Simply telling when two sequences of Unicode codepoints represent the
> "same" string is the very reverse of trivial.   Not that there is a
> single definition.  Unicode has four "normal forms", so there are
> five different definitions of "same", counting the trivial one,

No, there are three: equality, canonical equivalence, and compatibility
equivalence. Comparing NFC forms is the same as comparing NFD, and
comparing NFKC is the same as comparing NFKD.
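
To make the three notions concrete, here is a minimal sketch using Python's
standard unicodedata module. The helper name same() and the sample strings
are mine, chosen for illustration only; nothing here is specific to Erlang.

```python
import unicodedata

def same(a, b, form=None):
    """Compare two strings, optionally after normalizing both to `form`.

    form=None is plain code point equality; "NFC"/"NFD" test canonical
    equivalence; "NFKC"/"NFKD" test compatibility equivalence.
    """
    if form is None:
        return a == b
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)

composed = "\u00e9"      # e with acute as a single code point
decomposed = "e\u0301"   # "e" followed by COMBINING ACUTE ACCENT
ligature = "\ufb01"      # LATIN SMALL LIGATURE FI

# 1. Equality: the code point sequences differ.
print(same(composed, decomposed))          # False

# 2. Canonical equivalence: NFC and NFD give the same answer.
print(same(composed, decomposed, "NFC"))   # True
print(same(composed, decomposed, "NFD"))   # True

# 3. Compatibility equivalence: only the K forms unify the ligature with "fi",
#    and NFKC and NFKD agree with each other.
print(same(ligature, "fi", "NFC"))         # False
print(same(ligature, "fi", "NFKC"))        # True
print(same(ligature, "fi", "NFKD"))        # True
```

Which form you normalize to within each pair does not change the result of
the comparison, which is why there are only three distinct definitions.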

Compatibility equivalence is another thing that you should just ignore --
use canonical equivalence.

> and _not_ counting differences between versions of Unicode (I see Unicode 5
> is on the horizon already).  And there isn't even a normal form with
> the property that is_nf(Xs) and is_nf(Ys) => is_nf(Xs ++ Ys).

Yes, that's irritating, as is the fact that case folding does not preserve
normalization.
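
Both problems are easy to reproduce. The following is a sketch in Python
using the standard unicodedata module; is_nf() here is my own stand-in for
the predicate in the quoted text, and the sample strings are just the
smallest examples I could think of.

```python
import unicodedata

def is_nf(form, s):
    """True if s is already in the given normalization form."""
    return unicodedata.normalize(form, s) == s

# NFC is not closed under concatenation: "e" and a lone combining acute
# are each in NFC, but the concatenation composes to U+00E9.
xs, ys = "e", "\u0301"
nfc_closed = is_nf("NFC", xs + ys)

# NFD is not closed either: appending a dot below (combining class 220)
# after an acute (combining class 230) violates canonical ordering.
us, vs = "a\u0301", "\u0323"
nfd_closed = is_nf("NFD", us + vs)

# Case folding does not preserve normalization: U+01F0 (j with caron)
# is in NFC, but its full case folding is "j" + COMBINING CARON, which
# NFC would compose back to U+01F0.
j_caron = "\u01f0"
folded = j_caron.casefold()

print(is_nf("NFC", xs), is_nf("NFC", ys), nfc_closed)   # True True False
print(is_nf("NFD", us), is_nf("NFD", vs), nfd_closed)   # True True False
print(is_nf("NFC", j_caron), is_nf("NFC", folded))      # True False
```

In practice this means you have to renormalize after concatenating or
case folding, rather than assuming the result is still normalized.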

-- 
David Hopwood <david.nospam.hopwood@REDACTED>