Strings (was: Re: are Mnesia tables immutable?)

Wed Jun 28 23:17:02 CEST 2006

dda wrote:

> While common usage sinograms are in the neighbourhood of 2,000,
> depending on the language, my own [somewhat] limited dictionary has
> 50,000 of them, without counting alternative/ancient forms and
> vernacular characters [hangul letters and syllables, kata- and
> hiragana, bopomofo, chu nom chars and many other glyphs used by CJKV
> languages]. The first sinogram in my wife's name is not available in
> the Unicode standard, for instance.

Really?  Not even in the Supplementary Ideographic Plane (Plane 2)? 
Wow.  What's the sinogram?  I'm curious.

> This is why these countries, and
> Japan especially, are very reluctant to even prod Unicode with a
> 10-foot pole – although the UniHan project has a lot of CJKV people.

Well, there is also more than a little politics involved.  Part of it is 
"our characters are more important than their characters" a la "Our 
diplomats get to sit here and their diplomats get to sit there."

Contrary to popular belief, UTF-16 is capable of accessing all of the 
planes: it uses surrogate pairs, two 16-bit code units (4 bytes) that 
together carry 20 bits, to reach every code point from U+10000 to 
U+10FFFF.  Technically, it's just a matter of adding the ideogram to 
the standard.
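
To make the surrogate-pair arithmetic concrete, here is a minimal 
sketch in Python (not Erlang, purely for illustration).  The function 
names to_surrogates and from_surrogates are my own, not from any 
library; the last line checks the result against Python's built-in 
encoder.

```python
def to_surrogates(cp):
    """Split a code point above U+FFFF into a UTF-16 surrogate pair."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("only supplementary code points need a pair")
    v = cp - 0x10000              # 20-bit offset into the upper planes
    high = 0xD800 | (v >> 10)     # high surrogate carries the top 10 bits
    low = 0xDC00 | (v & 0x3FF)    # low surrogate carries the bottom 10 bits
    return high, low


def from_surrogates(high, low):
    """Recombine a surrogate pair into the original code point."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


cp = 0x20000                      # first code point of Plane 2
high, low = to_surrogates(cp)
print(hex(high), hex(low))        # 0xd840 0xdc00
print(chr(cp).encode("utf-16-be") == bytes([0xD8, 0x40, 0xDC, 0x00]))
```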

As the web becomes more and more fluid, "my language gets the first n 
characters so that my text is small" encodings just aren't going to 
work.  I already read forums in which people throw around English, 
Japanese, and Chinese fairly interchangeably in the same thread.  Anyone 
who doesn't use Unicode and UTF-8 gets a fast, harsh beatdown.

And don't get me started about the Japanese penchant for stars, hearts, 
and other dingbat characters (and bright pink--I think the red 
receptors in my eyes are burning out).

> In another programming language I use, string objects are encoding
> aware [and utf8 by default], and string operations are split among
> byte ops and character ops – eg len vs lenB – which is very
> convenient. There's a lot to learn from them.

Yeah, that seems to be the only real way to deal with this.  I default 
to UTF-8 unless I have a really good reason otherwise.  Even 
encoding-ignorant programs (e.g. some databases) avoid mangling UTF-8 
strings, since UTF-8 never produces a zero byte except for NUL itself 
(this is not true for UTF-16, where every ASCII character carries a 
zero byte).
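
A quick way to see the difference, sketched in Python.  The helper 
c_string_view is hypothetical: it just imitates a byte-oriented routine 
that treats the first zero byte as end-of-string, which is my 
assumption about how such programs typically go wrong.

```python
def c_string_view(data):
    """What a naive NUL-terminated-string routine would see."""
    return data.split(b"\x00", 1)[0]


text = "日本語 and English"
utf8 = text.encode("utf-8")
utf16 = text.encode("utf-16-le")

# UTF-8 contains no zero bytes here, so the naive view is the whole string.
print(b"\x00" in utf8, c_string_view(utf8) == utf8)

# UTF-16 stores the ASCII space as 0x20 0x00, so the naive view is cut short.
print(b"\x00" in utf16, len(c_string_view(utf16)), len(utf16))
```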

> I have played for a
> while with a bare-ass utf-8 implementation in Erlang – geared towards
> CJKV. Things that look simple on paper like getting a substring of the
> X rightmost *chars* suddenly become a pita. But it's slowly coming
> around.

A lot of people fail to realize that even if you use something like 
UTF-32 (4 bytes per code point), combining characters *still* make the 
delineation of "character boundaries" non-trivial.
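
A small Python illustration of the problem.  The clusters function is 
a deliberately simplified sketch of my own: it only attaches combining 
marks to the preceding base character, whereas full grapheme 
segmentation (Unicode's UAX #29) covers many more cases.

```python
import unicodedata


def clusters(text):
    """Group each base code point with the combining marks that follow it."""
    out = []
    for ch in text:
        # combining() returns a non-zero class for combining marks
        if out and unicodedata.combining(ch):
            out[-1] += ch
        else:
            out.append(ch)
    return out


s = "e\u0301a"                     # 'e' + COMBINING ACUTE ACCENT, then 'a'
print(len(s))                      # 3 code points
print(len(s.encode("utf-32-le")))  # 12 bytes, fixed width per code point
print(len(clusters(s)))            # 2 user-visible characters
print(s[-2:] == "\u0301a")         # naive "last 2" splits the accent off
```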

-a