Strings (was: Re: are Mnesia tables immutable?)
Wed Jun 28 23:17:02 CEST 2006
> While common usage sinograms are in the neighbourhood of 2,000,
> depending on the language, my own [somewhat] limited dictionary has
> 50,000 of them, without counting alternative/ancient forms and
> vernacular characters [hangul letters and syllables, kata- and
> hiragana, bopomofo, chu nom chars and many other glyphs used by CJKV
> languages]. The first sinogram in my wife's name is not available in
> the Unicode standard, for instance.
Really? Not even in the Supplementary Ideographic Plane (Plane 2)?
Wow. What's the sinogram? I'm curious.
> This is why these countries, and
> Japan especially, are very reluctant to even prod Unicode with a
> 10-foot pole – although the UniHan project has a lot of CJKV people.
Well, there is also more than a little politics involved. Part of it is
"our characters are more important than their characters" a la "Our
diplomats get to sit here and their diplomats get to sit there."
Contrary to popular belief, UTF-16 is capable of accessing all of the
planes: it uses surrogate pairs (two reserved 16-bit code units combined
into a 4-byte sequence) to reach the 20-bit supplementary code points up
to U+10FFFF.
Technically, it's just a matter of adding the ideogram to the standard.
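To make the surrogate-pair mechanism concrete, here is a small sketch (in Python, since the Erlang code under discussion isn't shown) of how a Plane 2 code point is split into two 16-bit units; U+20BB7 is an arbitrary CJK Extension B ideograph chosen for illustration:

```python
# How UTF-16 reaches the supplementary planes: surrogate pairs.
# U+20BB7 is a CJK Unified Ideographs Extension B character (Plane 2).
cp = 0x20BB7
assert cp > 0xFFFF                  # beyond the Basic Multilingual Plane
v = cp - 0x10000                    # 20-bit value to be split
high = 0xD800 + (v >> 10)           # high (lead) surrogate: top 10 bits
low = 0xDC00 + (v & 0x3FF)          # low (trail) surrogate: bottom 10 bits
print(hex(high), hex(low))          # 0xd842 0xdfb7
# Python's encoder produces the same pair (big-endian UTF-16):
assert chr(cp).encode('utf-16-be') == bytes([0xD8, 0x42, 0xDF, 0xB7])
```

Any of the 2^20 supplementary code points can be addressed this way, which is why "it's not in Unicode" is a registration problem, not an encoding problem.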
As the web becomes more and more fluid, "my language gets the first n
characters so that my text is small" encodings just aren't going to
work. I already read forums in which people throw around English,
Japanese, and Chinese fairly interchangeably in the same thread. Anyone
who doesn't use Unicode and UTF-8 gets a fast, harsh beatdown.
And don't get me started about the Japanese penchant for stars, hearts,
and other dingbats characters (and bright pink--I think the red
receptors in my eyes are burning out).
> In another programming language I use, string objects are encoding
> aware [and utf8 by default], and string operations are split among
> byte ops and character ops – e.g. len vs lenB – which is very
> convenient. There's a lot to learn from them.
Yeah, that seems to be the only real way to deal with this. I default
to UTF-8 unless I have a really good reason otherwise. Even encoding
ignorant programs (e.g. some databases) avoid molesting UTF-8 strings
(this is not true for UTF-16--too many bytes of 0's).
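The zero-byte problem is easy to see: any ASCII character encoded in UTF-16 carries a 0x00 byte, which C-style string handling reads as a terminator, while UTF-8 only ever produces a zero byte for an actual NUL. A quick Python illustration:

```python
# Why encoding-ignorant code survives UTF-8 but not UTF-16.
s = "résumé"
assert 0 not in s.encode('utf-8')     # no zero bytes: safe to pass through
                                      # code that treats 0x00 as end-of-string
assert 0 in s.encode('utf-16-le')     # 'r' becomes 0x72 0x00 -- a naive
                                      # strlen() stops after one byte
```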
> I have played for a
> while with a bare-ass utf-8 implementation in Erlang – geared towards
> CJKV. Things that look simple on paper like getting a substring of the
> X rightmost *chars* suddenly become a pita. But it's slowly coming
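The rightmost-X-chars problem is tractable because UTF-8 is self-synchronizing: every continuation byte matches the bit pattern 10xxxxxx, so you can scan backwards to find character boundaries. A minimal sketch (in Python rather than Erlang, and `last_n_chars_utf8` is a hypothetical name, not from the implementation being discussed):

```python
def last_n_chars_utf8(buf: bytes, n: int) -> bytes:
    """Return the bytes of the last n code points of a UTF-8 buffer
    by scanning backwards past continuation bytes (0b10xxxxxx)."""
    i = len(buf)
    for _ in range(n):
        i -= 1
        while i > 0 and (buf[i] & 0xC0) == 0x80:  # continuation byte
            i -= 1                                 # keep backing up to the
                                                   # lead byte of this char
    return buf[i:]

b = "abc日本語".encode('utf-8')
assert last_n_chars_utf8(b, 2).decode('utf-8') == "本語"
```

Still a pain compared to a fixed-width encoding, but at least it never requires rescanning from the front of the string.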
A lot of people fail to realize that even if you use something like
UTF-32 (4 bytes per code point), combining characters *still* make the
delineation of "character boundaries" non-trivial.
More information about the erlang-questions mailing list