Strings (was: Re: are Mnesia tables immutable?)

dda headspin@REDACTED
Wed Jun 28 13:43:32 CEST 2006


On 6/28/06, Richard A. O'Keefe <ok@REDACTED> wrote:
[snip]
>     + CJK text requires 2n+1 bytes (one byte says "switch to 16-bit"),
>       so SCSU is a good representation for CJK.

Nope. 16 bits gives you 65,536 chars tops, and this is not going to cut
it. While commonly used sinograms number in the neighbourhood of 2,000,
depending on the language, my own [somewhat] limited dictionary has
50,000 of them, without counting alternative/ancient forms and
vernacular characters [hangul letters and syllables, kata- and
hiragana, bopomofo, chu nom chars and many other glyphs used by CJKV
languages]. The first sinogram in my wife's name is not available in
the Unicode standard, for instance. This is why these countries, and
Japan especially, are very reluctant to even prod Unicode with a
10-foot pole – although the UniHan project has a lot of CJKV people.
See the Ruby Talk list and Matz's arguments in favour of m17n for
instance... [Although I think it is kind of selfish of Matz to try
and impose a complex and convoluted encoding-aware String
implementation just because a few Ruby users – including himself –
don't want Unicode. But I digress].
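
To put numbers on the 16-bit ceiling mentioned above, here is a minimal
sketch in Python (not Erlang, purely for illustration). The helper name
encoded_sizes is made up for this example. It compares a sinogram that fits
in 16 bits with one from CJK Extension B, which sits above U+FFFF and so
cannot be stored in a single 16-bit unit.

```python
# Hypothetical helper, for illustration only.
def encoded_sizes(ch):
    """Return (code point, UTF-8 byte count, UTF-16 code unit count) for one character."""
    utf8_bytes = len(ch.encode("utf-8"))
    # utf-16-be avoids the byte-order mark; each code unit is 2 bytes.
    utf16_units = len(ch.encode("utf-16-be")) // 2
    return (ord(ch), utf8_bytes, utf16_units)


inside_16_bits = "\u4e2d"        # a common sinogram, below U+FFFF
beyond_16_bits = "\U00020000"    # first code point of CJK Extension B

for ch in (inside_16_bits, beyond_16_bits):
    code_point, utf8_bytes, utf16_units = encoded_sizes(ch)
    print(f"U+{code_point:04X}: {utf8_bytes} UTF-8 bytes, {utf16_units} UTF-16 code units")
```

Any character above U+FFFF needs two 16-bit units in UTF-16, so a "2 bytes
per character" assumption stops holding once you leave the common set.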

The Mojikyo project http://www.mojikyo.org/html/abroad/index_e.html
has around 80,000 characters – 70,000+ of which are sinograms –
although it is based around different fonts rather than an X-bit (with
X>16) encoding scheme, which seems to me a tad deluded, but anyway...
Same goes for the Korean word processor HWP, which internally has more
characters than you could possibly need, and uses separate fonts to
achieve that.

In another programming language I use, string objects are encoding
aware [and UTF-8 by default], and string operations are split among
byte ops and character ops – e.g. len vs lenB – which is very
convenient. There's a lot to learn from them. I have played for a
while with a bare-ass UTF-8 implementation in Erlang – geared towards
CJKV. Things that look simple on paper like getting a substring of the
X rightmost *chars* suddenly become a pita. But it's slowly coming
around.
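
To show why the byte/character split matters, here is a rough sketch in
Python working on raw UTF-8 bytes. It is neither the Erlang code mentioned
above nor the other language's actual API. The names len_bytes, len_chars
and right_chars are invented for this example, and the input is assumed to
be valid UTF-8. The trick is that continuation bytes always look like
10xxxxxx, so you can count or skip characters without decoding them.

```python
def is_continuation(byte):
    """True for UTF-8 continuation bytes (bit pattern 10xxxxxx)."""
    return (byte & 0xC0) == 0x80


def len_bytes(buf):
    """Byte-level length, in the spirit of a lenB-style operation."""
    return len(buf)


def len_chars(buf):
    """Character-level length: count every byte that starts a character."""
    return sum(1 for b in buf if not is_continuation(b))


def right_chars(buf, n):
    """Return the n rightmost characters of buf as UTF-8 bytes.

    Assumes buf is valid UTF-8. Walks backwards from the end, stepping
    over continuation bytes until it reaches each character's lead byte.
    """
    if n <= 0:
        return b""
    i = len(buf)
    remaining = n
    while i > 0 and remaining > 0:
        i -= 1
        while i > 0 and is_continuation(buf[i]):
            i -= 1
        remaining -= 1
    return buf[i:]


text = "Erlang 한글 漢字".encode("utf-8")
print(len_bytes(text), len_chars(text))
print(right_chars(text, 2).decode("utf-8"))
```

Slicing the last two *bytes* instead would cut a sinogram in half, which is
exactly the kind of thing that turns a one-liner into a pita.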

-- 
Didier


