Strings (was: Re: are Mnesia tables immutable?)

Thu Jun 29 05:14:12 CEST 2006

Andrew Lentvorski wrote:
> Romain Lenglet wrote:
> > The most efficient is still most often to use an official
> > 8-bit encoding for strings. E.g. for Thai, TIS-620 is the
> > most efficient, for Japanese, ISO-2022 (or others) is the
> > most efficient, etc.
>
> Really?  ISO-2022?  How does that beat UTF-16?

ISO-2022 allows one to *combine* encodings by specifying "switch" 
escape sequences. One switch specifies that the following bytes 
are 7-bit ASCII, other switches specify that the following bytes 
are in some JIS-* encoding, etc.

That way, every sequence of characters can be encoded using the 
best encoding for that sequence of characters, preceded by the 
switch escape sequence for that encoding.

I admit that I don't really know if ISO-2022-JP is actually more 
efficient than UTF-16 (I have not found figures), but such an 
adaptive encoding / compression is potentially more efficient 
than UTF-16, IMHO.

> IIRC, UTF-16 manages to account for all of the Joyou Kanji as
> well as kana in two bytes.   Given that the Kana account for
> close to 90 entries off the top, that only leaves the upper
> 128 bytes for Kanji.
>
> That isn't much.
>
> In addition, Japanese mixes Kanji, Kana, and Roman characters
> fairly fluidly on the web.
>
> I find it very difficult to believe that any "byte"-based
> encoding beats UTF-16 by very much for any of the languages
> which use Kanji.

I can guarantee you that nobody uses UTF-* here in Japan. 
Although all the email software I have seen supports UTF-*, 
everybody uses either ISO-2022-JP or the JIS encodings directly. 
The reason they invoke is efficiency. Even well-educated people 
who understand the advantages of using a common encoding such as 
UTF-* say so. Or maybe that's a kind of superstition... ;-)

-- 
Romain LENGLET