Strings (was: Re: are Mnesia tables immutable?)
Romain Lenglet
rlenglet@REDACTED
Thu Jun 29 05:14:12 CEST 2006
Andrew Lentvorski wrote:
> Romain Lenglet wrote:
> > The most efficient is still most often to use an official
> > 8-bit encoding for strings. E.g. for Thai, TIS-620 is the
> > most efficient, for Japanese, ISO-2022 (or others) is the
> > most efficient, etc.
>
> Really? ISO-2022? How does that beat UTF-16?
ISO-2022 lets you *combine* encodings by inserting "switch"
escape sequences. One sequence specifies that the following bytes
are 7-bit ASCII, other sequences specify that the following bytes
are in some JIS-* encoding, etc.
That way, every run of characters can be encoded using the
best encoding for that run, preceded by the escape sequence
that selects that encoding.
I admit that I don't really know if ISO-2022-JP is actually more
efficient than UTF-16 (I have not found figures), but such an
adaptive encoding / compression is potentially more efficient
than UTF-16, IMHO.
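
For anyone who wants figures, here is a minimal sketch in Python
(not from the original discussion) that counts the bytes produced
by a few encodings for the same text. The sample strings and the
names SAMPLES, ENCODINGS and encoded_sizes are made up for the
illustration; the codec names are the ones shipped with Python's
standard library. The result depends heavily on how often the text
switches between ASCII and JIS, so this only measures the samples
given, not Japanese text in general.

```python
# Count how many bytes each encoding needs for the same text.
# SAMPLES and ENCODINGS are illustrative choices, not a benchmark.
SAMPLES = {
    "ascii": "Erlang strings",
    "japanese": "日本語のテキスト",
    "mixed": "Erlang の文字列",
}

# "utf-16-le" is used so that no byte order mark is counted.
ENCODINGS = ("iso2022_jp", "shift_jis", "euc_jp", "utf-16-le", "utf-8")


def encoded_sizes(text, encodings=ENCODINGS):
    """Return {encoding name: number of bytes} for the given text."""
    return {name: len(text.encode(name)) for name in encodings}


if __name__ == "__main__":
    for label, text in SAMPLES.items():
        print(label, encoded_sizes(text))
    # Show the "switch" escape sequences around a run of kanji:
    # ESC $ B selects JIS X 0208, ESC ( B switches back to ASCII.
    print("日本語".encode("iso2022_jp"))
```

Each switch costs three bytes in ISO-2022-JP, so short runs of
kanji surrounded by ASCII pay more overhead than long runs do.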
> IIRC, UTF-16 manages to account for all of the Joyou Kanji as
> well as kana in two bytes. Given that the Kana account for
> close to 90 entries off the top, that only leaves the upper
> 128 bytes for Kanji.
>
> That isn't much.
>
> In addition, Japanese mixes Kanji, Kana, and Roman characters
> fairly fluidly on the web.
>
> I find it very difficult to believe that any "byte"-based
> encoding beats UTF-16 by very much for any of the languages
> which use Kanji.
I can guarantee you that nobody uses UTF-* here in Japan.
Although all the email software I have seen supports UTF-*,
everybody uses either ISO-2022-JP or the JIS encodings directly.
The reason they give is efficiency, and that includes
well-educated people who understand the advantages of using a
common encoding such as UTF-*. Or maybe that's a kind of
superstition... ;-)
--
Romain LENGLET
More information about the erlang-questions
mailing list