Strings (was: Re: are Mnesia tables immutable?)

Andrew Lentvorski bsder@REDACTED
Thu Jun 29 08:42:05 CEST 2006


Richard A. O'Keefe wrote:

> I _did_ think that people would actually _look_ at SCSU before attacking
> it.  How naive of me.

I did.  I just don't see much advantage to the added annoyance.

http://www.unicode.org/reports/tr6

"Switch to Unicode mode for uncompressible text.
SCSU does not provide for window definitions for the main Han and Hangul 
character ranges, which are too large for effective use of dynamic 
windows. The Unicode mode should also be used for large scripts using 
supplementary code points."

So, the only language this benefits is basically Japanese.  And, even 
then, the true benefit is suspect.

If you look at the Japanese example, the difference is 178 bytes (SCSU) 
vs. 232 bytes (UTF-16), a saving of only about 23%.  That's not a great 
compression ratio given the highly regular code points, and the sample 
text is heavily biased toward Kana (compressible) rather than Kanji 
(uncompressible).

This is the standard problem with trying to "compress" text. 
Compressing text at the character level almost always loses.  Even a 
crummy LZW would pack the repeated 0x30 high byte into a single bit.
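
A rough way to check the point about the 0x30 high byte, as a sketch rather than a benchmark.  Python's standard library has no LZW, so this uses zlib (DEFLATE) as a stand-in general-purpose compressor.  KANA_SAMPLE and both function names are made up for illustration; they are not from TR6.

```python
import zlib

# Hypothetical sample: any Kana-only string works, since Hiragana and
# Katakana both live in U+3040..U+30FF.
KANA_SAMPLE = "これはかなだけのぶんしょうです" * 8


def high_byte_share(text, high=0x30):
    """Fraction of UTF-16 code units whose high byte equals `high`."""
    raw = text.encode("utf-16-be")   # big-endian: high byte first, no BOM
    highs = raw[0::2]                # every other byte is a high byte
    return highs.count(high) / len(highs)


def utf16_vs_deflate(text):
    """Return (raw UTF-16 size, zlib-compressed size) in bytes."""
    raw = text.encode("utf-16-be")
    return len(raw), len(zlib.compress(raw, 9))


if __name__ == "__main__":
    raw_len, packed_len = utf16_vs_deflate(KANA_SAMPLE)
    print("high bytes equal to 0x30:", high_byte_share(KANA_SAMPLE))
    print("UTF-16 bytes:", raw_len, "DEFLATE bytes:", packed_len)
```

The sample repeats one sentence, which flatters the compressor, so the sizes it prints say nothing about real text.  The part that carries over is the high-byte share: for Kana it is 1.0, for Kanji it is spread over many values.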

-a


