Strings

Thu Jun 29 21:19:11 CEST 2006

Richard A. O'Keefe wrote:
> I wrote:
> 	>     + CJK text requires 2n+1 bytes (one byte says "switch to 16-bit"),
> 	>       so SCSU is a good representation for CJK.
> 
> I meant, of course, "CJK text using BMP characters".
> 
> dda <headspin@REDACTED> wrote at length about the fact that there are
> more CJK characters than the ones in the BMP.  That isn't actually
> relevant, because neither the variable-byte encoding that I outlined
> nor SCSU has any 16-bit limit.  Both of them can handle the *full*
> 20-and-fraction bits of Unicode.
> 
> I _did_ think that people would actually _look_ at SCSU before attacking
> it.  How naive of me.
> 
> By the way, here is just one example of why encoding-tagged strings
> can be a pain.  Suppose S1 is Greek text with preferred encoding
> MacGreek and S2 is Hebrew text with preferred encoding MacHebrew.
> What is the preferred encoding of S1 ++ S2 and why?

Note that an implementation of strings using SCSU would need to guarantee
that the final state of a string encoding is the same as the initial state
('single byte mode', *all* windows set back to the defaults). Otherwise, ++
cannot be implemented by direct concatenation; only by reencoding (which
takes O(length(S1) + length(S2)) time, since we need to both reconstruct
the state at the end of S1, and reencode S2).
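
To make the concatenation problem concrete, here is a minimal Python sketch.
It is *not* SCSU: it is a hypothetical toy encoding with a single dynamic
window, invented for illustration. The tag byte ESC, the 128-code-point
window size and the names encode/decode are all my assumptions. It only
shows that byte-level concatenation breaks unless each encoded string ends
in the initial state.

```python
ESC = 0x1B            # hypothetical "set window" tag byte (toy encoding only)
DEFAULT_BASE = 0x80   # initial window covers U+0080..U+00FF


def _set_window(base):
    # Tag byte followed by the window index (base / 128) as two bytes.
    index = base >> 7
    return bytes([ESC, index >> 8, index & 0xFF])


def encode(text, reset=False):
    """Encode text; if reset is true, end in the initial window state."""
    out = bytearray()
    base = DEFAULT_BASE
    for ch in text:
        cp = ord(ch)
        if cp == ESC:
            raise ValueError("the toy encoding reserves U+001B as its tag")
        if cp < 0x80:
            out.append(cp)                # ASCII is stored as-is
            continue
        if not (base <= cp < base + 0x80):
            base = cp & ~0x7F             # move the window to cover cp
            out += _set_window(base)
        out.append(0x80 + (cp - base))    # one byte, relative to the window
    if reset and base != DEFAULT_BASE:
        out += _set_window(DEFAULT_BASE)  # restore the initial state
    return bytes(out)


def decode(data):
    chars = []
    base = DEFAULT_BASE
    i = 0
    while i < len(data):
        b = data[i]
        if b == ESC:
            base = ((data[i + 1] << 8) | data[i + 2]) << 7
            i += 3
        elif b < 0x80:
            chars.append(chr(b))
            i += 1
        else:
            chars.append(chr(base + (b - 0x80)))
            i += 1
    return "".join(chars)


s1 = "αβγ"   # needs a window switch to U+0380
s2 = "é"     # lies in the default window

# Without a reset, the state left behind by s1 corrupts s2.
naive = decode(encode(s1) + encode(s2))
# With a reset at the end of s1, direct concatenation is safe.
safe = decode(encode(s1, reset=True) + encode(s2))
```

Here naive differs from s1 + s2 (the final byte is interpreted relative to
the Greek window), whereas safe equals s1 + s2, at the cost of three extra
bytes at the end of s1.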

Because of this, an encoding scheme without dynamic windows, or with a
'reset' code, might be preferable to SCSU.
BOCU-1 <http://www.unicode.org/reports/tr40/tr40-1.html> shows that it is
possible to get essentially the same compression ratios without any need
for dynamic windows (see the table in section 6). BOCU-1 is also deterministic
and codepoint-ordered, although only if resets are not used.

(Note, however, that BOCU-1 is patented. Although "IBM would like to offer
a royalty free license to this patent upon request to implementers of a fully
compliant version of BOCU-1", that offer might not be sufficient to satisfy
some open-source licensing policies.)

Given that strings are immutable, use of a stateful, compressed encoding
is more practical in Erlang than it would be in a language in which strings
are mutable. OTOH, UTF-8 would be simpler, and I think that the concerns
about encoding efficiency of UTF-8 for CJK languages have been somewhat
overstated in this thread.

For example, here are the sizes of some translations of the "What is Unicode?"
page at <http://www.unicode.org/unicode/standard/WhatIsUnicode.html> in UTF-8
(just the translated text, with the HTML header and trailer removed):

  Simplified Chinese:  3463 bytes
  English:             4422 bytes
  Korean:              5023 bytes
  Japanese:            5345 bytes

Chinese and Korean have a higher information density than English (and other
alphabetic scripts) to start with; in the case of Chinese, this more than
compensates for any inefficiency of UTF-8 when encoding translated texts.
This example is not atypical.
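
For a quick sense of scale, the sizes above can be expressed relative to the
English text. This is only arithmetic on the figures already quoted; the
variable names are mine.

```python
# UTF-8 sizes in bytes, copied from the list above.
sizes = {
    "Simplified Chinese": 3463,
    "English": 4422,
    "Korean": 5023,
    "Japanese": 5345,
}

# Size of each translation as a fraction of the English one.
relative = {lang: size / sizes["English"] for lang, size in sizes.items()}

for lang, ratio in relative.items():
    print(f"{lang:20s} {ratio:.2f}")
```

The Chinese translation comes out at roughly 0.78 of the English size, and
the Korean and Japanese ones at roughly 1.14 and 1.21.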

IOW, it is more accurate to say that UTF-16 and other 16-bit encodings are
particularly efficient for these languages, rather than UTF-8 being inefficient.
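
The per-character costs behind this can be checked with a few lines of
Python. The two sample strings are my own picks, not taken from the page
above; the Chinese one consists of three BMP characters.

```python
def encoded_sizes(text):
    """Return (UTF-8 bytes, UTF-16 bytes) for text, without a BOM."""
    return len(text.encode("utf-8")), len(text.encode("utf-16-le"))


# Illustrative samples only.
ascii_sample = "Unicode"   # 7 ASCII characters
cjk_sample = "统一码"       # 3 BMP CJK characters

ascii_utf8, ascii_utf16 = encoded_sizes(ascii_sample)
cjk_utf8, cjk_utf16 = encoded_sizes(cjk_sample)
```

ASCII text costs 1 byte per character in UTF-8 and 2 in UTF-16, whereas BMP
CJK text costs 3 bytes per character in UTF-8 and 2 in UTF-16. So the worst
case for UTF-8 relative to UTF-16 is a factor of 1.5 on pure CJK text, before
taking into account any ASCII markup, digits or punctuation mixed in.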

-- 
David Hopwood <david.nospam.hopwood@REDACTED>