Strings (was: Re: are Mnesia tables immutable?)

Richard A. O'Keefe ok@REDACTED
Thu Jun 29 07:43:11 CEST 2006


I wrote:
	>     + CJK text requires 2n+1 bytes (one byte says "switch to 16-bit"),
	>       so SCSU is a good representation for CJK.

I meant, of course, "CJK text using BMP characters".
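
To make the 2n+1 figure concrete, here is a minimal sketch in Python. It is not a real SCSU encoder: it assumes the input consists only of BMP CJK Unified Ideographs (U+4E00..U+9FFF), for which a single SCU tag byte (0x0F) followed by big-endian 16-bit units is a valid SCSU stream. The function name scsu_bmp_cjk is made up for this illustration.

```python
SCU = 0x0F  # SCSU tag byte: switch to Unicode (16-bit) mode


def scsu_bmp_cjk(text):
    """Encode BMP CJK ideographs as one SCU tag plus UTF-16BE units.

    Illustration only: a real SCSU encoder handles windows, quoting,
    and every other script. This sketch rejects anything else.
    """
    for ch in text:
        if not 0x4E00 <= ord(ch) <= 0x9FFF:
            raise ValueError("outside the range this sketch handles: %r" % ch)
    return bytes([SCU]) + text.encode("utf-16-be")


sample = "\u4e2d\u6587\u5b57"          # three BMP ideographs
encoded = scsu_bmp_cjk(sample)
print(len(encoded))                    # 2n+1 bytes for n characters
print(len(sample.encode("utf-8")))     # UTF-8 needs 3 bytes for each of these
```

For three such characters the sketch produces 7 bytes, against 9 for UTF-8.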

dda <headspin@REDACTED> wrote at length about the fact that there are
more CJK characters than the ones in the BMP.  That isn't actually
relevant, because neither the variable-byte encoding that I outlined
nor SCSU has any 16-bit limit.  Both of them can handle the *full*
20-and-fraction bits of Unicode.
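
A quick way to check both claims, again sketched in Python. The "20-and-fraction" figure comes from the size of the code space (U+0000..U+10FFFF). SCSU's 16-bit mode carries UTF-16 code units, so a CJK character outside the BMP travels as a surrogate pair rather than being excluded. The helper name utf16_units is hypothetical.

```python
import math

MAX_CODE_POINT = 0x10FFFF

# Bits needed to number every code point from 0 to MAX_CODE_POINT.
bits_needed = math.log2(MAX_CODE_POINT + 1)
print(round(bits_needed, 3))


def utf16_units(ch):
    """Return the 16-bit code units UTF-16 uses for one character."""
    data = ch.encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


print([hex(u) for u in utf16_units("\u4e2d")])      # BMP ideograph: one unit
print([hex(u) for u in utf16_units("\U00020000")])  # beyond the BMP: two units
```

U+20000, the first ideograph of CJK Extension B, comes out as the surrogate pair D840 DC00, so it costs four bytes in 16-bit mode instead of two.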

I _did_ think that people would actually _look_ at SCSU before attacking
it.  How naive of me.

By the way, here is just one example of why encoding-tagged strings
can be a pain.  Suppose S1 is Greek text with preferred encoding
MacGreek and S2 is Hebrew text with preferred encoding MacHebrew.
What is the preferred encoding of S1 ++ S2 and why?
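
The difficulty can be sketched in Python under some loud assumptions. Python ships a mac_greek codec but, as far as I know, no MacHebrew codec, so iso8859_8 stands in for MacHebrew here. The tagged-string model (a text plus a preferred encoding name) and the function names are invented for this illustration, and the fallback to UTF-8 is only one possible policy, not an answer to the question.

```python
def encodable(text, encoding):
    """True if every character of text can be represented in encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


def concat_tagged(s1, enc1, s2, enc2, fallback="utf-8"):
    """Concatenate two encoding-tagged strings and pick a tag for the result.

    Policy (an assumption): keep the first operand tag that can represent
    the whole result, otherwise give up and fall back to a Unicode encoding.
    """
    joined = s1 + s2
    for enc in (enc1, enc2):
        if encodable(joined, enc):
            return joined, enc
    return joined, fallback


greek = "\u03b1\u03b2\u03b3"           # alpha beta gamma
hebrew = "\u05e9\u05dc\u05d5\u05dd"    # shin lamed vav final-mem

# "iso8859_8" is a stand-in for MacHebrew, which Python lacks.
joined, tag = concat_tagged(greek, "mac_greek", hebrew, "iso8859_8")
print(tag)
```

Neither operand's preferred encoding can hold the result, so the tag on S1 ++ S2 has to be something neither string asked for.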



More information about the erlang-questions mailing list