Strings (was: Re: are Mnesia tables immutable?)
Richard A. O'Keefe
Thu Jun 29 07:43:11 CEST 2006
> + CJK text requires 2n+1 bytes (one byte says "switch to 16-bit"),
> so SCSU is a good representation for CJK.
I meant, of course, "CJK text using BMP characters".
dda <headspin@REDACTED> wrote at length about the fact that there are
more CJK characters than the ones in the BMP. That isn't actually
relevant, because neither the variable-byte encoding that I outlined
nor SCSU has any 16-bit limit. Both of them can handle the *full*
20-and-fraction bits of Unicode.
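To make the "no 16-bit limit" point concrete, here is a minimal sketch in Python. It is NOT the variable-byte encoding outlined earlier in this thread and it is not SCSU; it is a generic base-128 scheme (7 payload bits per byte, high bit meaning "more bytes follow"), and the names varint_encode and varint_decode are made up for this illustration. It only shows that a byte-oriented variable-length code reaches every code point up to U+10FFFF, inside or outside the BMP.

```python
import math

MAX_CODE_POINT = 0x10FFFF  # highest code point Unicode defines


def varint_encode(cp):
    """Encode one code point as base-128 bytes, least significant group first.

    Hypothetical helper for illustration; not the encoding from the thread.
    """
    if not 0 <= cp <= MAX_CODE_POINT:
        raise ValueError("not a Unicode code point")
    out = bytearray()
    while True:
        low7 = cp & 0x7F
        cp >>= 7
        if cp:
            out.append(low7 | 0x80)  # high bit set: more bytes follow
        else:
            out.append(low7)         # high bit clear: last byte
            return bytes(out)


def varint_decode(data):
    """Decode concatenated varint_encode output back into a list of code points."""
    code_points, cp, shift = [], 0, 0
    for byte in data:
        cp |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7
        else:
            code_points.append(cp)
            cp, shift = 0, 0
    return code_points


# The "20-and-fraction bits": log2 of the number of code points.
bits_needed = math.log2(MAX_CODE_POINT + 1)
```

In this sketch a supplementary-plane character such as U+20000 takes three bytes, the same as U+10FFFF, so nothing special happens at the edge of the BMP.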
I _did_ think that people would actually _look_ at SCSU before attacking
it. How naive of me.
By the way, here is just one example of why encoding-tagged strings
can be a pain. Suppose S1 is Greek text with preferred encoding
MacGreek and S2 is Hebrew text with preferred encoding MacHebrew.
What is the preferred encoding of S1 ++ S2 and why?
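The question can be tried out with a small Python sketch. Two assumptions: Python's standard library has a mac_greek codec, but I am not aware of a MacHebrew one, so ISO-8859-8 stands in for it here; and can_encode is a helper invented for this illustration. The sample strings are just the first three letters of each alphabet.

```python
GREEK_ENC = "mac_greek"
HEBREW_ENC = "iso8859_8"  # stand-in for MacHebrew (assumption, see above)

s1 = "\u03b1\u03b2\u03b3"  # Greek: alpha, beta, gamma
s2 = "\u05d0\u05d1\u05d2"  # Hebrew: alef, bet, gimel


def can_encode(text, encoding):
    """Return True if every character of text is representable in encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


combined = s1 + s2

# Each string fits its own tagged encoding, but the concatenation fits neither.
fits = {
    "s1 in Greek encoding": can_encode(s1, GREEK_ENC),
    "s2 in Hebrew encoding": can_encode(s2, HEBREW_ENC),
    "s1 ++ s2 in Greek encoding": can_encode(combined, GREEK_ENC),
    "s1 ++ s2 in Hebrew encoding": can_encode(combined, HEBREW_ENC),
    "s1 ++ s2 in UTF-8": can_encode(combined, "utf-8"),
}
```

Under these assumptions neither operand's tag can be kept for the result, so concatenation has to pick some third encoding that covers both scripts, which is exactly the awkwardness the question points at.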