Fri Jun 30 05:06:33 CEST 2006
David Hopwood wrote:
> Note that an implementation of strings using SCSU would need
> to guarantee that the final state of a string encoding is the
> same as the initial state ('single byte mode', *all* windows
> set back to the defaults). Otherwise, ++ cannot be implemented
> by direct concatenation; only by reencoding (which takes
> O(length(S1) + length(S2)) time, since we need to both
> reconstruct the state at the end of S1, and reencode S2).
That would not be a problem if (1) we keep the in-memory
representation of strings as lists of code points, and (2) we
encode and decode strings only at the external boundary (in
term_to_binary/1 and binary_to_term/1). Each string is encoded
externally as a whole and is well delimited, so encoded strings
never need to be concatenated.
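A rough sketch of what I mean, under some loud assumptions: the
module name and the to_external/1 and from_external/1 helpers are
hypothetical, not part of term_to_binary/1, and UTF-8 (via OTP's
unicode module) only stands in for a stateful compressor such as
SCSU, which OTP does not ship. The point is just that ++ works on
code point lists, and the encoder only ever sees a whole,
length-delimited string.

```erlang
-module(ext_string_sketch).
-export([to_external/1, from_external/1, demo/0]).

%% Hypothetical helper: encode a whole string (a list of code points)
%% for external use. UTF-8 is only a stand-in for SCSU or BOCU-1.
to_external(CodePoints) when is_list(CodePoints) ->
    Encoded = unicode:characters_to_binary(CodePoints, unicode, utf8),
    %% A 32-bit length prefix delimits the string, so the encoder
    %% state never has to carry over into whatever follows it.
    <<(byte_size(Encoded)):32, Encoded/binary>>.

%% Hypothetical helper: decode one delimited string and return it
%% together with the remaining bytes.
from_external(<<Len:32, Encoded:Len/binary, Rest/binary>>) ->
    {unicode:characters_to_list(Encoded, utf8), Rest}.

demo() ->
    S1 = [16#41, 16#3B1],       % "A", Greek small alpha
    S2 = [16#3B2, 16#1D11E],    % Greek small beta, musical G clef
    %% Concatenation happens on the code point lists, in memory.
    S = S1 ++ S2,
    %% Encoding happens once, on the whole string, at the boundary.
    Bin = to_external(S),
    %% S is already bound, so this match asserts a clean round trip.
    {S, <<>>} = from_external(Bin),
    ok.
```

Calling ext_string_sketch:demo() should simply return ok. Nothing
here ever joins two encoded binaries, so the final state of the
encoder does not matter.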
> Because of this, an encoding scheme without dynamic windows,
> or with a 'reset' code, might be preferable to SCSU.
> BOCU-1 <http://www.unicode.org/reports/tr40/tr40-1.html> shows
> that it is possible to get essentially the same compression
> ratios without any need for dynamic windows (see the table in
> section 6). BOCU-1 is also deterministic and
> codepoint-ordered, although only if resets are not used.
> (Note that BOCU-1 is patented, however. Despite that "IBM
> would like to offer a royalty free license to this patent upon
> request to implementers of a fully compliant version of
> BOCU-1", this might not be sufficient to satisfy some
> open-source licensing policies.)
IBM's ICU library implements both SCSU and BOCU-1, along with
several other encodings (cf. the UConverterType enum).
ICU is licensed under the modified BSD license. Couldn't that
library be used directly?