Strings

Fri Jun 30 05:06:33 CEST 2006

David Hopwood wrote:
> Note that an implementation of strings using SCSU would need
> to guarantee that the final state of a string encoding is the
> same as the initial state ('single byte mode', *all* windows
> set back to the defaults). Otherwise, ++ cannot be implemented
> by direct concatenation; only by reencoding (which takes
> O(length(S1) + length(S2)) time, since we need to both
> reconstruct the state at the end of S1, and reencode S2).

That would not be a problem if (1) we keep the internal 
representation of strings as lists of code points, and (2) we 
encode/decode strings only in the external representation (in 
term_to_binary/1 and binary_to_term/1). Each string is then 
encoded externally as a whole and is well delimited, so 
concatenation of encoded strings would never happen.
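
To make the point concrete, here is a minimal sketch. The module 
name string_boundary and the function demo/0 are made up for 
illustration, and today's external term format does not use SCSU 
or BOCU-1; the sketch only shows where such an encoding step 
would sit, and that ++ never sees encoded data.

```erlang
-module(string_boundary).
-export([demo/0]).

%% Hypothetical illustration: strings stay lists of Unicode code
%% points inside the VM, so ++ is plain list concatenation and
%% involves no encoder state at all.
demo() ->
    S1 = [16#48, 16#E9],        % "H", e with acute accent
    S2 = [16#20AC, 16#1D11E],   % euro sign, musical symbol G clef
    S = S1 ++ S2,
    %% Encoding happens only at the external boundary, and always
    %% on the whole term. A compressed string encoding, if one
    %% were adopted, would be applied here.
    Bin = term_to_binary(S),
    %% Decoding restores the same list of code points; the match
    %% fails if the round trip changed anything.
    S = binary_to_term(Bin),
    {S, byte_size(Bin)}.
```

Whatever compression scheme is chosen, the encoder state would 
start and end within a single call to term_to_binary/1, so the 
final state of the encoding would never matter to ++.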

> Because of this, an encoding scheme without dynamic windows,
> or with a 'reset' code, might be preferable to SCSU.
> BOCU-1 <http://www.unicode.org/reports/tr40/tr40-1.html> shows
> that it is possible to get essentially the same compression
> ratios without any need for dynamic windows (see the table in
> section 6). BOCU-1 is also deterministic and
> codepoint-ordered, although only if resets are not used.
>
> (Note that BOCU-1 is patented, however. Despite that "IBM
> would like to offer a royalty free license to this patent upon
> request to implementers of a fully compliant version of
> BOCU-1", this might not be sufficient to satisfy some
> open-source licensing policies.)

IBM's ICU library implements both SCSU and BOCU-1, among several 
other converters (cf. the UConverterType enum):
http://icu.sourceforge.net/apiref/icu4c/ucnv_8h.html

ICU is licensed under a permissive, non-copyleft license (the ICU 
license, which is X/MIT-style). Couldn't that library be used 
directly?

-- 
Romain LENGLET