[erlang-questions] unicode in string literals

Wed Aug 1 08:57:43 CEST 2012

On 2012-08-01, at 06:14 , Richard O'Keefe wrote:

> And having a
> distinct data type is no protection against that problem:  Java
> and Javascript both have opaque string datatypes, but both
> allow slicing a well formed string into pieces that are not
> well formed.

To be fair, they have a further compounding issue: their string types
are dedicated but not opaque. They are sequences of UTF-16 code units
(on account of originally being UCS-2 sequences).

As a result, not only do you have the usual Unicode issues, which may or
may not be (non-trivially) solvable with grapheme-aware Unicode
handling[0], but these are further compounded by the ability to see and
break apart surrogate pairs (so you can, e.g., split a string in the
middle of a surrogate pair).
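
To make the surrogate-pair problem concrete, here is a rough Python
sketch. Python strings are not UTF-16 sequences, so the sketch simulates
a Java/JavaScript-style string as a list of UTF-16 code units; the
helper names utf16_units and is_well_formed are made up for this
illustration and are not part of any library.

```python
def utf16_units(text):
    """Return the UTF-16 code units of text as a list of ints."""
    data = text.encode("utf-16-le")
    return [int.from_bytes(data[i:i + 2], "little")
            for i in range(0, len(data), 2)]


def is_well_formed(units):
    """True if every surrogate is part of a high+low pair, in that order."""
    i = 0
    while i < len(units):
        unit = units[i]
        if 0xD800 <= unit <= 0xDBFF:
            # A high surrogate must be followed by a low surrogate.
            if i + 1 >= len(units) or not 0xDC00 <= units[i + 1] <= 0xDFFF:
                return False
            i += 2
        elif 0xDC00 <= unit <= 0xDFFF:
            # A low surrogate with no preceding high surrogate.
            return False
        else:
            i += 1
    return True


# 'a', U+1D11E MUSICAL SYMBOL G CLEF (outside the BMP), 'b'
units = utf16_units("a\U0001D11Eb")

# Slicing by code unit index, as a UTF-16 based string type would allow.
head, tail = units[:2], units[2:]

print([hex(u) for u in units])
print(is_well_formed(units), is_well_formed(head), is_well_formed(tail))
```

The whole sequence is well formed, but both halves of the slice contain
a lone surrogate, which is exactly the kind of breakage a code-unit
based string type permits.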

CPython 3.3 has implemented a fully opaque string type: it exposes Unicode
code points (if I remember correctly), but that may or may not match the
underlying binary data (the underlying representation can vary between
Latin-1, UCS-2 and UCS-4 depending on the string's contents).
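
A small sketch of what that looks like from the Python side, assuming
CPython 3.3 or later. Indexing and length work on code points, while the
storage width per character is an implementation detail that can only be
glimpsed indirectly, here through sys.getsizeof. The helper name
bytes_per_char is made up for this illustration.

```python
import sys


def bytes_per_char(ch):
    """Estimate the storage used per character for a string made of ch.

    Compares two strings of different lengths so the fixed object header
    cancels out. This relies on CPython's internal representation and is
    not guaranteed by the language.
    """
    return (sys.getsizeof(ch * 2000) - sys.getsizeof(ch * 1000)) // 1000


s = "a\U0001D11Eb"

# The string is exposed as code points: no surrogates are visible.
print(len(s))
print(s[1] == "\U0001D11E")

# Storage width grows with the widest code point in the string.
for ch in ("a", "\u00e9", "\u20ac", "\U0001D11E"):
    print("U+%04X" % ord(ch), bytes_per_char(ch))
```

Other Python implementations are free to store strings differently, so
only the code point behaviour should be relied upon.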

[0] Which also needs to be locale-aware: for instance, conversion to
    lower/upper case is not a 1:1 mapping in Unicode, as different cultures
    may have different uppercases for the same lowercase and the other way
    around. The usual example is Turkish, in which the lowercase of "I" is
    "ı" and the uppercase of "i" is "İ".
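
As a rough illustration of the footnote, Python's built-in str.lower()
and str.upper() apply the locale-independent Unicode mappings, so they
get Turkish wrong. The helpers turkish_lower and turkish_upper below are
hypothetical names for this sketch and only handle the dotted/dotless i
pair; a real application would more likely use a library such as ICU.

```python
def turkish_lower(text):
    """Lowercase text using the Turkish rules for I and İ."""
    return text.replace("I", "\u0131").replace("\u0130", "i").lower()


def turkish_upper(text):
    """Uppercase text using the Turkish rules for i and ı."""
    return text.replace("i", "\u0130").replace("\u0131", "I").upper()


# Default mappings: "I" and "i" are each other's case counterparts.
print("I".lower() == "i", "i".upper() == "I")

# Turkish mappings: dotless and dotted i are separate letters.
print(turkish_lower("DIYARBAKIR"))
print(turkish_upper("istanbul"))

# Case conversion is not even length-preserving in general.
print(len("\u0130".lower()), "\u00df".upper())
```

The last line shows the default lowercase of "İ" expanding to two code
points ("i" followed by a combining dot above) and "ß" uppercasing to
"SS".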