[erlang-questions] unicode in string literals
Wed Aug 1 08:57:43 CEST 2012
On 2012-08-01, at 06:14 , Richard O'Keefe wrote:
> And having a
> distinct data type is no protection against that problem: Java
> allow slicing a well formed string into pieces that are not
> well formed.
To be fair, they've got the further compounding issue that string types
are dedicated but not opaque: they are sequences of UTF-16 code units
(on account of originally being UCS2 sequences).
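A rough sketch of what "sequence of UTF-16 code units" means in practice. It is written in Python rather than Java, so the code-unit view is emulated by encoding to UTF-16; the helper names (utf16_units, is_well_formed) are made up for illustration and are not part of any library.

```python
import struct

def utf16_units(text):
    """Return the UTF-16 code units of text as a list of ints (hypothetical helper)."""
    data = text.encode("utf-16-be")
    return list(struct.unpack(">%dH" % (len(data) // 2), data))

def is_well_formed(units):
    """True if the code units decode as valid UTF-16 (no lone surrogates)."""
    data = struct.pack(">%dH" % len(units), *units)
    try:
        data.decode("utf-16-be")  # strict decoding rejects lone surrogates
        return True
    except UnicodeDecodeError:
        return False

# 'a', U+1D11E MUSICAL SYMBOL G CLEF (outside the BMP), 'b'
s = "a\U0001D11Eb"
units = utf16_units(s)

print(len(s))      # 3 code points
print(len(units))  # 4 code units: the clef needs a surrogate pair

# Slicing by code unit, as a UTF-16-based string type allows,
# can cut the surrogate pair in half.
head, tail = units[:2], units[2:]
print(is_well_formed(units))  # True
print(is_well_formed(head))   # False: ends with a lone high surrogate
print(is_well_formed(tail))   # False: starts with a lone low surrogate
```

Both halves of the slice are ill-formed even though the original string was fine, which is the situation the quoted text describes.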
As a result, not only do you have the usual Unicode issues, which may or
may not be (non-trivially) solvable (with grapheme-aware Unicode handling),
but these are further compounded by the ability to see and break apart
surrogate pairs (so you can e.g. split a string in the middle of a
CPython 3.3 has implemented a fully opaque string type: it exposes Unicode
codepoints (if I remember correctly), but that may or may not be the
underlying binary data (the underlying representation can dynamically switch
between latin-1, UCS2 and UCS4).
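One way to observe this from the outside, as a sketch only: it assumes CPython 3.3 or later and leans on sys.getsizeof, which reports an implementation detail rather than anything the language guarantees. The helper name bytes_per_char is hypothetical.

```python
import sys

def bytes_per_char(ch, n=1000):
    """Estimate the storage used per character for a string made only of ch.

    Comparing two lengths cancels out the fixed object header.
    CPython-specific: other implementations may report something else.
    """
    return (sys.getsizeof(ch * (2 * n)) - sys.getsizeof(ch * n)) // n

# The API always deals in code points, whatever the storage width is.
clef = "\U0001D11E"
print(len(clef))           # 1
print(len("a" + clef))     # 2

# On CPython 3.3+ the widths below are expected to be 1, 1, 2 and 4:
for ch in ("a", "\u00e9", "\u20ac", clef):
    print(hex(ord(ch)), bytes_per_char(ch))
```

The width is picked from the widest code point in the string, so a single character outside latin-1 is enough to move the whole string to a wider representation.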
Which also needs to be locale-aware: for instance, a conversion to
lower/upper case is not a 1:1 mapping in Unicode, as different cultures
may have different uppercases for the same lowercase and the other way
around, the usual example being Turkish (in which the lowercase of "I"
is "ı" and the uppercase of "i" is "İ").
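A small illustration of the Turkish case, as a sketch: Python's str.upper() and str.lower() use the locale-independent Unicode mappings, so the Turkish rules have to be applied by hand here. The functions turkish_upper and turkish_lower are hypothetical helpers written for this example, not a real API, and they only cover the dotted/dotless i.

```python
def turkish_upper(text):
    """Uppercase with the Turkish rule: 'i' -> 'İ' (dotless 'ı' -> 'I' is the default)."""
    return text.replace("i", "\u0130").upper()

def turkish_lower(text):
    """Lowercase with the Turkish rules: 'İ' -> 'i' and 'I' -> 'ı'."""
    return text.replace("\u0130", "i").replace("I", "\u0131").lower()

# Default (locale-independent) mappings:
print("istanbul".upper())         # ISTANBUL
print("ISPARTA".lower())          # isparta

# Turkish mappings:
print(turkish_upper("istanbul"))  # İSTANBUL
print(turkish_lower("ISPARTA"))   # ısparta

# Case conversion is not 1:1 in length either:
print("\u00df".upper())           # SS (from the single character ß)
print(len("\u0130".lower()))      # 2 ('i' followed by a combining dot above)
```

A real implementation would take the locale as a parameter and use a proper library (ICU or similar) rather than special-casing one language.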