[erlang-questions] unicode in string literals

Richard O'Keefe <>
Thu Aug 2 03:42:43 CEST 2012


On 1/08/2012, at 6:57 PM, Masklinn wrote:

> On 2012-08-01, at 06:14 , Richard O'Keefe wrote:
> 
>> And having a
>> distinct data type is no protection against that problem:  Java
>> and Javascript both have opaque string datatypes, but both
>> allow slicing a well formed string into pieces that are not
>> well formed.
> 
> To be fair, they've got the further compounding issue that strings types
> are dedicated but not opaque: they are sequences of UTF-16 code units
> (on account of originally being UCS2 sequences).

You are right.  I should not have said "opaque".  The implementation
is *encapsulated*, but the fact that it's a slice of an array of
16-bit units shows through.
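To make that concrete, here is a small sketch (in Python, whose str is a
sequence of code points, used here only to model the Java/JavaScript
situation): treat a string as an array of 16-bit code units, and a slice
taken at the wrong index splits a surrogate pair, leaving a string that is
not well formed.  The choice of U+1D11E as the example character is mine.

```python
# Model a Java/JavaScript string as its UTF-16 code units.
clef = "\U0001D11E"                  # U+1D11E MUSICAL SYMBOL G CLEF, outside the BMP
utf16 = clef.encode("utf-16-be")     # four bytes: two 16-bit code units
units = [utf16[i:i + 2] for i in range(0, len(utf16), 2)]
assert len(units) == 2               # a surrogate pair: 0xD834, 0xDD1E

# "Slicing" between the two code units leaves a lone high surrogate,
# which no longer decodes as well-formed UTF-16:
try:
    units[0].decode("utf-16-be")
except UnicodeDecodeError:
    print("left half is malformed")
```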

As it happens, I *wasn't* referring to the possibility of splitting
a codepoint between its two surrogates.  If we restrict our attention to
the Basic Multilingual Plane, it is *still* possible to slice a
well formed BMP string into pieces that are not well formed.  I have
in mind things like the way Apple used to have two plus signs, one
for left to right text and one for right to left text, but since
Unicode has only one, the way to encode א+ב was
[Aleph, left-to-right override, plus, pop directional formatting, Beth],
and a division that gives the left part either 2 or 3 codepoints is one
that gives you two strings that make no sense.
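The point can be sketched in Python (my choice of language; the balanced()
check below is a deliberate simplification of the real bidi nesting rules,
for illustration only): every code point here is in the BMP, yet cutting the
five-codepoint string after 2 or 3 code points separates the left-to-right
override from its matching pop, so neither piece is well formed.

```python
LRO = "\u202D"   # LEFT-TO-RIGHT OVERRIDE
PDF = "\u202C"   # POP DIRECTIONAL FORMATTING
# [Aleph, LRO, plus, PDF, Beth] -- the encoding described above
s = "\u05D0" + LRO + "+" + PDF + "\u05D1"

def balanced(text):
    """Every LRO must be closed by a later PDF (simplified bidi rule)."""
    depth = 0
    for ch in text:
        if ch == LRO:
            depth += 1
        elif ch == PDF:
            if depth == 0:
                return False     # a PDF with no preceding LRO
            depth -= 1
    return depth == 0            # no dangling LRO

assert balanced(s)           # the whole string is fine
assert not balanced(s[:2])   # left part of a 2-codepoint cut: dangling LRO
assert not balanced(s[3:])   # right part of a 3-codepoint cut: orphan PDF
```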
 
As it happens, I don't know of any programming language that deals with
this.  My basic point is that any data structure for text that
*doesn't* ensure that all the 'strings' you deal with are well formed
has already lost its virginity and might as well be frankly and openly
just a sequence of code points.





