[erlang-questions] correct terminology for referring to strings

Richard Carlsson carlsson.richard@REDACTED
Tue Jul 31 18:49:07 CEST 2012


On 07/31/2012 06:03 PM, Fred Hebert wrote:
> Your post seemed to imply that converting to single code point
> representation is good enough. I do not understand how that distinction
> solves the problem of string reversal as I wrote it here, though.

It doesn't. That's what I said: "reversing a Unicode string is a bad
idea anyway because it could contain combining characters". But I also
clarified that on the Erlang level, at runtime, a string is a list of
single code points rather than a UTF-8 encoded byte sequence, so the
particular example "a∞b" happens to reverse correctly. Nothing more,
nothing less.
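
To make the code point versus UTF-8 distinction concrete, here is a
minimal sketch. It is written in Python rather than Erlang, purely
because a Python str is also a sequence of code points; the helper name
is_valid_utf8 is made up for this illustration.

```python
# A Python str, like an Erlang string at runtime, is a sequence of code points.
s = "a\u221eb"  # "a∞b": U+221E INFINITY is one code point but three UTF-8 bytes

# Reversing the code points keeps every character intact.
reversed_points = s[::-1]

# Reversing the UTF-8 encoded bytes reverses the three bytes of U+221E too,
# so the result no longer starts that character with a valid lead byte.
reversed_bytes = s.encode("utf-8")[::-1]


def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes cleanly as UTF-8 (hypothetical helper)."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False
```

Here reversed_points is the expected "b∞a", while reversed_bytes is not
even valid UTF-8 any more. That is all the "a∞b" example shows; it says
nothing about combining characters.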

> I would expect, as a user of some string data type or bytestring that
> claims to support unicode, that reversing a string with the characters "
> ́e" would give me "e ́". Single code point representation or not.

Yes. That's why there needs to be a new Unicode-aware string library. 
Operating directly on lists (e.g. using lists:reverse/1, or even 
length/1) is always going to have surprising effects, and the old 
'string' module in stdlib probably can't be modernized while maintaining 
backwards compatibility.
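
As a rough illustration of the surprise, the sketch below (again in
Python, standing in for a list of code points) reverses a decomposed
"née" both naively and while keeping each combining mark attached to its
base character. The function names are invented for this example, and
the grouping rule only looks at the canonical combining class, so it is
an assumption-laden sketch and not a full grapheme cluster
implementation.

```python
import unicodedata


def naive_reverse(s: str) -> str:
    """Reverse code point by code point, like lists:reverse/1 on a string."""
    return s[::-1]


def reverse_keeping_marks(s: str) -> str:
    """Reverse s while keeping combining marks after their base character."""
    groups = []
    for ch in s:
        # A nonzero canonical combining class marks a combining character;
        # attach it to the group started by the preceding base character.
        if groups and unicodedata.combining(ch) != 0:
            groups[-1] += ch
        else:
            groups.append(ch)
    return "".join(reversed(groups))


# n, e, U+0301 COMBINING ACUTE ACCENT, e: "née" in decomposed form
word = "ne\u0301e"

length_in_code_points = len(word)        # 4, although a reader sees 3 characters
naive = naive_reverse(word)              # accent ends up on the other "e"
careful = reverse_keeping_marks(word)    # accent stays with its original "e"
```

The naive result is e, U+0301, e, n, so the accent has moved to the
other "e"; the careful result is e, e, U+0301, n. The length of 4 for
what looks like three characters is the same kind of surprise that
length/1 would give.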

> The concept of cluster has to be understood for it to make sense.

Grapheme clusters are actually one of the things you don't need to think
too much about unless you're writing an editor or something similar,
where you need to figure out between which code points to move the
cursor, or which sequence of code points to select based on what the
user points to on the screen. Combining characters are a much more basic
thing and need to be understood by pretty much anyone working with
Unicode.

     /Richard



