[erlang-questions] correct terminology for referring to strings
Richard Carlsson
carlsson.richard@REDACTED
Tue Jul 31 18:49:07 CEST 2012
On 07/31/2012 06:03 PM, Fred Hebert wrote:
> Your post seemed to imply that converting to single code point
> representation is good enough. I do not understand how that distinction
> solves the problem of string reversal as I wrote it here, though.
It doesn't. That's what I said: "reversing a Unicode string is a bad
idea anyway because it could contain combining characters". But I also
clarified that on the Erlang level, at runtime, strings will contain
single code points rather than a UTF-8 encoded byte sequence, so for the
particular example of "a∞b" it happens to work. Nothing more, nothing less.
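To make the code point versus byte distinction concrete, here is a minimal sketch. The module name codepoint_reverse and the function demo/0 are made up for illustration; the only library calls used are lists:reverse/1 and the stdlib unicode module. It shows why "a∞b" happens to survive a list reversal while the same text as UTF-8 bytes does not.

```erlang
-module(codepoint_reverse).
-export([demo/0]).

%% Hypothetical demo module, not part of any library.
demo() ->
    %% "a∞b" as a list of code points; U+221E (INFINITY) is one element.
    Chars = [$a, 16#221E, $b],
    %% Reversing the code point list keeps U+221E intact.
    [$b, 16#221E, $a] = lists:reverse(Chars),
    %% The same text encoded as UTF-8: U+221E takes three bytes.
    Utf8 = unicode:characters_to_binary(Chars),
    <<$a, 16#E2, 16#88, 16#9E, $b>> = Utf8,
    %% Reversing the bytes scrambles the three-byte sequence,
    %% so the result can no longer be decoded as UTF-8.
    Bytes = lists:reverse(binary_to_list(Utf8)),
    {error, "b", _Rest} = unicode:characters_to_list(list_to_binary(Bytes)),
    ok.
```

Calling codepoint_reverse:demo() returns ok if every match above holds. This only covers a string with no combining characters, which is exactly the limitation described above.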
> I would expect, as a user of some string data type or bytestring that
> claims to support unicode, that reversing a string with the characters "
> ́e" would give me "e ́". Single code point representation or not.
Yes. That's why there needs to be a new Unicode-aware string library.
Operating directly on lists (e.g. using lists:reverse/1, or even
length/1) is always going to have surprising effects, and the old
'string' module in stdlib probably can't be modernized while maintaining
backwards compatibility.
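As a rough illustration of those surprising effects, here is a small sketch using only length/1 and lists:reverse/1. The module name combining_demo and the function demo/0 are invented for the example, and it assumes the accented letter is written as "e" followed by U+0301 (COMBINING ACUTE ACCENT).

```erlang
-module(combining_demo).
-export([demo/0]).

%% Hypothetical demo module, not part of any library.
demo() ->
    %% "é" written as "e" followed by U+0301 COMBINING ACUTE ACCENT.
    Decomposed = [$e, 16#0301],
    %% One visible character, but two list elements.
    2 = length(Decomposed),
    %% After reversal the accent comes first, so it no longer
    %% follows the "e" it was meant to modify.
    [16#0301, $e] = lists:reverse(Decomposed),
    %% The precomposed form U+00E9 looks the same on screen,
    %% yet it has length 1 and does not compare equal.
    Precomposed = [16#00E9],
    1 = length(Precomposed),
    false = (Decomposed =:= Precomposed),
    ok.
```

Calling combining_demo:demo() returns ok if every match holds. A Unicode-aware library would have to treat the base character and its combining mark as one unit for both operations.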
> The concept of cluster has to be understood for it to make sense.
Grapheme clusters are actually one of the things you don't need to think
too much about unless you're writing an editor or something similar, where
you need to work out at which code point boundaries the cursor may stop, or
which sequence of code points to select based on what the user points to on
the screen. Combining characters are much more basic, and need to be
understood by pretty much anyone working with Unicode.
/Richard