[erlang-questions] correct terminology for referring to strings
Tue Jul 31 18:03:01 CEST 2012
Your post seemed to imply that converting to a single code point
representation is good enough. I do not understand how that distinction
solves the problem of string reversal as I described it, though.
I would expect, as a user of some string data type or bytestring that
claims to support Unicode, that reversing a string with the characters
" ́e" would give me "e ́", single code point representation or not.
The concept of a cluster has to be understood for reversal to make sense.
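As a rough sketch of the difference, assuming Erlang/OTP 20 or later
(where the string module works on grapheme clusters): string:reverse/1
reverses cluster by cluster, while lists:reverse/1 reverses individual
code points. The module name cluster_reverse and the functions naive/1
and by_cluster/1 are made up for illustration.

```erlang
-module(cluster_reverse).
-export([naive/1, by_cluster/1]).

%% Reverse code point by code point. A combining mark ends up in front
%% of its base character and may attach to whatever precedes it.
naive(Str) ->
    lists:reverse(Str).

%% Reverse grapheme cluster by grapheme cluster (assumes OTP 20+).
%% string:reverse/1 returns a list of clusters, where a multi-code-point
%% cluster is a nested list, so flatten it back to a plain string.
by_cluster(Str) ->
    lists:flatten(string:reverse(Str)).
```

With the decomposed input [$a, 16#0065, 16#0301], naive/1 returns
[16#0301, 16#0065, $a], leaving the accent in front of the "e", while
by_cluster/1 returns [16#0065, 16#0301, $a]. A combining mark with no
base character before it is segmented according to the Unicode rules,
so a leading accent like the one discussed here is worth testing
separately.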
Regarding your latest post (I received it while writing this one):
cursed be the problem of multiple environments. This is never going to
be easy to figure out!
On 12-07-31 11:50 AM, Richard Carlsson wrote:
> If you take another look at what I wrote, this is precisely what I was
> talking about. But you are confusing grapheme clusters with combining
> characters; they are not the same thing. A grapheme cluster is the
> next higher conceptual level, and a cluster could consist of multiple
> characters, each of which could be individually made up of a base
> character (such as "e") plus one or more combining characters (like
> U+0301 COMBINING ACUTE ACCENT).
> On 2012-07-31 17:19, Fred Hebert wrote:
>> Even then the reversal is not guaranteed.
>> The character 'é' can be represented, for example, in two ways:
>> é = U+00E9
>> e+ ́ = U+0065 + U+0301
>> The first one allows a representation as a single codepoint, but the
>> second one is a 'grapheme cluster', a sequence of codepoints
>> representing a single grapheme, a single unit of text. Grapheme clusters
>> can be larger than two elements, and as far as I know, you cannot
>> reverse them. The cluster should ideally remain in the same order in the
>> reversed string:
>> 2> io:format("~ts~n",[[16#0065,16#0301]]).
>> 3> io:format("~ts~n",[[16#0301,16#0065]]).
>> This is fine with your plan -- if I force a single code point
>> representation, this is a non-issue.
>> The tricky thing is that if I enter a string containing " ́e" in my
>> module and later reverse it, I will get "é" and not "e ́" as a final
>> result. What was initially [16#0301,16#0065] gets reversed into
>> [16#0065,16#0301], which is not the same as the correct visual
>> representation " ́e" (represented as ([16#0065, $ , 16#0301]), with an
>> implicit space in there)
>> It works one way (starting the right direction then reversing), but
>> without being very careful, it can break when going the other way
>> (starting with two non-combined code points that get assembled in the
>> same cluster when reversed).
>> Just changing to single code point representations isn't enough to make
>> sure nothing is broken.
>> On 12-07-31 10:04 AM, Richard Carlsson wrote:
>>> No, you're confusing Unicode (a sequence of code points) with specific
>>> encodings such as UTF-8 and UTF-16. The first is downwards compatible
>>> with Latin-1: the values from 128 to 255 are the same. In UTF-8
>>> they're not. At runtime, Erlang's strings are just plain sequences of
>>> Unicode code points (you can think of it as UTF-32 if you like).
>>> Whether the source code is encoded in UTF-8 or Latin-1 or any other
>>> encoding is irrelevant as long as the compiler knows how to transform
>>> the input to the single-codepoint representation.
>>> For example, reversing a Unicode string is a bad idea anyway because
>>> it could contain combining characters, and reversing the order of the
>>> codepoints in that case will create an illegal string. But an
>>> expression like lists:reverse("a∞b") will be working on the list [97,
>>> 8734, 98] (once the compiler has been extended to accept other
>>> encodings than Latin-1), not the list [97,226,136,158,98], so it will
>>> produce the intended "b∞a". This string might then become encoded as
>>> UTF-8 on its way to your terminal, but that's another story.