[erlang-questions] correct terminology for referring to strings
Tue Jul 31 17:50:11 CEST 2012
If you take another look at what I wrote, this is precisely what I was
talking about. But you are confusing grapheme clusters with combining
characters; they are not the same thing. A grapheme cluster is the next
higher conceptual level, and a cluster could consist of multiple
characters, each of which could be individually made up of a base
character (such as "e") plus one or more combining characters (like
U+0301 COMBINING ACUTE ACCENT).
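
To make the distinction concrete, here is a minimal sketch of the simplest case, a base character plus one combining character. It assumes a much later OTP release (20 or newer) than the one current when this thread was written, since string:length/1, string:to_graphemes/1 and string:reverse/1 only exist there. The module and function names are made up for illustration.

```erlang
-module(grapheme_demo).
-export([demo/0]).

%% "é" written as a base character followed by a combining character.
%% Code points are given as integers so the source encoding does not matter.
decomposed_e() -> [16#0065, 16#0301].

demo() ->
    S = decomposed_e() ++ "a",
    %% length/1 counts code points: e, combining acute, a.
    3 = length(S),
    %% string:length/1 (assumes OTP 20+) counts grapheme clusters.
    2 = string:length(S),
    %% string:to_graphemes/1 keeps base and combining character together.
    [[16#0065, 16#0301], $a] = string:to_graphemes(S),
    %% lists:reverse/1 works on code points and detaches the accent
    %% from its base character.
    [$a, 16#0301, 16#0065] = lists:reverse(S),
    %% string:reverse/1 reverses cluster by cluster instead.
    [$a, [16#0065, 16#0301]] = string:reverse(S),
    ok.
```

Calling grapheme_demo:demo() returns ok if every match holds. Larger clusters follow the same pattern; this only shows the two-code-point case.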
On 2012-07-31 17:19, Fred Hebert wrote:
> Even then the reversal is not guaranteed.
> The character 'é' can be represented, for example, in two ways:
> é = U+00E9
> e+ ́ = U+0065 + U+0301
> The first one allows a representation as a single codepoint, but the
> second one is a 'grapheme cluster', a sequence of codepoints
> representing a single grapheme, a single unit of text. Grapheme clusters
> can be larger than two elements, and as far as I know, you cannot
> reverse them. The cluster should ideally remain in the same order in the
> reversed string:
> 2> io:format("~ts~n",[[16#0065,16#0301]]).
> 3> io:format("~ts~n",[[16#0301,16#0065]]).
> This is fine with your plan -- if I force a single code point
> representation, this is a non-issue.
> The tricky thing is that if I enter a string containing " ́e" in my
> module and later reverse it, I will get "é" and not "e ́" as a final
> result. What was initially [16#0301,16#0065] gets reversed into
> [16#0065,16#0301], which is not the same as the correct visual
> representation "e ́" (represented as [16#0065, $ , 16#0301], with an
> implicit space in there).
> It works one way (starting the right direction then reversing), but
> without being very careful, it can break when going the other way
> (starting with two non-combined code points that get assembled in the
> same cluster when reversed).
> Just changing to single code point representations isn't enough to make
> sure nothing is broken.
> On 12-07-31 10:04 AM, Richard Carlsson wrote:
>> No, you're confusing Unicode (a sequence of code points) with specific
>> encodings such as UTF-8 and UTF-16. The first is downwards compatible
>> with Latin-1: the values from 128 to 255 are the same. In UTF-8
>> they're not. At runtime, Erlang's strings are just plain sequences of
>> Unicode code points (you can think of it as UTF-32 if you like).
>> Whether the source code is encoded in UTF-8 or Latin-1 or any other
>> encoding is irrelevant as long as the compiler knows how to transform
>> the input to the single-codepoint representation.
>> For example, reversing a Unicode string is a bad idea anyway because
>> it could contain combining characters, and reversing the order of the
>> codepoints in that case will create an illegal string. But an
>> expression like lists:reverse("a∞b") will be working on the list [97,
>> 8734, 98] (once the compiler has been extended to accept other
>> encodings than Latin-1), not the list [97,226,136,158,98], so it will
>> produce the intended "b∞a". This string might then become encoded as
>> UTF-8 on its way to your terminal, but that's another story.
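
The difference between code points and an encoding such as UTF-8 can be sketched as follows. This is an illustration only: the module name is hypothetical, and the string "a∞b" is written as integers so the example does not depend on how the compiler reads the source file. It uses the standard unicode module.

```erlang
-module(codepoint_demo).
-export([demo/0]).

demo() ->
    %% "a∞b" as a list of Unicode code points.
    S = [97, 8734, 98],
    %% Reversing the code points gives the intended "b∞a".
    [98, 8734, 97] = lists:reverse(S),
    %% The UTF-8 encoding of the same string is five bytes.
    Utf8 = unicode:characters_to_binary(S),
    <<97, 226, 136, 158, 98>> = Utf8,
    %% Reversing those bytes does not yield valid UTF-8: the decoder
    %% stops at the first stray continuation byte.
    Reversed = list_to_binary(lists:reverse(binary_to_list(Utf8))),
    {error, "b", <<158, 136, 226, 97>>} =
        unicode:characters_to_list(Reversed, utf8),
    %% Code point 233 (é) is one byte in Latin-1 but two bytes in UTF-8.
    <<233>> = unicode:characters_to_binary([233], unicode, latin1),
    <<195, 169>> = unicode:characters_to_binary([233]),
    ok.
```

Calling codepoint_demo:demo() returns ok if every match holds.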