[erlang-questions] correct terminology for referring to strings
Fred Hebert
mononcqc@REDACTED
Tue Jul 31 17:19:23 CEST 2012
Even then the reversal is not guaranteed.
The character 'é' can be represented, for example, in two ways:
é = U+00E9
e + ◌́ = U+0065 + U+0301
The first one allows a representation as a single codepoint, but the
second one is a 'grapheme cluster', a sequence of codepoints
representing a single grapheme, a single unit of text. Grapheme clusters
can be larger than two elements, and as far as I know, you cannot
reverse them. The cluster should ideally remain in the same order in the
reversed string:
2> io:format("~ts~n",[[16#0065,16#0301]]).
é
ok
3> io:format("~ts~n",[[16#0301,16#0065]]).
́e
ok
This is fine with your plan -- if I force a single code point
representation, this is a non-issue.
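A rough sketch of what keeping each cluster in order looks like, assuming OTP 20 or later (newer than this message), where string:reverse/1 works on grapheme clusters instead of single code points. The module and function names are made up for illustration:

```erlang
-module(gc_reverse).
-export([naive/1, by_cluster/1]).

%% Reverse code point by code point. A combining mark ends up in
%% front of the base character it used to follow.
naive(Str) ->
    lists:reverse(Str).

%% Reverse grapheme cluster by grapheme cluster (string:reverse/1,
%% assumed OTP 20+), then flatten the result back into a plain
%% list of code points.
by_cluster(Str) ->
    unicode:characters_to_list(string:reverse(Str)).
```

With the input [$a,16#0065,16#0301], naive/1 gives [16#0301,16#0065,$a] (the accent is detached from its 'e'), while by_cluster/1 gives [16#0065,16#0301,$a].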
The tricky thing is that if I enter a string containing " ́e" in my
module and later reverse it, I will get "é" and not "e ́" as a final
result. What was initially [16#0301,16#0065] gets reversed into
[16#0065,16#0301], which is not the same as the correct visual
representation "e ́" (represented as [16#0065, $ , 16#0301], with an
implicit space in there).
It works one way (starting the right direction then reversing), but
without being very careful, it can break when going the other way
(starting with two non-combined code points that get assembled in the
same cluster when reversed).
Just changing to single code point representations isn't enough to make
sure nothing is broken.
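To make both directions concrete, here is a small sketch that forces the composed (NFC) form first and then reverses code point by code point. It assumes OTP 20 or later for unicode:characters_to_nfc_list/1, and the module and function names are only illustrative:

```erlang
-module(nfc_reverse).
-export([normalize_then_reverse/1]).

%% Compose to NFC first (unicode:characters_to_nfc_list/1, assumed
%% OTP 20+), then do a plain code point reversal.
normalize_then_reverse(Str) ->
    lists:reverse(unicode:characters_to_nfc_list(Str)).
```

Starting from [16#0065,16#0301], normalization composes the pair into [16#00E9], so the reversal is harmless. Starting from [16#0301,16#0065], normalization has nothing to compose, and the reversal yields [16#0065,16#0301]: two separate units of text have been merged into one 'é'.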
On 12-07-31 10:04 AM, Richard Carlsson wrote:
> No, you're confusing Unicode (a sequence of code points) with specific
> encodings such as UTF-8 and UTF-16. The first is downwards compatible
> with Latin-1: the values from 128 to 255 are the same. In UTF-8
> they're not. At runtime, Erlang's strings are just plain sequences of
> Unicode code points (you can think of it as UTF-32 if you like).
> Whether the source code is encoded in UTF-8 or Latin-1 or any other
> encoding is irrelevant as long as the compiler knows how to transform
> the input to the single-codepoint representation.
>
> For example, reversing a Unicode string is a bad idea anyway because
> it could contain combining characters, and reversing the order of the
> codepoints in that case will create an illegal string. But an
> expression like lists:reverse("a∞b") will be working on the list [97,
> 8734, 98] (once the compiler has been extended to accept other
> encodings than Latin-1), not the list [97,226,136,158,98], so it will
> produce the intended "b∞a". This string might then become encoded as
> UTF-8 on its way to your terminal, but that's another story.
>
> /Richard
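The difference Richard describes between reversing code points and reversing encoded bytes can be checked with a short sketch. The module and function names are invented for the example; only lists:reverse/1 and unicode:characters_to_binary/1 are relied on:

```erlang
-module(cp_vs_bytes).
-export([reverse_codepoints/1, reverse_utf8_bytes/1]).

%% Reverse the list of Unicode code points directly.
reverse_codepoints(Str) ->
    lists:reverse(Str).

%% Encode to UTF-8 first and reverse the raw bytes. Multi-byte
%% sequences come out back to front, so the result is no longer
%% valid UTF-8 when the input has non-ASCII characters.
reverse_utf8_bytes(Str) ->
    Bytes = binary_to_list(unicode:characters_to_binary(Str)),
    lists:reverse(Bytes).
```

For [97,8734,98] the first function returns [98,8734,97], while the second returns [98,158,136,226,97], where the three bytes of U+221E are in the wrong order.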