<html>
<head>
<meta content="text/html; charset=ISO-8859-1"
http-equiv="Content-Type">
</head>
<body bgcolor="#FFFFFF" text="#000000">
Even then the reversal is not guaranteed.<br>
<br>
The character 'é' can be represented, for example, in two ways:<br>
<br>
é =
U+00E9<br>
e + ́ = U+0065 + U+0301<br>
<br>
The first is a single code point. The second is a 'grapheme
cluster': a sequence of code points representing a single grapheme,
a single unit of text. Grapheme clusters can be longer than two code
points and, as far as I know, you cannot simply reverse the code
points inside them. Ideally, each cluster keeps its internal order
in the reversed string:<br>
<br>
2> io:format("~ts~n",[[16#0065,16#0301]]).<br>
é<br>
ok<br>
3> io:format("~ts~n",[[16#0301,16#0065]]). <br>
́e<br>
ok<br>
<br>
This is fine with your plan -- if I force a single code point
representation, this is a non-issue.<br>
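<br>
As a rough sketch of keeping a cluster in its original order: this
assumes Erlang/OTP 20 or later (newer than this thread), where
string:reverse/1 works on grapheme clusters rather than on single
code points. The module and function names are made up for the
example.<br>

```erlang
-module(grapheme_reverse_demo).
-export([run/0]).

%% Hypothetical demo module. Assumes OTP 20+, where the string
%% module is grapheme-cluster aware.
run() ->
    %% "aéb" with the é in decomposed form (e + combining acute).
    Text = [$a, 16#0065, 16#0301, $b],

    %% Code point reversal: the accent now follows $b instead of $e.
    Naive = lists:reverse(Text),

    %% Grapheme-aware reversal returns a list of clusters; a
    %% multi-code-point cluster is a nested list such as [101,769].
    Clusters = string:reverse(Text),

    %% Flatten the clusters back into a plain code point list.
    Aware = unicode:characters_to_list(Clusters),

    io:format("~ts~n~ts~n", [Naive, Aware]),
    {Naive, Aware}.
```

After compiling the module, grapheme_reverse_demo:run() prints both
reversals and returns them as a tuple, so the two code point lists
can be compared directly.<br>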
<br>
The tricky part is the opposite case. If I enter a string containing
" ́e" (a standalone accent followed by 'e') in my module and later
reverse it, I will get "é" and not "e ́" as the final result. What
was initially [16#0301,16#0065] gets reversed into
[16#0065,16#0301], which is not the same as the correct visual
representation "e ́" (represented as [16#0065, $ , 16#0301], with a
space between the two code points).<br>
<br>
It works in one direction (starting from a well-formed cluster and
then reversing), but unless you are very careful it can break in the
other (starting with two uncombined code points that end up in the
same cluster once reversed).<br>
<br>
Just changing to single code point representations isn't enough to
make sure nothing is broken.<br>
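<br>
To make that concrete, here is a small sketch, again assuming
Erlang/OTP 20 or later for unicode:characters_to_nfc_list/1 (it did
not exist when this thread was written). The module name is made up.
It normalizes the stray-accent input to NFC first, then reverses the
code points, and the result still composes into a single 'é':<br>

```erlang
-module(nfc_reverse_demo).
-export([run/0]).

%% Hypothetical demo module. Assumes OTP 20+ for the
%% unicode:characters_to_nfc_list/1 normalization function.
run() ->
    %% A combining acute accent followed by 'e': two separate units.
    Input = [16#0301, 16#0065],

    %% Normalize up front. The leading accent has no base character
    %% before it, so there is nothing for NFC to compose it with.
    Nfc = unicode:characters_to_nfc_list(Input),

    %% Plain code point reversal puts the accent right after 'e'.
    Reversed = lists:reverse(Nfc),

    %% Normalizing the reversed list composes the pair into U+00E9.
    Composed = unicode:characters_to_nfc_list(Reversed),

    {Nfc, Reversed, Composed}.
```

The third element of the returned tuple is the single code point
U+00E9, even though the input contained no 'é' at all.<br>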
<br>
<div class="moz-cite-prefix">On 12-07-31 10:04 AM, Richard Carlsson
wrote:<br>
</div>
<blockquote cite="mid:5017E5D5.2030508@gmail.com" type="cite">No,
you're confusing Unicode (a sequence of code points) with specific
encodings such as UTF-8 and UTF-16. The first is downwards
compatible with Latin-1: the values from 128 to 255 are the same.
In UTF-8 they're not. At runtime, Erlang's strings are just plain
sequences of Unicode code points (you can think of it as UTF-32 if
you like). Whether the source code is encoded in UTF-8 or Latin-1
or any other encoding is irrelevant as long as the compiler knows
how to transform the input to the single-codepoint representation.
<br>
<br>
For example, reversing a Unicode string is a bad idea anyway
because it could contain combining characters, and reversing the
order of the codepoints in that case will create an illegal
string. But an expression like lists:reverse("a∞b") will be
working on the list [97, 8734, 98] (once the compiler has been
extended to accept other encodings than Latin-1), not the list
[97,226,136,158,98], so it will produce the intended "b∞a". This
string might then become encoded as UTF-8 on its way to your
terminal, but that's another story.
<br>
<br>
/Richard
<br>
<br>
</blockquote>
<br>
</body>
</html>