[erlang-questions] correct terminology for referring to strings

Tue Jul 31 17:19:23 CEST 2012

Even then the reversal is not guaranteed.

The character 'é' can be represented, for example, in two ways:

é = U+00E9
e + ́ = U+0065 + U+0301

The first one is a single code point. The second one is a 'grapheme 
cluster': a sequence of code points that together represent a single 
grapheme, a single unit of text. Grapheme clusters can be longer than 
two code points, and as far as I know, you cannot reverse the code 
points inside one. The cluster should ideally keep its internal order 
in the reversed string:

2> io:format("~ts~n",[[16#0065,16#0301]]).
é
ok
3> io:format("~ts~n",[[16#0301,16#0065]]).
 ́e
ok
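
For comparison, here is a minimal sketch of a reversal that keeps clusters 
intact. It assumes OTP 20 or later, where string:reverse/1 works on 
grapheme clusters; that postdates this thread, and the module and function 
names are made up for illustration.

```erlang
-module(grapheme_reverse).
-export([naive/1, by_cluster/1]).

%% Reverses code point by code point, so a combining mark ends up
%% in front of the character it used to follow.
naive(Str) ->
    lists:reverse(Str).

%% Reverses cluster by cluster (string:reverse/1, assumed OTP 20+).
%% string:reverse/1 returns a list whose elements may themselves be
%% lists (one per multi-code-point cluster), so flatten the result
%% back to a plain list of code points.
by_cluster(Str) ->
    unicode:characters_to_list(string:reverse(Str)).
```

With "a" followed by e + U+0301, naive/1 moves the accent to the front of 
the list, while by_cluster/1 keeps [16#0065,16#0301] together and only 
moves the cluster as a whole.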

This is fine with your plan -- if I force a single code point 
representation, this is a non-issue.

The tricky thing is that if I enter a string containing " ́e" in my 
module and later reverse it, I will get "é" and not "e ́" as a final 
result. What was initially [16#0301,16#0065] gets reversed into 
[16#0065,16#0301], which is not the same as the visually correct 
reversal "e ́" (represented as [16#0065, $ , 16#0301], with the 
implicit space made explicit).

It works one way (starting in the right direction, then reversing), but 
unless you are very careful, it can break when going the other way 
(starting with two uncombined code points that get assembled into the 
same cluster when reversed).

Just changing to single code point representations isn't enough to make 
sure nothing is broken.
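
To make that concrete, here is a rough sketch that normalizes to NFC 
(the "single code point where possible" form) before reversing. It 
assumes unicode:characters_to_nfc_list/1 from OTP 20 or later, which was 
not available when this was written; the module and function names are 
hypothetical.

```erlang
-module(nfc_reverse).
-export([is_nfc/1, reverse_nfc/1]).

%% True when Str (a flat list of code points) is already in NFC,
%% i.e. normalizing it changes nothing.
is_nfc(Str) ->
    unicode:characters_to_nfc_list(Str) =:= Str.

%% Normalize to NFC first, then reverse code point by code point.
reverse_nfc(Str) ->
    lists:reverse(unicode:characters_to_nfc_list(Str)).
```

The input [16#0301,16#0065] is already in NFC, because the combining 
accent has no preceding base character to compose with. Reversing it gives 
[16#0065,16#0301], which is no longer in NFC and normalizes to the single 
code point [16#00E9]. So the output of a "normalize, then reverse" step 
can itself need normalizing, and it now displays as a different 
character.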

On 12-07-31 10:04 AM, Richard Carlsson wrote:
> No, you're confusing Unicode (a sequence of code points) with specific 
> encodings such as UTF-8 and UTF-16. The first is downwards compatible 
> with Latin-1: the values from 128 to 255 are the same. In UTF-8 
> they're not. At runtime, Erlang's strings are just plain sequences of 
> Unicode code points (you can think of it as UTF-32 if you like). 
> Whether the source code is encoded in UTF-8 or Latin-1 or any other 
> encoding is irrelevant as long as the compiler knows how to transform 
> the input to the single-codepoint representation.
>
> For example, reversing a Unicode string is a bad idea anyway because 
> it could contain combining characters, and reversing the order of the 
> codepoints in that case will create an illegal string. But an 
> expression like lists:reverse("a∞b") will be working on the list [97, 
> 8734, 98] (once the compiler has been extended to accept other 
> encodings than Latin-1), not the list [97,226,136,158,98], so it will 
> produce the intended "b∞a". This string might then become encoded as 
> UTF-8 on its way to your terminal, but that's another story.
>
>     /Richard
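
The code point versus byte distinction in the quoted message can be 
sketched as follows. This is only an illustration: the module and function 
names are invented, and it relies on unicode:characters_to_binary/1 
producing UTF-8 by default.

```erlang
-module(codepoints_vs_utf8).
-export([utf8_bytes/1, reverse_codepoints/1, reverse_bytes/1]).

%% The UTF-8 encoding of a code point list, as a list of bytes.
utf8_bytes(Str) ->
    binary_to_list(unicode:characters_to_binary(Str)).

%% Reversal on the code point list, the case described in the quote.
reverse_codepoints(Str) ->
    lists:reverse(Str).

%% Reversal on the encoded bytes, which scrambles multi-byte sequences.
reverse_bytes(Str) ->
    lists:reverse(utf8_bytes(Str)).
```

For [97,8734,98], utf8_bytes/1 gives [97,226,136,158,98]. 
reverse_codepoints/1 gives [98,8734,97], while reverse_bytes/1 gives 
[98,158,136,226,97], which is no longer valid UTF-8.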
