<div class="gmail_quote">On Tue, Jul 31, 2012 at 4:04 PM, Richard Carlsson <span dir="ltr"><<a href="mailto:carlsson.richard@gmail.com" target="_blank">carlsson.richard@gmail.com</a>></span> wrote:</div><div class="gmail_quote">

<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div class="im">On 07/31/2012 01:48 PM, Ian wrote:<br>

<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

<< A "string" is a list of integers where the integers<br>

       represent Unicode codepoints. >><br>

</blockquote>

<br>

I think this is technically correct, but it is very confusing because it<br>

implies that the source file may be encoded as unicode.<br>

<br>

As I understand it, source files are always treated as being in Latin-1.<br>

This means that string literals are lists of Latin-1 values, and not<br>

lists of unicode codepoints. (The values from 128 to 255 have<br>

different/no meanings, and values > 255 will not happen).<br>

</blockquote>

<br></div>

No, you're confusing Unicode (a sequence of code points) with specific encodings such as UTF-8 and UTF-16. The first is downwards compatible with Latin-1: the values from 128 to 255 are the same. In UTF-8 they're not. At runtime, Erlang's strings are just plain sequences of Unicode code points (you can think of it as UTF-32 if you like). Whether the source code is encoded in UTF-8 or Latin-1 or any other encoding is irrelevant as long as the compiler knows how to transform the input to the single-codepoint representation.<br>


<br>

For example, reversing a Unicode string is a bad idea anyway because it could contain combining characters, and reversing the order of the codepoints in that case will create an illegal string. But an expression like lists:reverse("a∞b") will be working on the list [97, 8734, 98] (once the compiler has been extended to accept other encodings than Latin-1...</blockquote>

<div><br></div><div>Actually, try this:</div><div><br></div><div>1. Set your environment to UTF-8. In my case, in any Linux terminal running Bash: export LANG="en_US.utf8" (use locale to find your environment's language definition; for Latin-1 it would be "en_US.latin1").</div>

<div>2. In a module:</div><div><br></div><div>test_reverse(String) -> lists:reverse(String).</div><div><br></div><div>3. Pass your own example as the argument.</div><div>4. Check the output.</div><div><br></div>

<div>Pretty interesting to see how Erlang "knows" about UTF-8 encoding, isn't it? (You can also try lists:reverse("a∞b") directly in the shell, and it will be reversed as expected, as a 3-element list.) Actually, it knows nothing about UTF-8: it relies on the environment to extract the integers for the list, which here mimics knowledge of UTF-8.</div>
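<div><br></div><div>To make the two lists in this thread concrete, here is a minimal sketch that does not depend on the terminal or on the source file encoding, because it spells the code points out as integers. The module and function names (reverse_demo and friends) are made up for illustration; only lists:reverse/1, binary_to_list/1, list_to_binary/1 and the unicode module calls are standard.</div>

```erlang
-module(reverse_demo).
-export([code_points/0, utf8_bytes/0,
         reversed_code_points/0, reversed_bytes_decoded/0]).

%% "a∞b" written out as Unicode code points, so nothing here depends
%% on how the source file or the terminal is encoded.
code_points() -> [97, 8734, 98].

%% The same text as a list of UTF-8 bytes (one integer per byte).
utf8_bytes() ->
    binary_to_list(unicode:characters_to_binary(code_points())).

%% Reversing the code point list keeps the infinity sign in one piece.
reversed_code_points() ->
    lists:reverse(code_points()).

%% Reversing the byte list puts the continuation bytes ahead of the
%% lead byte, so decoding it as UTF-8 fails with an error tuple
%% instead of returning a list.
reversed_bytes_decoded() ->
    Bytes = list_to_binary(lists:reverse(utf8_bytes())),
    unicode:characters_to_list(Bytes, utf8).
```

<div>Comparing reverse_demo:code_points() with reverse_demo:utf8_bytes() shows the 3-element and 5-element lists discussed here. Which of the two a string literal turns into is exactly the part that depends on the environment.</div>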

<div><br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">...), not the list [97,226,136,158,98], so it will produce the intended "b∞a". This string might then become encoded as UTF-8 on its way to your terminal, but that's another story.</blockquote>

<div><br></div><div>I would extend the last part ("on its way to your terminal") to cover the way "from" your terminal as well: it seems that both directions are valid, even if that can break the code.</div>

<div><br></div><div>I agree that for string literals, what you said is always true.</div><div> </div><div>CGS</div><div><br></div></div>