<p dir="ltr">Currently, if you encode your source code as UTF-8, string literals become the literal byte sequences: each byte ends up as its own element of the list. This is different from the shell, where string literals are automatically turned into their Unicode code points.</p>


<p dir="ltr">It appears that the solution is a compiler flag that tells the compiler to decode string literals as UTF-8. Then, when the compiler reads the two-byte sequence 16#C3 16#A9, it knows to put a single 233 in the list, because 16#C3 16#A9 is the UTF-8 encoding of code point 233.</p>
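<p dir="ltr">As a rough illustration of the difference, here is a small shell sketch. It does not show the proposed compiler flag itself (I am not assuming any particular flag name); it only uses a binary to stand in for the bytes in the source file and compares the two ways of reading them:</p>

<pre>
%% The two bytes that a UTF-8 encoded source file contains for "é".
Bytes = &lt;&lt;16#C3, 16#A9&gt;&gt;.

%% Reading the file as Latin-1: one list element per byte.
[195, 169] = binary_to_list(Bytes).

%% Decoding the same bytes as UTF-8: a single code point.
[233] = unicode:characters_to_list(Bytes, utf8).
</pre>

<p dir="ltr">The first result is what a string literal looks like today, the second is what it would look like if the compiler decoded the literal.</p>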


<p dir="ltr">I don't know how well chardata() and charlist() are supported across the standard lib, so doing this may cause many headaches when someone tries to stuff a charlist() where an iolist() goes, or chardata() where a string() goes. This may introduce subtle bugs that only occur when non-Latin-1 characters are used.</p>
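<p dir="ltr">A minimal shell sketch of the kind of failure I have in mind, using 1234 as an arbitrary code point above 255 (the values are only for illustration):</p>

<pre>
%% Code points up to 255 happen to be valid bytes, so this goes through.
%% Note that the result is the single Latin-1 byte, not UTF-8.
&lt;&lt;233&gt;&gt; = iolist_to_binary([233]).

%% A code point above 255 is not a byte, so the same call fails.
{'EXIT', {badarg, _}} = (catch iolist_to_binary([1234])).

%% The unicode module accepts chardata() and encodes it as UTF-8.
&lt;&lt;16#D3, 16#92&gt;&gt; = unicode:characters_to_binary([1234]).
</pre>

<p dir="ltr">So code that treats a charlist() as an iolist() keeps working for Latin-1 text and only breaks once a character above 255 shows up.</p>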


<p dir="ltr">Eric.</p>

<div class="gmail_quote">On Aug 1, 2012 4:39 AM, "Richard Carlsson" &lt;<a href="mailto:carlsson.richard@gmail.com">carlsson.richard@gmail.com</a>&gt; wrote:<br type="attribution"><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

On 08/01/2012 12:52 AM, CGS wrote:<br>

<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

Actually, try this:<br>

<br>

1. set your environment to UTF-8 (in my case, whatever Linux terminal<br>

with BASH environment, export LANG="en_US.utf8", use locale to find your<br>

environment language definition - "en_US.latin1" for LATIN-1)<br>

2. in a module:<br>

<br>

test_reverse(String) -> lists:reverse(String).<br>

<br>

3. Give as parameter the example given by yourself.<br>

4. Check the output.<br>

</blockquote>

<br>

Ah, but when you say "give as parameter" you mean "pass it a string literal from the shell", right? I never said anything about strings in the shell - that's a different environment from source files, and as you described, the shell nowadays detects your locale and translates UTF-8 console input into a string literal containing Unicode code points. This is exactly how it would happen in source code as well, if the compiler only knew how to detect that a source file is in a different encoding from Latin1. So the compiler is really the main thing that needs to be fixed, and then there should be no surprises on the encoding level anymore.<br>


<br>

    /Richard<br>

<br>

_______________________________________________<br>

erlang-questions mailing list<br>

<a href="mailto:erlang-questions@erlang.org" target="_blank">erlang-questions@erlang.org</a><br>

<a href="http://erlang.org/mailman/listinfo/erlang-questions" target="_blank">http://erlang.org/mailman/listinfo/erlang-questions</a><br>

</blockquote></div>