[erlang-questions] unicode in string literals

Tue Jul 31 10:02:58 CEST 2012

On 07/31/2012 09:32 AM, Michel Rijnders wrote:
> IMO this doesn't solve the problem, and only confuses the issue;
> consider the following:
>
> test() ->
>      io:format("~w~n", ["Just my €0.02"]),
>      io:format("~w~n", [lists:reverse("Just my €0.02")]).
>
>> test().
> [74,117,115,116,32,109,121,32,226,130,172,48,46,48,50]
> [50,48,46,48,172,130,226,32,121,109,32,116,115,117,74]

Yes, this is what happens today, because all the parts involved (including
the call to io:format with ~w) assume Latin-1 and just pass all the
bytes straight through. Basically, it's your editor and terminal that
are lying by displaying a particular sequence of 3 bytes as € although 
the program is really using Latin-1. They conspire against you to make 
you think that things are working correctly.
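
To make that concrete, here is a minimal sketch that reads the three bytes
from the output above under both encodings. The module and function names
are made up for illustration; the only library call is
unicode:characters_to_list/2 from the standard unicode module.

```erlang
-module(euro_bytes).
-export([as_latin1/0, as_utf8/0]).

%% The three bytes that the editor and terminal display as the euro sign.
bytes() -> <<226,130,172>>.

%% Read as Latin-1, every byte is its own character, so we get three
%% code points back.
as_latin1() -> unicode:characters_to_list(bytes(), latin1).

%% Read as UTF-8, the same three bytes decode to a single code point,
%% U+20AC (8364).
as_utf8() -> unicode:characters_to_list(bytes(), utf8).
```

as_latin1() gives [226,130,172], which is what the compiler sees today, and
as_utf8() gives [8364], which is what the editor and terminal pretend is
there.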

> If the list data was kept as UTF-8 then the output of the second
> statement should be:
> [50,48,46,48,226,130,172,32,121,109,32,116,115,117,74]

That would only be the result if you used a single code point 
representation for the input to reverse, and then converted the result 
back to a byte encoding (e.g. by printing with ~ts).
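
As a rough sketch of that round trip (decode to code points, reverse,
encode back to UTF-8), assuming the input bytes are the ones from your
first output line. The module and function names are hypothetical; the
conversions use unicode:characters_to_list/2 and
unicode:characters_to_binary/3.

```erlang
-module(euro_reverse).
-export([reversed_utf8/0]).

%% UTF-8 bytes of the example string, copied from the quoted output.
utf8_bytes() ->
    [74,117,115,116,32,109,121,32,226,130,172,48,46,48,50].

reversed_utf8() ->
    %% Decode the bytes into code points; the euro sign becomes 8364.
    CodePoints = unicode:characters_to_list(list_to_binary(utf8_bytes()), utf8),
    %% Reverse the list of code points, not the list of bytes.
    Reversed = lists:reverse(CodePoints),
    %% Encode the result as UTF-8 again and return the bytes as a list.
    binary_to_list(unicode:characters_to_binary(Reversed, unicode, utf8)).
```

reversed_utf8() returns the list you expected,
[50,48,46,48,226,130,172,32,121,109,32,116,115,117,74], but only because
the reversal was done on code points and the bytes were produced afterwards.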

> The above of course depends on whether you view strings as lists of
> bytes vs lists of characters.

Strings are lists of characters (code points), so when your example gets 
through tokenization, the encoding from the file would already be 
forgotten, and you'd have a single integer for the €. (The same goes for 
atoms and variable names, by the way, so the answer to Vlad's question
is that these will also get a greater range of available characters.)
String manipulation functions should assume they are working on single 
code points, not on a byte encoding.

     /Richard