[erlang-questions] unicode in string literals
Richard Carlsson
carlsson.richard@REDACTED
Tue Jul 31 10:02:58 CEST 2012
On 07/31/2012 09:32 AM, Michel Rijnders wrote:
> IMO this doesn't solve the problem, and only confuses the issue;
> consider the following:
>
> test() ->
>     io:format("~w~n", ["Just my €0.02"]),
>     io:format("~w~n", [lists:reverse("Just my €0.02")]).
>
>> test().
> [74,117,115,116,32,109,121,32,226,130,172,48,46,48,50]
> [50,48,46,48,172,130,226,32,121,109,32,116,115,117,74]
Yes, this is what happens today, because all the parts involved (including
the call to io:format with ~w) assume Latin-1 and just pass all the
bytes straight through. Basically, it's your editor and terminal that
are lying by displaying a particular sequence of 3 bytes as € although
the program is really using Latin-1. They conspire against you to make
you think that things are working correctly.
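
To make the mismatch concrete, here is a minimal sketch (the module name
euro_bytes and the function demo/0 are made up for illustration) that
decodes the same three bytes under both assumptions, using
unicode:characters_to_list/2:

```erlang
-module(euro_bytes).
-export([demo/0]).

demo() ->
    %% The three bytes that a UTF-8 encoded source file stores for the euro sign.
    Bytes = list_to_binary([226,130,172]),
    %% Read as UTF-8, they decode to a single code point, U+20AC (8364).
    [8364] = unicode:characters_to_list(Bytes, utf8),
    %% Read as Latin-1, they remain three separate characters.
    [226,130,172] = unicode:characters_to_list(Bytes, latin1),
    ok.
```

Both matches succeed, so the list you get depends entirely on which
encoding the reader assumes for the bytes.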
> If the list data was kept as UTF-8 then the output of the second
> statement should be:
> [50,48,46,48,226,130,172,32,121,109,32,116,115,117,74]
That would only be the result if you used a single code point
representation for the input to reverse, and then converted the result
back to a byte encoding (e.g. by printing with ~ts).
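
A rough sketch of that round trip, assuming the input bytes are UTF-8
(the module name reverse_demo is hypothetical; the unicode module calls
are the standard ones):

```erlang
-module(reverse_demo).
-export([demo/0]).

demo() ->
    %% The bytes of "Just my €0.02" as they appear in the first output line.
    Bytes = [74,117,115,116,32,109,121,32,226,130,172,48,46,48,50],
    %% Decode the UTF-8 bytes into a list of code points; the euro sign becomes 8364.
    Chars = unicode:characters_to_list(list_to_binary(Bytes), utf8),
    %% Reverse the code points, not the bytes.
    Reversed = lists:reverse(Chars),
    %% Encode the reversed code points back to UTF-8 bytes.
    OutBytes = binary_to_list(unicode:characters_to_binary(Reversed, unicode, utf8)),
    %% The euro sign's three bytes stay in their original order.
    [50,48,46,48,226,130,172,32,121,109,32,116,115,117,74] = OutBytes,
    ok.
```

The final match is the list you expected, but it only appears after the
explicit decode and re-encode steps.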
> The above of course depends on whether you view strings as lists of
> bytes vs lists of characters.
Strings are lists of characters (code points), so when your example gets
through tokenization, the encoding from the file would already be
forgotten, and you'd have a single integer for the €. (The same goes for
atoms and variable names, by the way, so the answer to Vlad's question
is that these will also get a greater range of available characters.)
String manipulation functions should assume they are working on single
code points, not on a byte encoding.
/Richard