[erlang-questions] source file encoding

Tue Dec 20 13:00:01 CET 2011

On 12/19/2011 11:18 AM, Justus wrote:
> Hi all,
>
> Strings must be in the ISO-latin-1 character set. I remember that
> errors will be reported if other characters occurring in a .erl file
> when compiling.
>
> But when trying R15B, it looks that values beyond ISO-latin-1 are also
> accepted. So now, we can use UTF8 without BOM encoding, and with the
> help of ct_expand, I managed to say "hello world" in Chinese
> literally.
>
> I wonder is there any plan add Unicode support in string- and
> character-literals?
>
> -compile({parse_transform, ct_expand}).
>
> -define(STR(S), ct_expand:term(unicode:characters_to_list(list_to_binary(S)))).
>
> hello_world() ->
>      S = ?STR("你好, 世界"),
>      io:format("~ts~n", [S]).
>

The code that you wrote is actually the following:

     S = ?STR("ä½ å¥½, ä¸–ç•Œ"),

Even if your editor shows you Chinese characters and saves the file as 
UTF-8, Erlang still treats the input as Latin-1. (Every byte sequence is 
valid Latin-1, so there is no foolproof way of telling UTF-8 files apart 
from Latin-1 files automatically.)
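
To make the byte-level view concrete, here is a minimal sketch. The module and function names (enc_demo, demo/0) are made up for illustration. The bytes are written out as integers so the example does not depend on how this message or your source file is encoded. It shows the UTF-8 bytes for the first two characters of your string read one byte per character (the Latin-1 view), and then decoded as UTF-8 the way your ?STR macro does:

```erlang
-module(enc_demo).
-export([demo/0]).

%% Hypothetical demo module; the names are illustrative only.
demo() ->
    %% UTF-8 encoding of U+4F60 and U+597D (three bytes each).
    Utf8 = <<228,189,160, 229,165,189>>,
    %% Latin-1 view: every byte becomes one character.
    AsLatin1 = binary_to_list(Utf8),
    %% UTF-8 view: the six bytes decode to two code points.
    AsUnicode = unicode:characters_to_list(Utf8, utf8),
    {length(AsLatin1), AsUnicode}.
```

Calling enc_demo:demo() should return {6, [20320,22909]}: six Latin-1 characters, or two Unicode code points, from the same bytes.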

To understand where things can go wrong if you start saving source files 
as UTF-8, consider the following two modules:

-module(m1).
...
     Pid ! "Mickaël",
     ...

-module(m2).
...
     receive
       "Mickaël" -> ok
     end
     ...

Assume that the first is saved as Latin-1 and the second as UTF-8. 
Even though they may look the same to your eyes (because your editor 
hides the difference), the code in the second file is really waiting for 
the following string, and the program will not work:

     receive
       "Micka\303\253l" -> ok
     end
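
The mismatch can be checked directly. The following is only a sketch, with a made-up module name (mismatch_demo), and it uses numeric escapes so that the result does not depend on the encoding of the file it is saved in:

```erlang
-module(mismatch_demo).
-export([compare/0]).

%% Hypothetical demo module; the names are illustrative only.
compare() ->
    %% What m1 sends when saved as Latin-1: e-diaeresis is one character, 235.
    Sent = "Micka" ++ [235] ++ "l",
    %% What m2 waits for when saved as UTF-8 but read as Latin-1:
    %% the two bytes 195,171 appear as two separate characters.
    Expected = "Micka\303\253l",
    %% Decoding the bytes as UTF-8 recovers the intended string.
    Decoded = unicode:characters_to_list(list_to_binary(Expected), utf8),
    {Sent =:= Expected, length(Sent), length(Expected), Decoded =:= Sent}.
```

Here mismatch_demo:compare() should return {false,7,8,true}: the two literals differ in length and content, so the receive clause in m2 never matches the message from m1.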

   /Richard