[erlang-questions] source file encoding

Thu Dec 29 09:08:19 CET 2011

On 12-20 13:00, Richard Carlsson wrote:
> On 12/19/2011 11:18 AM, Justus wrote:
> >Hi all,
> >
> >Strings must be in the ISO-latin-1 character set. I remember that
> >errors are reported at compile time if other characters occur in
> >a .erl file.
> >
> >But when trying R15B, it looks like values beyond ISO-latin-1 are also
> >accepted. So now we can use UTF-8 encoding without a BOM, and with the
> >help of ct_expand, I managed to say "hello world" in Chinese
> >literally.
> >
> >I wonder, is there any plan to add Unicode support in string and
> >character literals?
> >
> >-compile({parse_transform, ct_expand}).
> >
> >-define(STR(S), ct_expand:term(unicode:characters_to_list(list_to_binary(S)))).
> >
> >hello_world() ->
> >     S = ?STR("你好, 世界"),
> >     io:format("~ts~n", [S]).
> >
> 
> The code that you wrote is actually the following:
> 
>     S = ?STR("ä½ å¥½, ä¸–ç•Œ"),
> 
> Even if your editor shows you Chinese characters and saves the file
> as utf-8, Erlang still treats the input as Latin-1. (All byte
> sequences are valid latin-1, so there is no foolproof way of
> separating utf-8 files from latin-1 files automatically).
> 
> To understand where things can go wrong if you start saving source
> files as utf-8, consider the following two modules:
> 
> -module(m1).
> ...
>     Pid ! "Mickaël",
>     ...
> 
> -module(m2).
> ...
>     receive
>       "Mickaël" -> ok
>     end
>     ...
> 
> Assume that the first is saved with Latin-1 and the second with
> UTF-8. Even though they may look the same to your eyes (because your
> editor hides the difference) the code in the second file is really
> waiting for the following string, and the program will not work:
> 
>     receive
>       "Micka\303\253l" -> ok
>     end
> 
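
The mismatch described above can be checked at the byte level. This is
only a minimal sketch: the module name encoding_demo and its function
names are made up for illustration, and the literals are written as
explicit integers so the example does not depend on how this file itself
is saved.

```erlang
-module(encoding_demo).
-export([latin1_saved/0, utf8_saved/0, run/0]).

%% "Mickaël" as the compiler sees it when the file was saved as Latin-1:
%% the letter e-with-diaeresis is the single byte 16#EB (235).
latin1_saved() -> [$M, $i, $c, $k, $a, 16#EB, $l].

%% The same literal when the file was saved as UTF-8 but is still read
%% byte by byte: the letter becomes two bytes, 16#C3 16#AB
%% (octal \303\253, as in the quoted example).
utf8_saved() -> [$M, $i, $c, $k, $a, 16#C3, 16#AB, $l].

run() ->
    %% The two lists differ, so a receive clause written in one file
    %% would not match a message sent from the other.
    false = (latin1_saved() =:= utf8_saved()),
    %% Reinterpreting the bytes as UTF-8 recovers the code points; this
    %% is the conversion the ?STR macro earlier in the thread performs.
    Decoded = unicode:characters_to_list(list_to_binary(utf8_saved())),
    true = (Decoded =:= latin1_saved()),
    io:format("~w~n~w~n~w~n", [latin1_saved(), utf8_saved(), Decoded]),
    ok.
```

Calling encoding_demo:run() from the shell prints the three lists and
returns ok if the assumptions above hold.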

This is how erl_scan works: when it finds the opening " of a string,
it doesn't analyze what comes next, it just searches for the closing "
(modulo skipping over escaped \"), so between the two quotes there can be
anything at all (but not newlines). The content is then taken as is.
Kind of a bug, actually.
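
A small way to observe this from the shell is to feed erl_scan:string/1 a
quoted literal and look at the token it returns. This is a rough sketch,
not part of the patch: the module name scan_demo and the helper
scan_literal/1 are hypothetical, and details may differ between OTP
releases.

```erlang
-module(scan_demo).
-export([scan_literal/1, run/0]).

%% Wrap the given characters in double quotes, append a dot, scan the
%% result as Erlang source, and return the value of the string token.
scan_literal(Chars) ->
    Source = [$"] ++ Chars ++ [$", $.],
    {ok, [{string, _Anno, Value}, {dot, _}], _End} = erl_scan:string(Source),
    Value.

run() ->
    %% The two UTF-8 bytes 16#C3 16#AB pass through the scanner as two
    %% separate characters; nothing decodes them.
    Raw = [$M, $i, $c, $k, $a, 16#C3, 16#AB, $l],
    Raw = scan_literal(Raw),
    %% An escaped quote does not terminate the literal.
    "a\"b" = scan_literal([$a, $\\, $", $b]),
    ok.
```

scan_demo:run() returns ok when both matches succeed, that is, when the
scanner hands back exactly the characters it was given.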

Check this https://github.com/baryluk/otp/tree/source_code_encoding_in_compiler_and_epp

I have a WIP branch which lets you choose the file encoding for source
files. The core is a single small change in the erl_scan.erl module, but
the patch is slightly larger, to make it fully configurable and to add
documentation and tests, so it is still not complete. Basically I use it
for UTF-8 files (it is the safest choice, also when reading ASCII-encoded
files).

Here is a diff
https://github.com/baryluk/otp/compare/master...source_code_encoding_in_compiler_and_epp

Regards,
Witek

-- 
Witold Baryluk