[erlang-questions] source file encoding
Witold Baryluk
baryluk@REDACTED
Thu Dec 29 09:08:19 CET 2011
On 12-20 13:00, Richard Carlsson wrote:
> On 12/19/2011 11:18 AM, Justus wrote:
> >Hi all,
> >
> >Strings must be in the ISO-latin-1 character set. I remember that
> >errors will be reported if other characters occurring in a .erl file
> >when compiling.
> >
> >But when trying R15B, it looks like values beyond ISO-latin-1 are also
> >accepted. So now, we can use UTF8 without BOM encoding, and with the
> >help of ct_expand, I managed to say "hello world" in Chinese
> >literally.
> >
> >I wonder, is there any plan to add Unicode support in string- and
> >character-literals?
> >
> >-compile({parse_transform, ct_expand}).
> >
> >-define(STR(S), ct_expand:term(unicode:characters_to_list(list_to_binary(S)))).
> >
> >hello_world() ->
> > S = ?STR("你好, 世界"),
> > io:format("~ts~n", [S]).
> >
>
> The code that you wrote is actually the following:
>
> S = ?STR("ä½ å¥½, ä¸ç"),
>
> Even if your editor shows you chinese characters and saves the file
> as utf-8, Erlang still treats the input as Latin-1. (All byte
> sequences are valid latin-1, so there is no foolproof way of
> separating utf-8 files from latin-1 files automatically).
>
> To understand where things can go wrong if you start saving source
> files as utf-8, consider the following two modules:
>
> -module(m1).
> ...
> Pid ! "Mickaël",
> ...
>
> -module(m2).
> ...
> receive
> "Mickaël" -> ok
> end
> ...
>
> Assume that the first is saved with Latin-1 and the second with
> UTF-8. Even though they may look the same to your eyes (because your
> editor hides the difference) the code in the second file is really
> waiting for the following string, and the program will not work:
>
> receive
> "Micka\303\253l" -> ok
> end
>
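A minimal sketch of the mismatch Richard describes, written so that it does not depend on how this snippet itself is saved: the non-ASCII character is given as a code point rather than typed literally. The module name enc_demo and its function names are made up for illustration. latin1_view/1 computes the list a Latin-1 scanner ends up with when the file bytes are really UTF-8, and reread_as_utf8/1 is the same conversion the ?STR macro above performs.

```erlang
-module(enc_demo).
-export([latin1_view/1, reread_as_utf8/1, demo/0]).

%% Chars is a list of Unicode code points. Encode it as UTF-8 and then
%% read every byte back as one character, which is what a scanner that
%% assumes Latin-1 does with a UTF-8 encoded string literal.
latin1_view(Chars) ->
    binary_to_list(unicode:characters_to_binary(Chars, unicode, utf8)).

%% The reverse step: treat a list of byte values as UTF-8 data and decode
%% it to code points (same calls as in the ?STR macro).
reread_as_utf8(Bytes) ->
    unicode:characters_to_list(list_to_binary(Bytes)).

demo() ->
    Sent = "Micka" ++ [16#EB] ++ "l",   % the literal in m1, file saved as Latin-1
    Expected = latin1_view(Sent),        % the literal in m2, file saved as UTF-8
    {Sent, Expected, Sent =:= Expected}.
```

Here enc_demo:demo() returns {[77,105,99,107,97,235,108], [77,105,99,107,97,195,171,108], false}. The two bytes 195,171 are the \303\253 from the example above, so the receive in m2 never matches the message sent by m1.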
This is how erl_scan works: when it finds the opening " of a string, it
does not analyze what comes next, it just searches for the closing "
(skipping over any escaped \"), so between the two quotes there can be
anything at all (but not new lines). The contents are then taken as is.
Kind of a bug, actually.

Check this: https://github.com/baryluk/otp/tree/source_code_encoding_in_compiler_and_epp

I have a WIP branch which lets you choose the file encoding for source
files. It is actually a single small change in the erl_scan.erl module, but
the patch is slightly larger, to make it fully configurable and to include
documentation and tests, so it is still not complete. Basically I use it for
UTF-8 files (that is the safest choice, and it also works when reading
ASCII-encoded files).
Here is a diff
https://github.com/baryluk/otp/compare/master...source_code_encoding_in_compiler_and_epp
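
To see the pass-through behaviour of the scanner described above without touching any files, one can feed erl_scan:string/1 a character list directly. This is only a sketch: the module name scan_demo and the function string_token/1 are made up for illustration, and it exercises whatever erl_scan ships with your OTP release, not the patched one from my branch.

```erlang
-module(scan_demo).
-export([string_token/1]).

%% Source is a character list holding one string literal followed by a
%% dot, for example the characters of "abc". (quotes included).
%% Returns the characters the scanner stored in the string token.
string_token(Source) ->
    {ok, [{string, _Anno, Chars}, {dot, _}], _End} = erl_scan:string(Source),
    Chars.
```

Calling scan_demo:string_token([$", $M, $i, $c, $k, $a, 195, 171, $l, $", $.]) gives back [77,105,99,107,97,195,171,108]: the two UTF-8 bytes of the accented letter arrive in the token as two separate characters, and nothing in between tries to decode them.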
Regards,
Witek
--
Witold Baryluk