[erlang-bugs] Unicode bug in io:format

Tue Nov 22 11:02:10 CET 2011

I thought it might have something to do with io:setopts() being called 
when -noinput is absent, and not when it is present; the evidence is 
mixed, but I think I may be on to something useful.

Consider the following extension of your program:

    -module(unicode_test).
    -export([main/0]).

    main() ->
        print(),
        ok = io:setopts(standard_io, [{encoding, unicode}]),
        print().

    print() ->
        io:format("Encoding=~p~n",
                  [lists:keyfind(encoding, 1, io:getopts())]),
        %% The test string as a list of Unicode code points.
        io:format("~ts~n",
                  [[1058,1077,1089,1090,1086,1074,1072,1103,32,
                    1089,1090,1088,1086,1082,1072]]),
        %% The same string as a literal in the source file.
        io:format("~ts~n", ["Тестовая строка"]).

Without -noinput (and with LANG=da_DK.utf8), I get:

    1> Encoding={encoding,latin1}
    Тестовая строка
    Ð¢ÐµÑÑ‚Ð¾Ð²Ð°Ñ ÑÑ‚Ñ€Ð¾ÐºÐ°
    Encoding={encoding,latin1}
    Тестовая строка
    Ð¢ÐµÑÑ‚Ð¾Ð²Ð°Ñ ÑÑ‚Ñ€Ð¾ÐºÐ°

That is, the list-of-integers version is OK in both cases, while the
string-literal version is garbled in both.

With -noinput, I get:

    Encoding={encoding,latin1}
    \x{422}\x{435}\x{441}\x{442}\x{43E}\x{432}\x{430}\x{44F} \x{441}\x{442}\x{440}\x{43E}\x{43A}\x{430}
    Тестовая строка
    Encoding={encoding,unicode}
    Тестовая строка
    Ð¢ÐµÑÑ‚Ð¾Ð²Ð°Ñ ÑÑ‚Ñ€Ð¾ÐºÐ°

That is, the string-literal version is good at first, but after the call
to io:setopts(), the list-of-integers version is the good one.
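
One way to read the garbled lines is as UTF-8 that has been encoded
twice. The sketch below assumes (this is not confirmed in this thread)
that the compiler reads the source file as Latin-1, so the literal ends
up as one list element per byte. The module and function names are made
up for illustration; only the unicode module calls are real.

```erlang
-module(double_encoding_sketch).
-export([demo/0]).

demo() ->
    %% "Тест" as Unicode code points.
    CodePoints = [1058,1077,1089,1090],
    %% The UTF-8 bytes an editor would store for that literal.
    Utf8 = unicode:characters_to_binary(CodePoints),
    %% Assumption: a compiler reading the file as Latin-1 yields one
    %% list element per byte instead of one per character.
    AsCompiled = binary_to_list(Utf8),
    %% Treating those bytes as code points and encoding them again
    %% doubles the size; this is the suspected source of the garbage.
    Twice = unicode:characters_to_binary(AsCompiled),
    {AsCompiled, byte_size(Utf8), byte_size(Twice)}.
```

Under that assumption, a latin1 device would pass the byte list through
unchanged (which a UTF-8 terminal then displays correctly), while a
unicode device would encode each byte a second time.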

So, if you explicitly select unicode encoding in your program, you have 
consistent behaviour.
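
A minimal sketch of that approach, assuming the string arrives as a list
of UTF-8 bytes (as the literal in the compiled module appears to) and
should be printed as characters. The module name and both function names
are hypothetical; io:setopts/2 and unicode:characters_to_list/2 are the
standard calls.

```erlang
-module(unicode_out_sketch).
-export([to_codepoints/1, print/1]).

%% Convert a list of UTF-8 bytes (integers 0..255) into a list of
%% Unicode code points. Input that is not valid UTF-8 is returned
%% unchanged.
to_codepoints(Bytes) ->
    case unicode:characters_to_list(list_to_binary(Bytes), utf8) of
        Chars when is_list(Chars) -> Chars;
        _ErrorOrIncomplete -> Bytes
    end.

%% Select unicode encoding explicitly, then print code points only.
print(Bytes) ->
    ok = io:setopts(standard_io, [{encoding, unicode}]),
    io:format("~ts~n", [to_codepoints(Bytes)]).
```

For example, to_codepoints([208,162,208,181,209,129,209,130]) gives
[1058,1077,1089,1090], which ~ts can then print on a unicode device.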

The only thing that bothers me is that something else appears to be
going on; it is not just about the encoding.
Without -noinput, the output is the same no matter what I set the
encoding to. With -noinput, on the other hand, the output differs
depending on whether I select latin1 or unicode encoding.

Hoping this helps.
/Erik

On 21-11-2011 22:42, eurekafag wrote:
> Thanks, I'm aware of it. The problem is the different behavior with
> and without -noinput. I'm just curious which case is right and why it
> makes a difference at all. I explicitly define that binary string as
> utf8-encoded, but it only works with -noinput and fails without it. On
> the other hand, a list without any unicode letters at all (only
> integers) is printed as hex values with -noinput and as text without
> it. It would be understandable if this were some kind of parser
> problem, with the parser wanting latin-1 letters in the source, but
> what's wrong with a plain list of integers, which it fails to output
> as a string? The problem is that those two cases are mutually
> exclusive: one of them works with -noinput and fails without it, and
> vice versa. So I'm curious which method I should use so that it works
> as expected.
>
> On 22 November 2011 at 0:19, Paul Davis
> <paul.joseph.davis@REDACTED> wrote:
>
>     Oh, good call. I just pasted your code into the shell and it worked.
>     But then when compiling it into a file it breaks like you have.
>     Specifically, the UTF-8 literal in the source file is broken. This
>     suggests that the Erlang compiler doesn't like UTF-8 literals, and
>     sure enough, a quick google brought up a post:
>
>     http://erlang.2086793.n4.nabble.com/utf8-in-source-files-td3031128.html
>
>     Which references:
>
>     http://www.erlang.org/doc/apps/stdlib/unicode_usage.html
>
>     HTH,
>     Paul Davis
>
>     On Mon, Nov 21, 2011 at 2:06 PM, eurekafag <eurekafag@REDACTED>
>     wrote:
>     > What exactly do you get? Please provide the full output of both
>     > cases, with and without -noinput. I tried export LANG=en_US.UTF-8
>     > (my system-wide locale is ru_RU.UTF-8) and I still get the same
>     > result.
