[erlang-bugs] Unicode bug in io:format

Erik Søe Sørensen ess@REDACTED
Tue Nov 22 14:02:55 CET 2011


On 22-11-2011 13:11, eurekafag wrote:
> Many thanks for this thorough research! However I have two things to 
> mention. Setting or getting encoding introduces noticeable delay in 
> launching without -noinput, but with it it starts just as fast as 
> usual. Pretty strange.
Yes, I noticed that too; the delay is so long that there is probably a 
timeout somewhere.

> And another a bit illogical issue: to print UTF-8 strings one should 
> NOT set binary type /utf8. This works fine with encoding 
> set: io:format("~ts~n", [<<"Тестовая строка">>]).
> This fails in both noinput-cases with encoding set: io:format("~ts~n", 
> [<<"Тестовая строка"/utf8>>]).
Remember that *source files are still always interpreted as latin-1*.

From http://www.erlang.org/doc/apps/stdlib/unicode_usage.html:

    It is convenient to be able to write a list of Unicode characters in
    the string syntax. However, the language specifies strings as being
    in the ISO-latin-1 character set which the compiler tool chain as
    well as many other tools expect.

    Also the source code is (for now) still expected to be written using
    the ISO-latin-1 character set, why Unicode characters beyond that
    range cannot be entered in string literals.

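This means that the "/utf8" modifier will always do a latin1->utf8 
encoding: every byte of your UTF-8 source text is read as one latin-1 
character and then encoded once more, so the result is double-encoded.
So, yes, if you ensure that your source files are UTF-8 encoded, you can 
use the string literals as they are (without "/utf8"), and expect them 
to be UTF-8.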
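
To make the double encoding concrete, here is a minimal sketch (the module 
name double_encoding_demo is made up for illustration). It writes the UTF-8 
bytes of "Т" (U+0422) as explicit byte values, so it does not depend on how 
the source file itself is encoded, and then applies the same latin1->utf8 
step that, as far as I can tell, "/utf8" performs on such a literal:

```erlang
-module(double_encoding_demo).
-export([run/0]).

%% Hypothetical demo module, not part of OTP.
run() ->
    %% UTF-8 encoding of U+0422 (Cyrillic capital "Т"), as raw bytes.
    Utf8 = <<16#D0, 16#A2>>,

    %% Treat each byte as a latin-1 character and encode it as UTF-8.
    %% This is assumed to mirror what "/utf8" does to a literal that
    %% the compiler has read as latin-1.
    Double = unicode:characters_to_binary(Utf8, latin1, utf8),

    %% Each original byte has become a two-byte sequence.
    <<16#C3, 16#90, 16#C2, 16#A2>> = Double,

    %% Decoding the original bytes gives the intended code point ...
    [16#0422] = unicode:characters_to_list(Utf8, utf8),

    %% ... while decoding the double-encoded bytes gives two latin-1
    %% characters (U+00D0 and U+00A2) instead.
    [16#D0, 16#A2] = unicode:characters_to_list(Double, utf8),
    ok.
```

Calling double_encoding_demo:run() should return ok if the pattern matches 
hold. Printing Double with "~ts" gives two latin-1 characters rather than 
the Cyrillic letter, which matches the failure you describe.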

> I guess it's because of double encoding (by explicitly defined 
> encoding and that suffix) but I was confused at first. It's better not 
> to set encoding but declare it in binary strings like they do in 
> Python prepending strings with 'u' literal, which doesn't work in 
> Erlang for all cases.
Well, for the u"..." syntax, Python also needs to know the encoding of 
the source file. Unlike Erlang, however, Python can be told what the 
encoding is (and can recognize Unicode files which begin with a BOM 
character).

/Erik