[erlang-bugs] eunit_surefire doesn't ensure proper UTF-8 encoding
Samuel
samuelrivas@REDACTED
Fri Nov 15 09:42:44 CET 2013
We have seen this in the past, but we fixed it in our own surefire
rebar plugin. If I remember right, the problem is not in
eunit_surefire.erl but in eunit itself. Below is some information I dug
out of my email (unfortunately I never found time to produce a proper
patch for this):
------
I am pretty sure the patch
https://github.com/richcarl/eunit/commit/9f505f1b8881f44c1e5d37df005533b2af6d6a7e
does not solve the right problem.
As far as I can understand, the output is already a binary when
it reaches the eunit_surefire code, which means that it is already
encoded. The patch seems to work because the encoding happened to be
latin1 (by coincidence), so re-encoding to UTF-8 works.
The root issue seems to be in eunit_proc, which ignores the encoding of
the io_requests, after which buffer_to_binary just does list_to_binary.
The patch seems to work because it does the right thing for codepoints
between 128 and 255, as they are the same as the latin1 encoding for
them. Thus they get properly encoded to UTF-8 when writing the xml
file, but it will probably fail if the binary passed to eunit_surefire
were properly encoded in UTF-8.
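To illustrate the point, here is a minimal sketch of what re-encoding a
binary as if it were latin1 does. The module and function names are made
up for this example; only unicode:characters_to_binary/3 is the real OTP
call. It is not the code from the patch above.

```erlang
-module(encoding_demo).
-export([latin1_to_utf8/1]).

%% Treat every byte of Bin as a latin1 character and encode it as UTF-8.
%% This is only correct if Bin really is latin1.
latin1_to_utf8(Bin) when is_binary(Bin) ->
    unicode:characters_to_binary(Bin, latin1, utf8).

%% In the shell:
%%   encoding_demo:latin1_to_utf8(<<128>>).
%%     gives <<194,128>>, valid UTF-8 for U+0080.
%%   encoding_demo:latin1_to_utf8(<<207,181>>).
%%     gives <<195,143,194,181>>: the input was already UTF-8 for
%%     codepoint 1013, so it has now been encoded twice.
```

So the conversion is fine for latin1 input and silently mangles input
that was already UTF-8.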
There is a major issue with that: eunit_proc will crash if any test
outputs a codepoint higher than 255. I think I have a proper fix for
that, but I haven't had the time to test it thoroughly yet.
When fixed, the surefire report must be written raw again, as the
binaries should be UTF-8 encoded already.
The following patch makes it work again, but it is a hack, as it assumes
the strings to be unicode in list form and utf8 in binary form
(which I guess is true in the current OTP implementation):
-buffer_to_binary(Buf) -> list_to_binary(lists:reverse(Buf)).
+buffer_to_binary(Buf) -> unicode:characters_to_binary(lists:reverse(Buf)).
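A small standalone sketch of the difference between the two versions,
using the same buffer contents that show up in the crash report below.
The module name and the old_/new_ function names are invented for the
example; the function bodies are the two lines of the patch.

```erlang
-module(buffer_demo).
-export([old_buffer_to_binary/1, new_buffer_to_binary/1]).

%% Original version: list_to_binary only accepts integers in 0..255,
%% so any higher codepoint in the buffer raises badarg.
old_buffer_to_binary(Buf) ->
    list_to_binary(lists:reverse(Buf)).

%% Patched version: integers are taken as unicode codepoints and
%% binaries are assumed to be UTF-8 already.
new_buffer_to_binary(Buf) ->
    unicode:characters_to_binary(lists:reverse(Buf)).

%% In the shell:
%%   Buf = [[[1013],"\n"]].
%%   catch buffer_demo:old_buffer_to_binary(Buf).   % {'EXIT',{badarg,...}}
%%   buffer_demo:new_buffer_to_binary(Buf).         % <<207,181,10>>
```

Note that unicode:characters_to_binary/1 returns an error tuple instead
of a binary if the buffer contains a binary that is not valid UTF-8,
which is the assumption that makes this a hack.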
As an example, the attached suite causes this when run:
> eunit_unicode_crash:test().
=ERROR REPORT==== 27-Aug-2012::14:26:49 ===
Error in process <0.78.0> with exit value:
{badarg,[{erlang,list_to_binary,[[[[1013],"\n"]]],[]},{eunit_proc,buffer_to_binary,1,[{file,"eunit_proc.erl"},{line,276}]},{eunit_proc,group_leader_loop,3,[{file,"eunit_proc.erl"},{line,600}]}]}
eunit_unicode_crash: unicode_test (module 'eunit_unicode_crash')...*skipped*
undefined
*unexpected termination of test process*
::{badarg,[{erlang,list_to_binary,[[[[1013],"\n"]]],[]},
{eunit_proc,buffer_to_binary,1,
[{file,"eunit_proc.erl"},{line,276}]},
{eunit_proc,group_leader_loop,3,
[{file,"eunit_proc.erl"},{line,600}]}]}
On 13 November 2013 13:38, Magnus Henoch <magnus@REDACTED> wrote:
> Compile the following module and run eunit_xml_encoding_bug:doit() from
> an Erlang shell:
>
> -module(eunit_xml_encoding_bug).
>
> -compile(export_all).
>
> -include_lib("eunit/include/eunit.hrl").
>
> doit() ->
> eunit:test(?MODULE, [{report, {eunit_surefire,[]}}]).
>
> my_test_() ->
> ?_test(io:format([128,10])).
>
> This creates a file called TEST-eunit_xml_encoding_bug.xml which claims
> to be in UTF-8 (its first line is '<?xml version="1.0" encoding="UTF-8" ?>')
> but contains an improperly encoded character. Most XML tools will
> refuse to do anything with such an XML file. For example xmllint says:
>
> $ xmllint /tmp/TEST-eunit_xml_encoding_bug.xml
> /tmp/TEST-eunit_xml_encoding_bug.xml:4: parser error : Input is not proper UTF-8, indicate encoding !
>
> And opening the file in Firefox yields:
>
> XML Parsing Error: not well-formed
> Location: file:///tmp/TEST-eunit_xml_encoding_bug.xml
> Line Number 4, Column 17:
>
> I came across this problem when running a Quickcheck property inside
> Eunit. The Quickcheck property would output random binary data with
> io:format("~p"), and sometimes that would end up being high bytes which
> were valid Latin-1 but invalid UTF-8.
>
> As eunit_surefire declares its output files to be in UTF-8 encoding, I
> think it should check that the contents of <system-out> etc are properly
> encoded, and if not do something about it, e.g. convert from Latin-1 to
> UTF-8 or insert replacement characters (U+FFFD).
>
> Regards,
> Magnus
> _______________________________________________
> erlang-bugs mailing list
> erlang-bugs@REDACTED
> http://erlang.org/mailman/listinfo/erlang-bugs
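The fallback Magnus suggests (check the contents, and convert from
Latin-1 if they are not valid UTF-8) could be sketched roughly as
follows. The module and function names are hypothetical, and this is not
what eunit_surefire does; it only shows the idea using
unicode:characters_to_binary/3.

```erlang
-module(utf8_guard).
-export([ensure_utf8/1]).

%% Return Bin unchanged if it is valid UTF-8; otherwise assume it is
%% latin1 and convert it. The latin1 guess is an assumption: invalid
%% UTF-8 could be in any other encoding.
ensure_utf8(Bin) when is_binary(Bin) ->
    case unicode:characters_to_binary(Bin, utf8, utf8) of
        Bin ->
            %% Round-trips unchanged, so it was valid UTF-8.
            Bin;
        _ErrorOrIncomplete ->
            unicode:characters_to_binary(Bin, latin1, utf8)
    end.

%% In the shell:
%%   utf8_guard:ensure_utf8(<<128,10>>).    % <<194,128,10>>
%%   utf8_guard:ensure_utf8(<<207,181>>).   % <<207,181>>, left as is
```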
--
Samuel
-------------- next part --------------
A non-text attachment was scrubbed...
Name: eunit_unicode_crash.erl
Type: text/x-erlang
Size: 147 bytes
Desc: not available
URL: <http://erlang.org/pipermail/erlang-bugs/attachments/20131115/016f55ec/attachment.bin>