<div dir="ltr">Thanks for the report - I have written a ticket for this. A contribution will of course speed up the handling... :)<div>/siri@otp</div></div><div class="gmail_extra"><br><br><div class="gmail_quote">2013-11-15 Samuel <span dir="ltr"><<a href="mailto:samuelrivas@gmail.com" target="_blank">samuelrivas@gmail.com</a>></span>:<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">We have seen this in the past, but we fixed it in our own surefire<br>
rebar plugin. If I remember right, the problem is not in<br>
eunit_surefire.erl but in eunit itself. Below is some information I dug<br>
out of my email (unfortunately I never found time to produce a proper<br>
patch for this):<br>
<br>
------<br>
I am pretty sure the patch<br>
<a href="https://github.com/richcarl/eunit/commit/9f505f1b8881f44c1e5d37df005533b2af6d6a7e" target="_blank">https://github.com/richcarl/eunit/commit/9f505f1b8881f44c1e5d37df005533b2af6d6a7e</a><br>
does not solve the right problem.<br>
<br>
As far as I can tell, the output is already a binary when it reaches<br>
the eunit_surefire code, which means that it is already encoded. The<br>
patch seems to work because the encoding happened to be latin1 (by<br>
coincidence), so re-encoding to UTF-8 works.<br>
<br>
The root issue seems to be in eunit_proc, which ignores the encoding of<br>
the io_requests; buffer_to_binary then just calls list_to_binary.<br>
<br>
The patch seems to work because it does the right thing for code points<br>
from 128 to 255, which have the same values in latin1 as in Unicode.<br>
Thus they get properly encoded to UTF-8 when writing the XML file, but<br>
the result will probably be wrong if the binary passed to eunit_surefire<br>
were already properly encoded in UTF-8.<br>
<br>
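A minimal sketch of that concern, assuming the re-encoding step treats<br>
every incoming byte as a latin1 character. The module name and reencode/1<br>
are hypothetical, and unicode:characters_to_binary/3 only stands in for<br>
whatever the commit actually does:<br>
<br>
```erlang
-module(reencode_sketch).
-export([reencode/1, demo/0]).

%% Hypothetical stand-in for the re-encoding step: treat every byte of the
%% input as a latin1 character and emit UTF-8.
reencode(Bin) -> unicode:characters_to_binary(Bin, latin1, utf8).

demo() ->
    %% Code point 128 stored as one raw byte, which is what list_to_binary
    %% produces for it.
    Latin1 = list_to_binary([128]),
    [16#C2, 16#80] = binary_to_list(reencode(Latin1)),
    %% The same code point already encoded as UTF-8 gets encoded a second
    %% time, so the output is no longer the intended character.
    Utf8 = unicode:characters_to_binary([128]),
    [16#C3, 16#82, 16#C2, 16#80] = binary_to_list(reencode(Utf8)),
    ok.
```
<br>
Calling reencode_sketch:demo() should return ok: the latin1 input comes<br>
out as valid UTF-8, while the input that was already UTF-8 is double-encoded.<br>
<br>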
There is a major issue with that: eunit_proc will crash if any test<br>
outputs a code point higher than 255. I think I have a proper fix for<br>
that, but I haven't had the time to test it thoroughly yet. Once that is<br>
fixed, the surefire report must be written raw again, as the binaries<br>
should already be UTF-8 encoded.<br>
<br>
The following patch makes it work again, but it is a hack, as it assumes<br>
that strings are Unicode code points in list form and UTF-8 in binary form<br>
(which I guess is true in the current OTP implementation):<br>
<br>
-buffer_to_binary(Buf) -> list_to_binary(lists:reverse(Buf)).<br>
+buffer_to_binary(Buf) -> unicode:characters_to_binary(lists:reverse(Buf)).<br>
<br>
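To illustrate the difference between the two versions of buffer_to_binary,<br>
here is a small standalone sketch. The module name and the old_/new_<br>
function names are hypothetical; only the two function bodies come from the<br>
patch, and the buffer has the same shape as the one in the crash report:<br>
<br>
```erlang
-module(buffer_encoding_sketch).
-export([old_buffer_to_binary/1, new_buffer_to_binary/1, demo/0]).

%% Body before the patch: every list element must be a byte (0..255).
old_buffer_to_binary(Buf) -> list_to_binary(lists:reverse(Buf)).

%% Body after the patch: list elements are taken as Unicode code points,
%% binaries as UTF-8, and the result is a UTF-8 binary.
new_buffer_to_binary(Buf) -> unicode:characters_to_binary(lists:reverse(Buf)).

demo() ->
    %% Code point 1013 (U+03F5) followed by a newline.
    Buf = [[[1013], "\n"]],
    %% The old body raises badarg for any code point above 255.
    {'EXIT', {badarg, _}} = (catch old_buffer_to_binary(Buf)),
    %% The new body encodes U+03F5 as the two bytes CF B5.
    [16#CF, 16#B5, $\n] = binary_to_list(new_buffer_to_binary(Buf)),
    %% Code point 128 is one raw byte with the old body and two UTF-8 bytes
    %% with the new one.
    [128, 10] = binary_to_list(old_buffer_to_binary([[128, 10]])),
    [16#C2, 16#80, 10] = binary_to_list(new_buffer_to_binary([[128, 10]])),
    ok.
```
<br>
Calling buffer_encoding_sketch:demo() should return ok. The last two checks<br>
show why a report writer that re-encodes from latin1 would need to change<br>
once the buffer is already UTF-8.<br>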
<br>
As an example, the attached suite causes this when run:<br>
<br>
> eunit_unicode_crash:test().<br>
<br>
=ERROR REPORT==== 27-Aug-2012::14:26:49 ===<br>
Error in process <0.78.0> with exit value:<br>
{badarg,[{erlang,list_to_binary,[[[[1013],"\n"]]],[]},{eunit_proc,buffer_to_binary,1,[{file,"eunit_proc.erl"},{line,276}]},{eunit_proc,group_leader_loop,3,[{file,"eunit_proc.erl"},{line,600}]}]}<br>
<br>
eunit_unicode_crash: unicode_test (module 'eunit_unicode_crash')...*skipped*<br>
undefined<br>
*unexpected termination of test process*<br>
::{badarg,[{erlang,list_to_binary,[[[[1013],"\n"]]],[]},<br>
{eunit_proc,buffer_to_binary,1,<br>
[{file,"eunit_proc.erl"},{line,276}]},<br>
{eunit_proc,group_leader_loop,3,<br>
[{file,"eunit_proc.erl"},{line,600}]}]}<br>
<div class="HOEnZb"><div class="h5"><br>
On 13 November 2013 13:38, Magnus Henoch <<a href="mailto:magnus@erlang-solutions.com">magnus@erlang-solutions.com</a>> wrote:<br>
> Compile the following module and run eunit_xml_encoding_bug:doit() from<br>
> an Erlang shell:<br>
><br>
> -module(eunit_xml_encoding_bug).<br>
><br>
> -compile(export_all).<br>
><br>
> -include_lib("eunit/include/eunit.hrl").<br>
><br>
> doit() -><br>
> eunit:test(?MODULE, [{report, {eunit_surefire,[]}}]).<br>
><br>
> my_test_() -><br>
> ?_test(io:format([128,10])).<br>
><br>
> This creates a file called TEST-eunit_xml_encoding_bug.xml which claims<br>
> to be in UTF-8 (its first line is '<?xml version="1.0" encoding="UTF-8" ?>')<br>
> but contains an improperly encoded character. Most XML tools will<br>
> refuse to do anything with such an XML file. For example xmllint says:<br>
><br>
> $ xmllint /tmp/TEST-eunit_xml_encoding_bug.xml<br>
> /tmp/TEST-eunit_xml_encoding_bug.xml:4: parser error : Input is not proper UTF-8, indicate encoding !<br>
><br>
> And opening the file in Firefox yields:<br>
><br>
> XML Parsing Error: not well-formed<br>
> Location: file:///tmp/TEST-eunit_xml_encoding_bug.xml<br>
> Line Number 4, Column 17:<br>
><br>
> I came across this problem when running a Quickcheck property inside<br>
> Eunit. The Quickcheck property would output random binary data with<br>
> io:format("~p"), and sometimes that would end up being high bytes which<br>
> were valid Latin-1 but invalid UTF-8.<br>
><br>
> As eunit_surefire declares its output files to be in UTF-8 encoding, I<br>
> think it should check that the contents of <system-out> etc are properly<br>
> encoded, and if not do something about it, e.g. convert from Latin-1 to<br>
> UTF-8 or insert replacement characters (U+FFFD).<br>
><br>
> Regards,<br>
> Magnus<br>
> _______________________________________________<br>
> erlang-bugs mailing list<br>
> <a href="mailto:erlang-bugs@erlang.org">erlang-bugs@erlang.org</a><br>
> <a href="http://erlang.org/mailman/listinfo/erlang-bugs" target="_blank">http://erlang.org/mailman/listinfo/erlang-bugs</a><br>
<br>
<br>
<br>
</div></div><span class="HOEnZb"><font color="#888888">--<br>
Samuel<br>
</font></span><br>
<br></blockquote></div><br></div>