<div dir="ltr">Thanks for the report - I have written a ticket for this. A contribution will of course speed up the handling... :)<div>/siri@otp</div></div><div class="gmail_extra"><br><br><div class="gmail_quote">2013-11-15 Samuel <span dir="ltr"><<a href="mailto:samuelrivas@gmail.com" target="_blank">samuelrivas@gmail.com</a>></span>:<br>

<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">We have seen this in the past, but we fixed it in our own surefire<br>

rebar plugin. If I remember right, the problem is not in<br>

eunit_surefire.erl but in eunit itself. Below is some information I dug<br>

out of my email (unfortunately I never found time to produce a proper<br>

patch for this):<br>

<br>

------<br>

I am pretty sure the patch<br>

<a href="https://github.com/richcarl/eunit/commit/9f505f1b8881f44c1e5d37df005533b2af6d6a7e" target="_blank">https://github.com/richcarl/eunit/commit/9f505f1b8881f44c1e5d37df005533b2af6d6a7e</a><br>

does not solve the right problem.<br>

<br>

As far as I can understand, the output is already in binary state when<br>

it reaches the eunit_surefire code, which means that it is already<br>

encoded. The patch seems to work because the encoding happened to be<br>

latin1 (by coincidence) and then re-encoding to UTF8 works.<br>

<br>

The root issue seems to be in eunit_proc, which ignores the encoding of<br>

the io_requests and then buffer_to_binary just does list_to_binary.<br>

<br>

The patch seems to work because it does the right thing for codepoints<br>

between 128 and 255, as they are the same as the latin1 encoding for<br>

them. Thus they get properly encoded to utf-8 when writing the xml<br>

file, but will probably fail if the binary passed to eunit_surefire<br>

were properly encoded in utf-8.<br>

<br>

There is a major issue with that: eunit_proc will crash if<br>

any test outputs a codepoint higher than 255. I think I have a proper<br>

fix for that, but I haven't had the time to test it thoroughly yet.<br>

When fixed, the surefire report must be written in raw again, as the<br>

binaries should be utf8 encoded already.<br>

<br>

The following patch makes it work again, but it is a hack, as it assumes the<br>

strings to be unicode in the list form and utf8 in the binary form<br>

(which I guess is true in the current OTP implementation):<br>

<br>

-buffer_to_binary(Buf) -> list_to_binary(lists:reverse(Buf)).<br>

+buffer_to_binary(Buf) -> unicode:characters_to_binary(lists:reverse(Buf)).<br>
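
<br>

A minimal sketch of the difference between the two variants, assuming the<br>

buffer is a deep list of unicode codepoints as in the crash report. The module<br>

name and the old_/new_ function names are hypothetical, used only for illustration:<br>

```erlang
-module(buffer_encoding_sketch).
-export([demo/0]).

%% Hypothetical copies of the two variants from the diff above;
%% the real function lives in eunit_proc.erl.
old_buffer_to_binary(Buf) -> list_to_binary(lists:reverse(Buf)).
new_buffer_to_binary(Buf) -> unicode:characters_to_binary(lists:reverse(Buf)).

%% Buf mimics the buffered output seen in the crash report:
%% a deep list holding codepoint 1013 followed by a newline.
demo() ->
    Buf = [[[1013], "\n"]],
    %% list_to_binary/1 only accepts integers in 0..255, so this raises badarg.
    Old = try old_buffer_to_binary(Buf) catch error:badarg -> badarg end,
    %% characters_to_binary/1 encodes the codepoints as UTF-8 instead.
    New = binary_to_list(new_buffer_to_binary(Buf)),
    {Old, New}.
```

Calling buffer_encoding_sketch:demo() should evaluate to a tuple whose first<br>

element is badarg and whose second element holds the bytes 207, 181, 10, that is,<br>

the two-byte UTF-8 encoding of codepoint 1013 followed by the newline.<br>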

<br>

<br>

As an example, the attached suite causes this when run:<br>

<br>

> eunit_unicode_crash:test().<br>

<br>

=ERROR REPORT==== 27-Aug-2012::14:26:49 ===<br>

Error in process <0.78.0> with exit value:<br>

{badarg,[{erlang,list_to_binary,[[[[1013],"\n"]]],[]},{eunit_proc,buffer_to_binary,1,[{file,"eunit_proc.erl"},{line,276}]},{eunit_proc,group_leader_loop,3,[{file,"eunit_proc.erl"},{line,600}]}]}<br>


<br>

eunit_unicode_crash: unicode_test (module 'eunit_unicode_crash')...*skipped*<br>

undefined<br>

*unexpected termination of test process*<br>

::{badarg,[{erlang,list_to_binary,[[[[1013],"\n"]]],[]},<br>

           {eunit_proc,buffer_to_binary,1,<br>

                       [{file,"eunit_proc.erl"},{line,276}]},<br>

           {eunit_proc,group_leader_loop,3,<br>

                       [{file,"eunit_proc.erl"},{line,600}]}]}<br>

<div class="HOEnZb"><div class="h5"><br>

On 13 November 2013 13:38, Magnus Henoch <<a href="mailto:magnus@erlang-solutions.com">magnus@erlang-solutions.com</a>> wrote:<br>

> Compile the following module and run eunit_xml_encoding_bug:doit() from<br>

> an Erlang shell:<br>

><br>

> -module(eunit_xml_encoding_bug).<br>

><br>

> -compile(export_all).<br>

><br>

> -include_lib("eunit/include/eunit.hrl").<br>

><br>

> doit() -><br>

>     eunit:test(?MODULE, [{report, {eunit_surefire,[]}}]).<br>

><br>

> my_test_() -><br>

>     ?_test(io:format([128,10])).<br>

><br>

> This creates a file called TEST-eunit_xml_encoding_bug.xml which claims<br>

> to be in UTF-8 (its first line is '<?xml version="1.0" encoding="UTF-8" ?>')<br>

> but contains an improperly encoded character.  Most XML tools will<br>

> refuse to do anything with such an XML file.  For example xmllint says:<br>

><br>

> $ xmllint /tmp/TEST-eunit_xml_encoding_bug.xml<br>

> /tmp/TEST-eunit_xml_encoding_bug.xml:4: parser error : Input is not proper UTF-8, indicate encoding !<br>

><br>

> And opening the file in Firefox yields:<br>

><br>

> XML Parsing Error: not well-formed<br>

> Location: file:///tmp/TEST-eunit_xml_encoding_bug.xml<br>

> Line Number 4, Column 17:<br>

><br>

> I came across this problem when running a Quickcheck property inside<br>

> Eunit.  The Quickcheck property would output random binary data with<br>

> io:format("~p"), and sometimes that would end up being high bytes which<br>

> were valid Latin-1 but invalid UTF-8.<br>

><br>

> As eunit_surefire declares its output files to be in UTF-8 encoding, I<br>

> think it should check that the contents of <system-out> etc are properly<br>

> encoded, and if not do something about it, e.g. convert from Latin-1 to<br>

> UTF-8 or insert replacement characters (U+FFFD).<br>

><br>

> Regards,<br>

> Magnus<br>

> _______________________________________________<br>

> erlang-bugs mailing list<br>

> <a href="mailto:erlang-bugs@erlang.org">erlang-bugs@erlang.org</a><br>

> <a href="http://erlang.org/mailman/listinfo/erlang-bugs" target="_blank">http://erlang.org/mailman/listinfo/erlang-bugs</a><br>

<br>

<br>

<br>

</div></div><span class="HOEnZb"><font color="#888888">--<br>

Samuel<br>

</font></span><br>

<br></blockquote></div><br></div>