[erlang-bugs] eunit_surefire doesn't ensure proper UTF-8 encoding

Thu Feb 6 09:52:23 CET 2014

Well, a past version of myself had already dug up that email :).
I should reread my own threads before responding.

On 6 February 2014 09:50, Samuel <samuelrivas@REDACTED> wrote:
> I have a patch somewhere that partially solves it by removing the
> wrong characters from the generated XML document. The real issue is in
> eunit, however, and is a bit more hairy to fix (not because of the fix
> itself but because it is difficult to test). I also have an email
> explaining that somewhere.
>
> I can try to dig them up if that is of some use, but I cannot promise a
> patch for the root cause any time soon.
>
> On 31 January 2014 14:28, Siri Hansen <erlangsiri@REDACTED> wrote:
>> Thanks for the report - I have written a ticket for this. A contribution
>> will of course speed up the handling... :)
>> /siri@REDACTED
>>
>>
>> 2013-11-15 Samuel <samuelrivas@REDACTED>:
>>
>>> We have seen this in the past, but we fixed it in our own surefire
>>> rebar plugin. If I remember right, the problem is not in
>>> eunit_surefire.erl but in eunit itself. Below is some information I
>>> dug out of my email (unfortunately I never found time to produce a
>>> proper patch for this):
>>>
>>> ------
>>> I am pretty sure the patch
>>>
>>> https://github.com/richcarl/eunit/commit/9f505f1b8881f44c1e5d37df005533b2af6d6a7e
>>> does not solve the right problem.
>>>
>>> As far as I can understand, the output is already a binary when it
>>> reaches the eunit_surefire code, which means that it is already
>>> encoded. The patch seems to work because the encoding happened to be
>>> latin1 (by coincidence), so re-encoding to UTF-8 works.
>>>
>>> The root issue seems to be in eunit_proc, which ignores the encoding
>>> of the io_requests; buffer_to_binary then just calls list_to_binary.
>>>
>>> The patch seems to work because it does the right thing for codepoints
>>> between 128 and 255, as they are the same as their latin1 encoding.
>>> Thus they get properly encoded to UTF-8 when writing the XML file, but
>>> it will probably fail if the binary passed to eunit_surefire is
>>> already properly encoded in UTF-8.
>>>
>>> There is a major issue with that: eunit_proc will crash if any test
>>> outputs a codepoint higher than 255. I think I have a proper fix for
>>> that, but I haven't had the time to test it thoroughly yet. Once that
>>> is fixed, the surefire report must be written raw again, as the
>>> binaries should already be UTF-8 encoded.
>>>
>>> The following patch makes it work again, but it is a hack, as it
>>> assumes strings to be Unicode codepoints in list form and UTF-8 in
>>> binary form (which I guess is true in the current OTP implementation):
>>>
>>> -buffer_to_binary(Buf) -> list_to_binary(lists:reverse(Buf)).
>>> +buffer_to_binary(Buf) ->
>>> +    unicode:characters_to_binary(lists:reverse(Buf)).
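>>>
>>> To make the difference concrete, here is a minimal standalone sketch
>>> of the two variants. The module name buffer_sketch and the demo/0
>>> helper are made up for illustration and are not part of eunit; the
>>> sample buffer assumes that chunks are stored newest first, which is
>>> what the lists:reverse call suggests:
>>>
>>> ```erlang
>>> -module(buffer_sketch).
>>> -export([old_buffer_to_binary/1, new_buffer_to_binary/1, demo/0]).
>>>
>>> %% Pre-patch variant: list_to_binary/1 only accepts bytes (0..255),
>>> %% so any higher codepoint raises badarg.
>>> old_buffer_to_binary(Buf) ->
>>>     list_to_binary(lists:reverse(Buf)).
>>>
>>> %% Patched variant: integers in lists are treated as Unicode
>>> %% codepoints, binaries as UTF-8, and the result is a UTF-8 binary.
>>> new_buffer_to_binary(Buf) ->
>>>     unicode:characters_to_binary(lists:reverse(Buf)).
>>>
>>> %% Hypothetical helper: feeds both variants the data from the crash
>>> %% report (codepoint 1013 followed by a newline).
>>> demo() ->
>>>     Buf = ["\n", [1013]],
>>>     {'EXIT', {badarg, _}} = (catch old_buffer_to_binary(Buf)),
>>>     <<207, 181, 10>> = new_buffer_to_binary(Buf),
>>>     ok.
>>> ```
>>>
>>> Note that unicode:characters_to_binary/1 does not crash on a binary
>>> chunk that is not valid UTF-8; it returns an {error, ...} or
>>> {incomplete, ...} tuple instead, so a real fix would have to handle
>>> those return values as well.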
>>>
>>>
>>> As an example, the attached suite causes this when run:
>>>
>>> > eunit_unicode_crash:test().
>>>
>>> =ERROR REPORT==== 27-Aug-2012::14:26:49 ===
>>> Error in process <0.78.0> with exit value:
>>>
>>> {badarg,[{erlang,list_to_binary,[[[[1013],"\n"]]],[]},{eunit_proc,buffer_to_binary,1,[{file,"eunit_proc.erl"},{line,276}]},{eunit_proc,group_leader_loop,3,[{file,"eunit_proc.erl"},{line,600}]}]}
>>>
>>> eunit_unicode_crash: unicode_test (module
>>> 'eunit_unicode_crash')...*skipped*
>>> undefined
>>> *unexpected termination of test process*
>>> ::{badarg,[{erlang,list_to_binary,[[[[1013],"\n"]]],[]},
>>>            {eunit_proc,buffer_to_binary,1,
>>>                        [{file,"eunit_proc.erl"},{line,276}]},
>>>            {eunit_proc,group_leader_loop,3,
>>>                        [{file,"eunit_proc.erl"},{line,600}]}]}
>>>
>>> On 13 November 2013 13:38, Magnus Henoch <magnus@REDACTED> wrote:
>>> > Compile the following module and run eunit_xml_encoding_bug:doit() from
>>> > an Erlang shell:
>>> >
>>> > -module(eunit_xml_encoding_bug).
>>> >
>>> > -compile(export_all).
>>> >
>>> > -include_lib("eunit/include/eunit.hrl").
>>> >
>>> > doit() ->
>>> >     eunit:test(?MODULE, [{report, {eunit_surefire,[]}}]).
>>> >
>>> > my_test_() ->
>>> >     ?_test(io:format([128,10])).
>>> >
>>> > This creates a file called TEST-eunit_xml_encoding_bug.xml which claims
>>> > to be in UTF-8 (its first line is '<?xml version="1.0" encoding="UTF-8"
>>> > ?>')
>>> > but contains an improperly encoded character.  Most XML tools will
>>> > refuse to do anything with such an XML file.  For example xmllint says:
>>> >
>>> > $ xmllint /tmp/TEST-eunit_xml_encoding_bug.xml
>>> > /tmp/TEST-eunit_xml_encoding_bug.xml:4: parser error : Input is not
>>> > proper UTF-8, indicate encoding !
>>> >
>>> > And opening the file in Firefox yields:
>>> >
>>> > XML Parsing Error: not well-formed
>>> > Location: file:///tmp/TEST-eunit_xml_encoding_bug.xml
>>> > Line Number 4, Column 17:
>>> >
>>> > I came across this problem when running a Quickcheck property inside
>>> > Eunit.  The Quickcheck property would output random binary data with
>>> > io:format("~p"), and sometimes that would end up being high bytes which
>>> > were valid Latin-1 but invalid UTF-8.
>>> >
>>> > As eunit_surefire declares its output files to be in UTF-8 encoding, I
>>> > think it should check that the contents of <system-out> etc are properly
>>> > encoded, and if not do something about it, e.g. convert from Latin-1 to
>>> > UTF-8 or insert replacement characters (U+FFFD).
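>>> >
>>> > A minimal sketch of the Latin-1 fallback, assuming the content is
>>> > available as a binary at that point. The module and function names
>>> > (utf8_sketch, ensure_utf8/1) are hypothetical and not part of
>>> > eunit_surefire:
>>> >
>>> > ```erlang
>>> > -module(utf8_sketch).
>>> > -export([ensure_utf8/1]).
>>> >
>>> > %% Return Bin unchanged if it is already valid UTF-8; otherwise
>>> > %% reinterpret its bytes as Latin-1 and re-encode them as UTF-8.
>>> > ensure_utf8(Bin) when is_binary(Bin) ->
>>> >     case unicode:characters_to_binary(Bin, utf8, utf8) of
>>> >         Utf8 when is_binary(Utf8) ->
>>> >             Utf8;
>>> >         _ErrorOrIncomplete ->
>>> >             unicode:characters_to_binary(Bin, latin1, utf8)
>>> >     end.
>>> > ```
>>> >
>>> > With this, the bytes from the test above, <<128,10>>, would come out
>>> > as <<194,128,10>>. The obvious weakness is that Latin-1 data which
>>> > happens to also be valid UTF-8 is passed through unchanged.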
>>> >
>>> > Regards,
>>> > Magnus
>>> > _______________________________________________
>>> > erlang-bugs mailing list
>>> > erlang-bugs@REDACTED
>>> > http://erlang.org/mailman/listinfo/erlang-bugs
>>>
>>>
>>>
>>> --
>>> Samuel
>>>
>>
>
>
>
> --
> Samuel


-- 
Samuel