[erlang-bugs] eunit_surefire doesn't ensure proper UTF-8 encoding
Samuel
samuelrivas@REDACTED
Thu Feb 6 09:50:22 CET 2014
I have a patch somewhere that partially solves it by removing the
wrong characters from the generated xml document. The real issue is in
eunit, however, and is a bit more hairy to fix (not because of the fix
itself but because it is difficult to test). I also have an email
explaining that somewhere.
I can try to dig them up if it is of some use, but I cannot promise a
patch for the root cause any time soon.
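
For illustration, here is a minimal sketch of the two ideas discussed in
this thread. It is not the patch mentioned above: the module name
utf8_sketch and the function strip_invalid_utf8/1 are hypothetical, and
buffer_to_binary/1 only mirrors the one-line change quoted below. The
first function converts a buffer of code points to UTF-8 instead of
calling list_to_binary, which raises badarg for any code point above 255.
The second drops bytes that are not part of a valid UTF-8 sequence, which
is roughly what "removing the wrong characters" could look like.

```erlang
-module(utf8_sketch).
-export([buffer_to_binary/1, strip_invalid_utf8/1]).

%% Assumption: Buf is a reversed list of chardata chunks, where lists
%% hold Unicode code points and binaries are already UTF-8.
%% unicode:characters_to_binary/1 encodes code points above 255, where
%% list_to_binary/1 would fail with badarg.
buffer_to_binary(Buf) ->
    unicode:characters_to_binary(lists:reverse(Buf)).

%% Hypothetical helper: keep every valid UTF-8 sequence, drop any byte
%% that does not start one. Characters that are valid UTF-8 but not
%% allowed in XML (such as most control characters) are NOT handled here.
strip_invalid_utf8(Bin) when is_binary(Bin) ->
    strip_invalid_utf8(Bin, <<>>).

strip_invalid_utf8(<<>>, Acc) ->
    Acc;
strip_invalid_utf8(<<C/utf8, Rest/binary>>, Acc) ->
    %% A complete, valid UTF-8 sequence: copy it to the output.
    strip_invalid_utf8(Rest, <<Acc/binary, C/utf8>>);
strip_invalid_utf8(<<_Invalid, Rest/binary>>, Acc) ->
    %% Not the start of a valid sequence: skip one byte and retry.
    strip_invalid_utf8(Rest, Acc).
```

For example, utf8_sketch:strip_invalid_utf8(<<"a", 128, "b">>) drops the
stray byte 128. Replacing the dropped byte with U+FFFD instead would be a
small change to the last clause.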
On 31 January 2014 14:28, Siri Hansen <erlangsiri@REDACTED> wrote:
> Thanks for the report - I have written a ticket for this. A contribution
> will of course speed up the handling... :)
> /siri@REDACTED
>
>
> 2013-11-15 Samuel <samuelrivas@REDACTED>:
>
>> We have seen this in the past, but we fixed it in our own surefire
>> rebar plugin. If I remember right, the problem is not in
>> eunit_surefire.erl but in eunit itself. Below is some information I dug
>> out of my email (unfortunately I never found time to produce a proper
>> patch for this):
>>
>> ------
>> I am pretty sure the patch
>>
>> https://github.com/richcarl/eunit/commit/9f505f1b8881f44c1e5d37df005533b2af6d6a7e
>> does not solve the right problem.
>>
>> As far as I can understand, the output is already in binary state when
>> it reaches the eunit_surefire code, which means that it is already
>> encoded. The patch seems to work because the encoding happened to be
>> latin1 (by coincidence) and then re-encoding to UTF8 works.
>>
>> The root issue seems to be in eunit_proc, which ignores the encoding of
>> the io_requests and then buffer_to_binary just does list_to_binary.
>>
>> The patch seems to work because it does the right thing for codepoints
>> between 127 and 255, as they are the same as the latin1 encoding for
>> them. Thus they get properly encoded to utf-8 when writing the xml
>> file, but will probably fail if the binary passed to eunit_surefire
>> were properly encoded in utf-8.
>>
>> There is a major issue with that: eunit_proc will crash if any test
>> outputs a codepoint higher than 255. I think I have a proper fix for
>> that, but I haven't had the time to test it thoroughly yet. When
>> fixed, the surefire report must be written in raw again, as the
>> binaries should be utf8 encoded already.
>>
>> The following patch makes it work again, but it is a hack, as it
>> assumes the strings to be unicode in the list form and utf8 in the
>> binary form (which I guess is true in the current OTP implementation):
>>
>> -buffer_to_binary(Buf) -> list_to_binary(lists:reverse(Buf)).
>> +buffer_to_binary(Buf) ->
>> +    unicode:characters_to_binary(lists:reverse(Buf)).
>>
>>
>> As an example, the attached suite causes this when run:
>>
>> > eunit_unicode_crash:test().
>>
>> =ERROR REPORT==== 27-Aug-2012::14:26:49 ===
>> Error in process <0.78.0> with exit value:
>>
>> {badarg,[{erlang,list_to_binary,[[[[1013],"\n"]]],[]},{eunit_proc,buffer_to_binary,1,[{file,"eunit_proc.erl"},{line,276}]},{eunit_proc,group_leader_loop,3,[{file,"eunit_proc.erl"},{line,600}]}]}
>>
>> eunit_unicode_crash: unicode_test (module
>> 'eunit_unicode_crash')...*skipped*
>> undefined
>> *unexpected termination of test process*
>> ::{badarg,[{erlang,list_to_binary,[[[[1013],"\n"]]],[]},
>> {eunit_proc,buffer_to_binary,1,
>> [{file,"eunit_proc.erl"},{line,276}]},
>> {eunit_proc,group_leader_loop,3,
>> [{file,"eunit_proc.erl"},{line,600}]}]}
>>
>> On 13 November 2013 13:38, Magnus Henoch <magnus@REDACTED>
>> wrote:
>> > Compile the following module and run eunit_xml_encoding_bug:doit() from
>> > an Erlang shell:
>> >
>> > -module(eunit_xml_encoding_bug).
>> >
>> > -compile(export_all).
>> >
>> > -include_lib("eunit/include/eunit.hrl").
>> >
>> > doit() ->
>> >     eunit:test(?MODULE, [{report, {eunit_surefire,[]}}]).
>> >
>> > my_test_() ->
>> >     ?_test(io:format([128,10])).
>> >
>> > This creates a file called TEST-eunit_xml_encoding_bug.xml which claims
>> > to be in UTF-8 (its first line is
>> > '<?xml version="1.0" encoding="UTF-8" ?>')
>> > but contains an improperly encoded character. Most XML tools will
>> > refuse to do anything with such an XML file. For example xmllint says:
>> >
>> > $ xmllint /tmp/TEST-eunit_xml_encoding_bug.xml
>> > /tmp/TEST-eunit_xml_encoding_bug.xml:4: parser error : Input is not
>> > proper UTF-8, indicate encoding !
>> >
>> > And opening the file in Firefox yields:
>> >
>> > XML Parsing Error: not well-formed
>> > Location: file:///tmp/TEST-eunit_xml_encoding_bug.xml
>> > Line Number 4, Column 17:
>> >
>> > I came across this problem when running a Quickcheck property inside
>> > Eunit. The Quickcheck property would output random binary data with
>> > io:format("~p"), and sometimes that would end up being high bytes which
>> > were valid Latin-1 but invalid UTF-8.
>> >
>> > As eunit_surefire declares its output files to be in UTF-8 encoding, I
>> > think it should check that the contents of <system-out> etc are properly
>> > encoded, and if not do something about it, e.g. convert from Latin-1 to
>> > UTF-8 or insert replacement characters (U+FFFD).
>> >
>> > Regards,
>> > Magnus
>> > _______________________________________________
>> > erlang-bugs mailing list
>> > erlang-bugs@REDACTED
>> > http://erlang.org/mailman/listinfo/erlang-bugs
>>
>>
>>
>> --
>> Samuel
>>
>>
>
--
Samuel