[erlang-questions] Strange behaviour in unicode:characters_to_list/2

Philip Baker Philip.Baker@REDACTED
Thu Jun 2 00:32:36 CEST 2011


I've only been using Erlang for a couple of days, so I may be missing something, but I've come across behavior that baffles me in unicode:characters_to_list/2 when it is passed a binary containing an invalid UTF-8 byte.

Here is some code that demonstrates it:

-define(TESTCALL(N, Call), io:format("~p:  ~s: ~n\t\t~p~n~n", [N, ??Call, Call]), timer:sleep(1)).

test2() ->
    ?TESTCALL(1, unicode:characters_to_list(<<16#c3, 16#a0, 98, 99, 16#81,0,0,0 >>, utf8)),
    ?TESTCALL(2, Rest = <<16#c3, 16#a0, 98, 99, 16#81,0,0,0 >>), 
    ?TESTCALL(3, unicode:characters_to_list(Rest, utf8)),
    ?TESTCALL(4, <<Rest2/binary>> = <<16#c3, 16#a0, 98, 99, 16#81,0,0,0 >>), 
    ?TESTCALL(5, unicode:characters_to_list(Rest2, utf8)),
    ?TESTCALL(6, <<_:1/binary, Rest3/binary>> = <<0, 16#c3, 16#a0, 98, 99, 16#81,0,0,0 >>), 
    ?TESTCALL(7, Rest3), 
    ?TESTCALL(8, unicode:characters_to_list(Rest3, utf8)),
    ?TESTCALL(9, Rest4 = list_to_binary([Rest3])),
    ?TESTCALL(10, unicode:characters_to_list(Rest4, utf8)),
    ?TESTCALL(11, Rest4 =:= Rest3),
    ?TESTCALL(12, unicode:characters_to_list(Rest3, utf8) =:= unicode:characters_to_list(Rest4, utf8)).

Calling test2() gives this output:

1:  unicode : characters_to_list ( << 195 , 160 , 98 , 99 , 129 , 0 , 0 , 0 >> , utf8 ): 
		{error,"àbc",<<129,0,0,0>>}

2:  Rest = << 195 , 160 , 98 , 99 , 129 , 0 , 0 , 0 >>: 
		<<195,160,98,99,129,0,0,0>>

3:  unicode : characters_to_list ( Rest , utf8 ): 
		{error,"àbc",<<129,0,0,0>>}

4:  << Rest2 / binary >> = << 195 , 160 , 98 , 99 , 129 , 0 , 0 , 0 >>: 
		<<195,160,98,99,129,0,0,0>>

5:  unicode : characters_to_list ( Rest2 , utf8 ): 
		{error,"àbc",<<129,0,0,0>>}

6:  << _ : 1 / binary , Rest3 / binary >> = << 0 , 195 , 160 , 98 , 99 , 129 , 0 , 0 , 0 >>: 
		<<0,195,160,98,99,129,0,0,0>>

7:  Rest3: 
		<<195,160,98,99,129,0,0,0>>

8:  unicode : characters_to_list ( Rest3 ): 
		{error,"àbc",<<99,129,0,0>>}

9:  Rest4 = list_to_binary ( [ Rest3 ] ): 
		<<195,160,98,99,129,0,0,0>>

10:  unicode : characters_to_list ( Rest4 , utf8 ): 
		{error,"àbc",<<129,0,0,0>>}

11:  Rest4 =:= Rest3: 
		true

12:  unicode : characters_to_list ( Rest3 , utf8 ) =:= unicode : characters_to_list ( Rest4 , utf8 ): 
		false


Looking at calls 11 and 12 in the output, I don't understand how, if Rest4 and Rest3 are equal, the calls to characters_to_list can give unequal results.

The earlier calls show the various ways I constructed binaries to pass to characters_to_list. In most cases the results are what I expect, but calls 6-8 show that when I use pattern matching to extract the binary from a larger one, characters_to_list produces output in which the letter 'c' appears in both the converted list and the unconverted "RestData" binary. In calls 9 and 10, I create an apparently identical binary using list_to_binary/1 with a list containing only the "broken" binary. This new binary produces the correct output again.
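
The comparison in calls 6-12 can be boiled down to a smaller sketch. The module name subbin_check and the function compare/0 are made up for illustration and are not part of the test above. It converts the same bytes twice, once as a sub-binary obtained by pattern matching and once as a fresh binary built with list_to_binary/1, and reports whether the two results agree.

```erlang
-module(subbin_check).
-export([compare/0]).

%% Hypothetical helper, not part of the original test code.
%% Returns {ResultsAgree, FromSubBinary, FromFreshBinary}.
compare() ->
    Bytes = <<16#c3, 16#a0, 98, 99, 16#81, 0, 0, 0>>,
    %% Sub-binary: skip the first byte of a larger binary by pattern matching.
    <<_:1/binary, Sub/binary>> = <<0, Bytes/binary>>,
    %% Fresh binary with the same contents, built as in call 9.
    Fresh = list_to_binary([Sub]),
    %% The two binaries must compare equal, as in call 11.
    true = (Sub =:= Fresh),
    FromSub = unicode:characters_to_list(Sub, utf8),
    FromFresh = unicode:characters_to_list(Fresh, utf8),
    {FromSub =:= FromFresh, FromSub, FromFresh}.
```

Since the two binaries are equal, I would expect the first element of the returned tuple to be true; with the behavior shown in call 12 it would come back false.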

I'm using R14B03 on Windows, in case that makes a difference.

If anyone can tell me whether I am misunderstanding something or whether this is a bug, I'd appreciate it greatly.


Philip Baker




