[erlang-questions] Strange behaviour in unicode:characters_to_list/2

Robert Virding <>
Thu Jun 2 02:23:44 CEST 2011


A quick comment: test 12 returns false because the values returned in tests 8 and 10 differ. The real question, of course, is why they differ, since they should be the same.

I can confirm your results when running from the shell. Here is a session with some extra examples:

34> B = << 0 , 195 , 160 , 98 , 99 , 129 , 0 , 0 , 0 >>.
<<0,195,160,98,99,129,0,0,0>>
35> B1 = << 195 , 160 , 98 , 99 , 129 , 0 , 0 , 0 , 0 >>.
<<195,160,98,99,129,0,0,0,0>>
36> <<_:1/binary,Rest3/binary>> = B.
<<0,195,160,98,99,129,0,0,0>>
37> Rest4 = list_to_binary([Rest3]).
<<195,160,98,99,129,0,0,0>>
38> Rest3==Rest4.
true 
41> Rest5 = binary:copy(Rest3).                         
<<195,160,98,99,129,0,0,0>>
42> <<Rest6:8/binary,_/binary>> = B1.          
<<195,160,98,99,129,0,0,0,0>>
44> unicode:characters_to_list(Rest3,utf8).
{error,"àbc",<<99,129,0,0>>}
45> unicode:characters_to_list(Rest4,utf8).
{error,"àbc",<<129,0,0,0>>}
46> unicode:characters_to_list(Rest5,utf8).
{error,"àbc",<<129,0,0,0>>}
48> unicode:characters_to_list(Rest6,utf8).   
{error,"àbc",<<129,0,0,0>>}

So Rest3 does behave differently *in this case* from other binaries that contain the same bytes but were created in a different way. Interestingly, if Rest3 is sent to another process, it behaves normally there.

My partially educated guess is that when Rest3 is created that way (by matching it out of a larger binary), there is some "rest" left in it which causes unicode:characters_to_list/2 to behave strangely. This goes away when the binary is copied in any way. But I am just guessing.
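
If the guess about copying is right, forcing a fresh copy before decoding should give the same answer as tests 1, 3, 5 and 10. A minimal sketch for checking that, using only binary:copy/1 and unicode:characters_to_list/2; the module name subbin_check and both function names are made up for illustration and are not part of OTP:

```erlang
-module(subbin_check).
-export([decode_copied/1, same_result/1]).

%% Hypothetical helper: decode a UTF-8 binary after forcing a fresh copy
%% with binary:copy/1, so the argument is never a piece matched out of a
%% larger binary.
decode_copied(Bin) when is_binary(Bin) ->
    unicode:characters_to_list(binary:copy(Bin), utf8).

%% Hypothetical helper: compare decoding the binary as given with
%% decoding a fresh copy of it. The two calls should always agree.
same_result(Bin) when is_binary(Bin) ->
    unicode:characters_to_list(Bin, utf8) =:= decode_copied(Bin).
```

Calling subbin_check:same_result(Rest3), with Rest3 matched out of the larger binary as in test 6, should return true; a false there would point at the same problem as test 12.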

I am running R14B02 on Mac OSX.

Robert

----- "Philip Baker" <> wrote:

> I've only been using erlang for a couple of days, so I may be missing
> something, but I've come across some behavior that I find baffling
> with the unicode:characters_to_list function, when it is passed a
> binary with an illegal utf8 character.
> 
> Here is some code that demonstrates it:
> 
> -define(TESTCALL(N, Call), io:format("~p:  ~s: ~n\t\t~p~n~n", [N,
> ??Call, Call]), timer:sleep(1)).
> 
> test2( )->
>     ?TESTCALL(1, unicode:characters_to_list(<<16#c3, 16#a0, 98, 99,
> 16#81,0,0,0 >>, utf8)),
>     ?TESTCALL(2, Rest = <<16#c3, 16#a0, 98, 99, 16#81,0,0,0 >>), 
>     ?TESTCALL(3, unicode:characters_to_list(Rest, utf8)),
>     ?TESTCALL(4, <<Rest2/binary>> = <<16#c3, 16#a0, 98, 99,
> 16#81,0,0,0 >>), 
>     ?TESTCALL(5, unicode:characters_to_list(Rest2, utf8)),
>     ?TESTCALL(6, <<_:1/binary, Rest3/binary>> = <<0, 16#c3, 16#a0, 98,
> 99, 16#81,0,0,0 >>), 
>     ?TESTCALL(7, Rest3), 
>     ?TESTCALL(8, unicode:characters_to_list(Rest3, utf8)),
>     ?TESTCALL(9, Rest4 = list_to_binary([Rest3])),
>     ?TESTCALL(10, unicode:characters_to_list(Rest4, utf8)),
>     ?TESTCALL(11, Rest4 =:= Rest3),
>     ?TESTCALL(12, unicode:characters_to_list(Rest3, utf8) =:=
> unicode:characters_to_list(Rest4, utf8)).
> 
> Calling test2() gives this output:
> 
> 1:  unicode : characters_to_list ( << 195 , 160 , 98 , 99 , 129 , 0 ,
> 0 , 0 >> , utf8 ): 
> 		{error,"àbc",<<129,0,0,0>>}
> 
> 2:  Rest = << 195 , 160 , 98 , 99 , 129 , 0 , 0 , 0 >>: 
> 		<<195,160,98,99,129,0,0,0>>
> 
> 3:  unicode : characters_to_list ( Rest , utf8 ): 
> 		{error,"àbc",<<129,0,0,0>>}
> 
> 4:  << Rest2 / binary >> = << 195 , 160 , 98 , 99 , 129 , 0 , 0 , 0
> >>: 
> 		<<195,160,98,99,129,0,0,0>>
> 
> 5:  unicode : characters_to_list ( Rest2 , utf8 ): 
> 		{error,"àbc",<<129,0,0,0>>}
> 
> 6:  << _ : 1 / binary , Rest3 / binary >> = << 0 , 195 , 160 , 98 , 99
> , 129 , 0 , 0 , 0 >>: 
> 		<<0,195,160,98,99,129,0,0,0>>
> 
> 7:  Rest3: 
> 		<<195,160,98,99,129,0,0,0>>
> 
> 8:  unicode : characters_to_list ( Rest3 , utf8 ): 
> 		{error,"àbc",<<99,129,0,0>>}
> 
> 9:  Rest4 = list_to_binary ( [ Rest3 ] ): 
> 		<<195,160,98,99,129,0,0,0>>
> 
> 10:  unicode : characters_to_list ( Rest4 , utf8 ): 
> 		{error,"àbc",<<129,0,0,0>>}
> 
> 11:  Rest4 =:= Rest3: 
> 		true
> 
> 12:  unicode : characters_to_list ( Rest3 , utf8 ) =:= unicode :
> characters_to_list ( Rest4 , utf8 ): 
> 		false
> 
> 
> Looking at lines 11 and 12, I don't understand how if Rest4 and Rest3
> are equal, the calls to characters_to_list could give unequal results.
> 
> 
> The earlier lines demonstrate various ways I have constructed binaries
> to pass to characters_to_list. In most cases the results are what I
> expect, but lines 6-8 show that when I use pattern matching to extract
> it from a larger binary, characters_to_list produces output where the
> letter 'c' is included in both the converted list, and the unconverted
> "RestData" binary. In lines 9 and 10, I create an apparently identical
> binary using list_to_binary/1 with a list containing only the "broken"
> binary. This new binary produces the correct output again.
> 
> I'm using R14B03 on Windows, in case that makes a difference.
> 
> If anyone can tell me if there is something I am misunderstanding, or
> if this is a bug, I'd appreciate it greatly. 
> 
> 
> Philip Baker
> Software Developer
> 
> Cassidian Communications, an EADS North America Company 
> 75 Boul. de la Technologie
> Gatineau, Québec
> Canada, J8Z 3G4
> 819.778.2053, x243 DIRECT
> www.CassidianCommunications.com
> 
> 
> _______________________________________________
> erlang-questions mailing list
> 
> http://erlang.org/mailman/listinfo/erlang-questions


