[erlang-questions] xmerl and "encoded" utf-8 attributes ?

Alexandre Snarskii snar@REDACTED
Tue Aug 20 09:08:48 CEST 2013


Hi!

A one-liner example, taking an attribute value from XML:

26> Xl= <<"<item name='100 &#x441;&#x442; and more' />">>, 
   {El, _} = xmerl_scan:string(binary_to_list(Xl)), 
   [A]=El#xmlElement.attributes, Avb=list_to_binary(A#xmlAttribute.value).
<<49,48,48,32,129,209,130,209,32,97,110,100,32,109,111,114,101>>

The conversion to Unicode then fails after the first four bytes:

27> unicode:characters_to_list(Avb).                                         

{error,"100 ", <<129,209,130,209,32,97,110,100,32,109,111,114,101>>}
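The point where decoding stops can be checked by looking at the high bits
of each byte. This is only a rough sketch: Kind is a helper name made up
for illustration, and it deliberately lumps three- and four-byte lead
bytes together as "other".

```erlang
%% Classify a byte by its high bits (UTF-8 lead vs. continuation).
%% Kind is a hypothetical helper, not part of any library.
Kind = fun(B) when B band 2#10000000 =:= 0          -> ascii;
          (B) when B band 2#11000000 =:= 2#10000000 -> continuation;
          (B) when B band 2#11100000 =:= 2#11000000 -> lead_of_2;
          (_)                                        -> other
       end.
%% The bytes right after "100" in Avb: 32,129,209,130,209.
Kinds = [Kind(B) || B <- [32, 129, 209, 130, 209]].
%% Kinds is [ascii,continuation,lead_of_2,continuation,lead_of_2]
```

So 129 is a continuation byte with no lead byte in front of it, which
would explain why the decoder gives up exactly there.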

What I think is wrong here: the first four bytes (49,48,48,32) are fine 
("100 " in binary), and the last bytes (32,97,110,100,32,109,111,114,101) 
are fine too (" and more"), but the middle four look like "inverted utf8" 
with the first and second byte of each pair swapped. Indeed, per the UTF-8 
specification, the encoding for 16#441 should be 

[ 2#110 + <upper five bits>, 2#10 + <lower six> ], 

or 

30> [2#11000000 bor (16#441 bsr 6), 2#10000000 bor (16#441 band 2#0111111)].
[209,129]

and for 16#442: 

31> [2#11000000 bor (16#442 bsr 6), 2#10000000 bor (16#442 band 2#0111111)].
[209,130]

but in the binary shown above these bytes are swapped: 129,209 and 130,209.
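As a cross-check of the hand calculation, the same code points can be run
through the standard library and the bit syntax. A minimal sketch: Enc2 is
a made-up helper that only covers the two-byte range (16#80..16#7FF);
unicode:characters_to_binary/1 and the /utf8 bit syntax type are standard.

```erlang
%% Hand-rolled encoder for the two-byte UTF-8 range only (16#80..16#7FF).
%% Enc2 is a hypothetical helper used for illustration.
Enc2 = fun(C) when C >= 16#80, C =< 16#7FF ->
           [2#11000000 bor (C bsr 6), 2#10000000 bor (C band 2#00111111)]
       end.
Manual = list_to_binary([Enc2(16#441), Enc2(16#442)]).
%% The same code points through the standard library and the bit syntax.
Lib  = unicode:characters_to_binary([16#441, 16#442]).
Bits = <<16#441/utf8, 16#442/utf8>>.
%% All three are <<209,129,209,130>>: lead byte 209 first in each pair.
true = (Manual =:= Lib) andalso (Lib =:= Bits).
```

All three agree on 209,129 and 209,130, i.e. the lead byte comes first.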

After reordering the bytes by hand, everything is fine: 

29> Avc = <<49,48,48,32,209,129,209,130,32,97,110,100,32,109,111,114,101>>.
<<49,48,48,32,209,129,209,130,32,97,110,100,32,109,111,114,101>>
30> unicode:characters_to_list(Avc).                                         
[49,48,48,32,1089,1090,32,97,110,100,32,109,111,114,101]

Bug? Feature? Or am I misunderstanding how this is supposed to work?

-- 
In theory, there is no difference between theory and practice. 
But, in practice, there is. 



