[erlang-questions] xmerl and "encoded" utf-8 attributes ?
Alexandre Snarskii
snar@REDACTED
Tue Aug 20 09:08:48 CEST 2013
Hi!
A one-liner example that takes an attribute value from XML:
26> Xl= <<"<item name='100 ст and more' />">>,
{El, _} = xmerl_scan:string(binary_to_list(Xl)),
[A]=El#xmlElement.attributes, Avb=list_to_binary(A#xmlAttribute.value).
<<49,48,48,32,129,209,130,209,32,97,110,100,32,109,111,114,101>>
The conversion to Unicode then fails after the first four bytes:
27> unicode:characters_to_list(Avb).
{error,"100 ", <<129,209,130,209,32,97,110,100,32,109,111,114,101>>}
What I think is wrong here: the first four bytes (49,48,48,32) are fine
("100 " in binary), and the last nine bytes (32,97,110,100,32,109,111,114,101)
are fine too (" and more"), but the middle four look like "inverted UTF-8",
with the first and second byte of each character swapped. Indeed, per the
UTF-8 specification, the encoding of 16#441 should be
[ 2#110 + <upper five bits>, 2#10 + <lower six> ],
or
30> [2#11000000 bor (16#441 bsr 6), 2#10000000 bor (16#441 band 2#0111111)].
[209,129]
and for 16#442:
31> [2#11000000 bor (16#442 bsr 6), 2#10000000 bor (16#442 band 2#0111111)].
[209,130]
but in the binary returned above these bytes appear in the order 129,209 and 130,209.
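The hand calculation above can be cross-checked against OTP's own encoder. This is a minimal sketch, assuming a code point in the two-byte range 16#80..16#7FF; the module name utf8_check and both function names are illustrative and not part of xmerl or OTP.

```erlang
-module(utf8_check).
-export([encode2/1, matches_otp/1]).

%% Hand-encode a code point in the two-byte UTF-8 range (16#80..16#7FF):
%% lead byte is 2#110 followed by the upper five bits,
%% continuation byte is 2#10 followed by the lower six bits.
encode2(Cp) when Cp >= 16#80, Cp =< 16#7FF ->
    <<(2#11000000 bor (Cp bsr 6)), (2#10000000 bor (Cp band 2#00111111))>>.

%% Compare the hand-rolled result with unicode:characters_to_binary/1,
%% which encodes a list of code points as UTF-8 by default.
matches_otp(Cp) ->
    encode2(Cp) =:= unicode:characters_to_binary([Cp]).
```

With this module compiled, utf8_check:encode2(16#441) returns <<209,129>> and utf8_check:matches_otp(16#441) returns true, so the lead byte 209 should come first.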
After reordering the bytes manually, everything is fine:
29> Avc = <<49,48,48,32,209,129,209,130,32,97,110,100,32,109,111,114,101>>.
<<49,48,48,32,209,129,209,130,32,97,110,100,32,109,111,114,101>>
30> unicode:characters_to_list(Avc).
[49,48,48,32,1089,1090,32,97,110,100,32,109,111,114,101]
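The manual reordering can also be written as a small function, which may help when checking longer values. This is a diagnostic sketch only, not a fix for xmerl: the module swap_fix and the function unswap2/1 are hypothetical names, and the code assumes that every two-byte character in the input really has its bytes swapped.

```erlang
-module(swap_fix).
-export([unswap2/1]).

%% Diagnostic sketch: wherever a continuation byte (2#10xxxxxx) is
%% immediately followed by a two-byte lead byte (2#110xxxxx), emit the
%% two bytes in the opposite order. All other bytes pass through as-is.
unswap2(<<C, L, Rest/binary>>) when C band 2#11000000 =:= 2#10000000,
                                     L band 2#11100000 =:= 2#11000000 ->
    <<L, C, (unswap2(Rest))/binary>>;
unswap2(<<B, Rest/binary>>) ->
    <<B, (unswap2(Rest))/binary>>;
unswap2(<<>>) ->
    <<>>.
```

Applied to the value from xmerl, swap_fix:unswap2(Avb) yields the same binary as Avc. It must not be applied to input that is already valid UTF-8: two adjacent two-byte characters such as 209,129,209,130 would have their middle bytes swapped and become corrupted.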
Bug? Feature? Or my misunderstanding of how it is supposed to work?
--
In theory, there is no difference between theory and practice.
But, in practice, there is.