[erlang-questions] [erlang-bugs] xmerl and unicode data
Ali Sabil
ali.sabil@REDACTED
Sat Oct 20 16:13:07 CEST 2012
On Fri, Oct 19, 2012 at 5:18 PM, Patrik Nyblom <pan@REDACTED> wrote:
> On 10/19/2012 05:01 PM, Anthony Ramine wrote:
>>
>> On 19 Oct 2012, at 15:58, Ali Sabil wrote:
>>
>>> Hi all,
>>>
>>> I was wondering if anyone has come across the following behaviour:
>>>
>>>
>>> Erlang R15B02 (erts-5.9.2) [source] [64-bit] [smp:4:4]
>>> [async-threads:0] [hipe] [kernel-poll:false] [dtrace]
>>>
>>> Eshell V5.9.2 (abort with ^G)
>>> 1> xmerl_scan:string("<?xml version=\"1.0\"
>>> encoding=\"utf-8\"?><test>你好 Björk</test>").
>>> 3414- fatal: {error,{wfc_Legal_Character,{error,{bad_character,20320}}}}
>>> ** exception exit:
>>> {fatal,{{error,{wfc_Legal_Character,{error,{bad_character,20320}}}},
>>> {file,file_name_unknown},
>>> {line,1},
>>> {col,47}}}
>>> in function xmerl_scan:fatal/2 (xmerl_scan.erl, line 4102)
>>> in call from xmerl_scan:scan_char_data/5 (xmerl_scan.erl, line 2703)
>>> in call from xmerl_scan:scan_content/11 (xmerl_scan.erl, line 2615)
>>> in call from xmerl_scan:scan_element/12 (xmerl_scan.erl, line 2128)
>>> in call from xmerl_scan:scan_document/2 (xmerl_scan.erl, line 570)
>>> in call from xmerl_scan:string/2 (xmerl_scan.erl, line 286)
>>> 2>
>>> 2> xmerl_scan:string("<?xml version=\"1.0\"
>>> encoding=\"utf-8\"?><test>你好 Björk</test>", [{encoding, latin1}]).
>>> {{xmlElement,test,test,[],
>>> {xmlNamespace,[],[]},
>>> [],1,[],
>>> [{xmlText,[{test,1}],
>>> 1,[],
>>> [20320,22909,32,66,106,246,114,107],
>>> text}],
>>> [],"/Users/asabil/test",
>>> undeclared},
>>> []}
>>> 3>
>>> 3> io:getopts().
>>> [{expand_fun,#Fun<group.0.129081181>},
>>> {echo,true},
>>> {binary,false},
>>> {encoding,unicode}]
>>>
>>>
>>> Thanks,
>>> Ali
>>
>> Hi,
>>
>> From my vague memories of xmerl's innards, I'm pretty sure it happens
>> because xmerl_scan:string expects a list of bytes and does not check
>> whether a given byte is valid latin1.
>>
>> Regards,
>>
> That's right, or rather, you say it's UTF-8 encoded, but it's a Unicode
> string (with Unicode code points). Converting it to a list of UTF-8 bytes
> would do the trick:
>
> xmerl_scan:string(binary_to_list(unicode:characters_to_binary("<?xml
> version=\"1.0\" encoding=\"utf-8\"?><test>你好 Björk</test>"))).
>
> xmerl_scan:string actually takes a list of bytes (in this case UTF-8 encoded
> characters), which is not the same as a Unicode string in Erlang...
Thank you very much. However, I find it odd that the output is then a
Unicode string instead of a list of UTF-8 encoded bytes:
4> xmerl_scan:string(binary_to_list(unicode:characters_to_binary("<?xml
version=\"1.0\" encoding=\"utf-8\"?><test>你好 Björk</test>"))).
{{xmlElement,test,test,[],
{xmlNamespace,[],[]},
[],1,[],
[{xmlText,[{test,1}],
1,[],
[20320,22909,32,66,106,246,114,107],
text}],
[],"/Users/asabil/test",
undeclared},
[]}
Unless I am mistaken, the output is
[20320,22909,32,66,106,246,114,107], which is a list of code points, i.e.
a Unicode string.
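To make the distinction concrete, here is a minimal sketch of the two
representations of the same text. The module and function names
(utf8_demo, to_utf8_bytes/1, from_utf8_bytes/1) are made up for
illustration; only unicode:characters_to_binary/1 and
unicode:characters_to_list/2 are standard OTP calls. It assumes the input
is valid (a flat list of code points, or well-formed UTF-8 bytes) and does
not handle the error tuples those calls can return.

```erlang
-module(utf8_demo).
-export([to_utf8_bytes/1, from_utf8_bytes/1]).

%% Encode a list of Unicode code points as a list of UTF-8 bytes,
%% i.e. the form suggested above as input to xmerl_scan:string/1.
%% Assumes the input is a valid list of code points.
to_utf8_bytes(CodePoints) ->
    binary_to_list(unicode:characters_to_binary(CodePoints)).

%% Decode a list of UTF-8 bytes back into a list of code points.
%% Assumes the input is well-formed UTF-8.
from_utf8_bytes(Bytes) ->
    unicode:characters_to_list(list_to_binary(Bytes), utf8).
```

With this, utf8_demo:to_utf8_bytes([20320,22909,32,66,106,246,114,107])
gives [228,189,160,229,165,189,32,66,106,195,182,114,107], and
from_utf8_bytes/1 on that list gives the original code points back. The
same to_utf8_bytes/1 conversion could be applied to the xmlText value if a
list of UTF-8 bytes is what is needed downstream.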
Thanks again,
Ali