[erlang-questions] [erlang-bugs] xmerl and unicode data
Patrik Nyblom
pan@REDACTED
Fri Oct 19 17:18:48 CEST 2012
On 10/19/2012 05:01 PM, Anthony Ramine wrote:
> On 19 Oct 2012, at 15:58, Ali Sabil wrote:
>
>> Hi all,
>>
>> I was wondering if anyone came across the following behaviour?
>>
>>
>> Erlang R15B02 (erts-5.9.2) [source] [64-bit] [smp:4:4]
>> [async-threads:0] [hipe] [kernel-poll:false] [dtrace]
>>
>> Eshell V5.9.2 (abort with ^G)
>> 1> xmerl_scan:string("<?xml version=\"1.0\" encoding=\"utf-8\"?><test>你好 Björk</test>").
>> 3414- fatal: {error,{wfc_Legal_Character,{error,{bad_character,20320}}}}
>> ** exception exit:
>> {fatal,{{error,{wfc_Legal_Character,{error,{bad_character,20320}}}},
>> {file,file_name_unknown},
>> {line,1},
>> {col,47}}}
>> in function xmerl_scan:fatal/2 (xmerl_scan.erl, line 4102)
>> in call from xmerl_scan:scan_char_data/5 (xmerl_scan.erl, line 2703)
>> in call from xmerl_scan:scan_content/11 (xmerl_scan.erl, line 2615)
>> in call from xmerl_scan:scan_element/12 (xmerl_scan.erl, line 2128)
>> in call from xmerl_scan:scan_document/2 (xmerl_scan.erl, line 570)
>> in call from xmerl_scan:string/2 (xmerl_scan.erl, line 286)
>> 2>
>> 2> xmerl_scan:string("<?xml version=\"1.0\" encoding=\"utf-8\"?><test>你好 Björk</test>", [{encoding, latin1}]).
>> {{xmlElement,test,test,[],
>> {xmlNamespace,[],[]},
>> [],1,[],
>> [{xmlText,[{test,1}],
>> 1,[],
>> [20320,22909,32,66,106,246,114,107],
>> text}],
>> [],"/Users/asabil/test",
>> undeclared},
>> []}
>> 3>
>> 3> io:getopts().
>> [{expand_fun,#Fun<group.0.129081181>},
>> {echo,true},
>> {binary,false},
>> {encoding,unicode}]
>>
>>
>> Thanks,
>> Ali
> Hi,
>
> From my vague memories of xmerl's internals, I'm pretty sure this happens
> because xmerl_scan:string expects a list of bytes and does not check whether
> a given byte is valid latin1.
>
> Regards,
>
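That's right, or rather: the XML declaration says the document is UTF-8
encoded, but what you pass in is an Erlang Unicode string, i.e. a list of
Unicode code points (20320 for 你, for example), not a list of UTF-8 bytes.
Converting it to a list of UTF-8 bytes does the trick: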
xmerl_scan:string(binary_to_list(unicode:characters_to_binary(
    "<?xml version=\"1.0\" encoding=\"utf-8\"?><test>你好 Björk</test>"))).
xmerl_scan:string actually takes a list of bytes (in this case UTF-8 encoded
characters), which is not the same as a Unicode string in Erlang...
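
To make the difference between the two representations concrete, here is a
minimal sketch. The module name xmerl_utf8_demo and the helpers
to_utf8_bytes/1, scan/1 and demo/0 are hypothetical names chosen for
illustration; only unicode:characters_to_binary/1,
unicode:characters_to_list/1 and xmerl_scan:string/1 are real library calls.
The text is written as integers (the same code points shown in the shell
session above) so the example does not depend on the source file encoding.

```erlang
-module(xmerl_utf8_demo).
-export([to_utf8_bytes/1, scan/1, demo/0]).

%% Hypothetical helper: encode a list of Unicode code points as a list of
%% UTF-8 bytes, which is the kind of input xmerl_scan:string/1 is given
%% in the workaround above.
to_utf8_bytes(Chars) ->
    binary_to_list(unicode:characters_to_binary(Chars)).

%% Hypothetical wrapper: scan an XML document held as a Unicode string.
%% Assumes the document declares encoding="utf-8" (or declares nothing).
scan(Chars) ->
    xmerl_scan:string(to_utf8_bytes(Chars)).

demo() ->
    %% "你好 Björk" as code points: 你 = 20320, 好 = 22909, ö = 246.
    Text = [20320, 22909, 32, 66, 106, 246, 114, 107],
    Bytes = to_utf8_bytes(Text),
    %% 你 and 好 become three bytes each and ö becomes two, so 8 code
    %% points turn into 13 bytes:
    %% [228,189,160, 229,165,189, 32, 66, 106, 195,182, 114, 107]
    13 = length(Bytes),
    %% After encoding, every element fits in a single byte.
    true = lists:all(fun(B) -> B >= 0 andalso B =< 255 end, Bytes),
    %% Decoding the bytes again gives back the original code points.
    Text = unicode:characters_to_list(list_to_binary(Bytes)),
    Doc = "<?xml version=\"1.0\" encoding=\"utf-8\"?><test>"
          ++ Text ++ "</test>",
    {Root, _Rest} = scan(Doc),
    Root.
```

Calling xmerl_utf8_demo:demo() should return the parsed root element rather
than exiting with wfc_Legal_Character, since the scanner now sees UTF-8 bytes
that match the declared encoding.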