[erlang-questions] [erlang-bugs] xmerl and unicode data
Anthony Ramine
n.oxyde@REDACTED
Fri Oct 19 17:24:27 CEST 2012
On 19 Oct 2012, at 17:18, Patrik Nyblom wrote:
> On 10/19/2012 05:01 PM, Anthony Ramine wrote:
>> On 19 Oct 2012, at 15:58, Ali Sabil wrote:
>>
>>> Hi all,
>>>
>>> I was wondering if anyone came across the following behaviour?
>>>
>>>
>>> Erlang R15B02 (erts-5.9.2) [source] [64-bit] [smp:4:4]
>>> [async-threads:0] [hipe] [kernel-poll:false] [dtrace]
>>>
>>> Eshell V5.9.2 (abort with ^G)
>>> 1> xmerl_scan:string("<?xml version=\"1.0\"
>>> encoding=\"utf-8\"?><test>你好 Björk</test>").
>>> 3414- fatal: {error,{wfc_Legal_Character,{error,{bad_character,20320}}}}
>>> ** exception exit:
>>> {fatal,{{error,{wfc_Legal_Character,{error,{bad_character,20320}}}},
>>> {file,file_name_unknown},
>>> {line,1},
>>> {col,47}}}
>>> in function xmerl_scan:fatal/2 (xmerl_scan.erl, line 4102)
>>> in call from xmerl_scan:scan_char_data/5 (xmerl_scan.erl, line 2703)
>>> in call from xmerl_scan:scan_content/11 (xmerl_scan.erl, line 2615)
>>> in call from xmerl_scan:scan_element/12 (xmerl_scan.erl, line 2128)
>>> in call from xmerl_scan:scan_document/2 (xmerl_scan.erl, line 570)
>>> in call from xmerl_scan:string/2 (xmerl_scan.erl, line 286)
>>> 2>
>>> 2> xmerl_scan:string("<?xml version=\"1.0\"
>>> encoding=\"utf-8\"?><test>你好 Björk</test>", [{encoding, latin1}]).
>>> {{xmlElement,test,test,[],
>>> {xmlNamespace,[],[]},
>>> [],1,[],
>>> [{xmlText,[{test,1}],
>>> 1,[],
>>> [20320,22909,32,66,106,246,114,107],
>>> text}],
>>> [],"/Users/asabil/test",
>>> undeclared},
>>> []}
>>> 3>
>>> 3> io:getopts().
>>> [{expand_fun,#Fun<group.0.129081181>},
>>> {echo,true},
>>> {binary,false},
>>> {encoding,unicode}]
>>>
>>>
>>> Thanks,
>>> Ali
>> Hi,
>>
>> From my vague recollection of xmerl's innards, I'm pretty sure it happens
>> because xmerl_scan:string expects a list of bytes and does not check whether
>> a given byte is valid latin1.
>>
>> Regards,
>>
> That's right, or rather: the XML declaration says the document is UTF-8 encoded, but what you pass in is a Unicode string (a list of Unicode code points). Converting it to a list of UTF-8 bytes would do the trick:
>
> xmerl_scan:string(binary_to_list(unicode:characters_to_binary("<?xml version=\"1.0\" encoding=\"utf-8\"?><test>你好 Björk</test>"))).
>
> xmerl_scan:string actually takes a list of bytes (in this case UTF-8 encoded characters), which is not the same as a Unicode string in Erlang...
Oh, I had forgotten it also accepts binaries. That explains everything: it accepts an iolist, which is a possibly nested, possibly improper list of bytes and binaries.
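
For reference, here is a minimal sketch of the conversion Patrik describes, wrapped in a small helper. The module name xmerl_utf8_demo and the functions scan_unicode/1 and demo/0 are hypothetical names chosen for illustration and are not part of xmerl. The test text is written as numeric code points (the same ones shown in Ali's xmlText output) so the example does not depend on the encoding of the source file.

```erlang
-module(xmerl_utf8_demo).
-export([scan_unicode/1, demo/0]).

%% Hypothetical helper, not part of xmerl: takes a string given as a list
%% of Unicode code points, encodes it as UTF-8, and passes the resulting
%% list of bytes to xmerl_scan:string/1.
scan_unicode(Chars) ->
    Bytes = binary_to_list(unicode:characters_to_binary(Chars)),
    xmerl_scan:string(Bytes).

%% Builds the document from the original report and scans it.
demo() ->
    %% Code points for the text in <test>: two CJK characters, a space,
    %% then "Bj", o-with-diaeresis (246), "rk".
    Text = [20320, 22909, 32, 66, 106, 246, 114, 107],
    Doc = "<?xml version=\"1.0\" encoding=\"utf-8\"?><test>"
          ++ Text ++ "</test>",
    {Root, _Rest} = scan_unicode(Doc),
    Root.
```

The difference between the two representations is visible in the shell: unicode:characters_to_binary([20320]) yields <<228,189,160>>, so the single code point 20320 that xmerl_scan rejected becomes three bytes that form a valid UTF-8 sequence.
").
</test>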
--
Anthony Ramine