[erlang-questions] [erlang-bugs] xmerl and unicode data

Anthony Ramine n.oxyde@REDACTED
Fri Oct 19 17:24:27 CEST 2012


On 19 Oct 2012, at 17:18, Patrik Nyblom wrote:

> On 10/19/2012 05:01 PM, Anthony Ramine wrote:
>> On 19 Oct 2012, at 15:58, Ali Sabil wrote:
>> 
>>> Hi all,
>>> 
>>> I was wondering if anyone came across the following behaviour?
>>> 
>>> 
>>> Erlang R15B02 (erts-5.9.2) [source] [64-bit] [smp:4:4]
>>> [async-threads:0] [hipe] [kernel-poll:false] [dtrace]
>>> 
>>> Eshell V5.9.2  (abort with ^G)
>>> 1>  xmerl_scan:string("<?xml version=\"1.0\"
>>> encoding=\"utf-8\"?><test>你好 Björk</test>").
>>> 3414- fatal: {error,{wfc_Legal_Character,{error,{bad_character,20320}}}}
>>> ** exception exit:
>>> {fatal,{{error,{wfc_Legal_Character,{error,{bad_character,20320}}}},
>>>                           {file,file_name_unknown},
>>>                           {line,1},
>>>                           {col,47}}}
>>>     in function  xmerl_scan:fatal/2 (xmerl_scan.erl, line 4102)
>>>     in call from xmerl_scan:scan_char_data/5 (xmerl_scan.erl, line 2703)
>>>     in call from xmerl_scan:scan_content/11 (xmerl_scan.erl, line 2615)
>>>     in call from xmerl_scan:scan_element/12 (xmerl_scan.erl, line 2128)
>>>     in call from xmerl_scan:scan_document/2 (xmerl_scan.erl, line 570)
>>>     in call from xmerl_scan:string/2 (xmerl_scan.erl, line 286)
>>> 2>
>>> 2>  xmerl_scan:string("<?xml version=\"1.0\"
>>> encoding=\"utf-8\"?><test>你好 Björk</test>", [{encoding, latin1}]).
>>> {{xmlElement,test,test,[],
>>>             {xmlNamespace,[],[]},
>>>             [],1,[],
>>>             [{xmlText,[{test,1}],
>>>                       1,[],
>>>                       [20320,22909,32,66,106,246,114,107],
>>>                       text}],
>>>             [],"/Users/asabil/test",
>>>             undeclared},
>>> []}
>>> 3>
>>> 3>  io:getopts().
>>> [{expand_fun,#Fun<group.0.129081181>},
>>> {echo,true},
>>> {binary,false},
>>> {encoding,unicode}]
>>> 
>>> 
>>> Thanks,
>>> Ali
>> Hi,
>> 
>> From my vague memories of xmerl's internals, I'm pretty sure it happens
>> because xmerl_scan:string expects a list of bytes and does not check whether
>> a given byte is valid latin1.
>> 
>> Regards,
>> 
> That's right, or rather: the XML declaration says the document is UTF-8 encoded, but what you pass in is a Unicode string (a list of Unicode code points). Converting it to a list of UTF-8 bytes would do the trick:
> 
> xmerl_scan:string(binary_to_list(unicode:characters_to_binary("<?xml version=\"1.0\" encoding=\"utf-8\"?><test>你好 Björk</test>"))).
> 
> xmerl_scan:string actually takes a list of bytes (in this case UTF-8 encoded characters), which is not the same as a Unicode string in Erlang...

Oh, I had forgotten it also accepts binaries. That explains everything: it accepts an iolist, which is a possibly nested, possibly improper list of bytes and binaries.
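
To make the distinction concrete, here is a minimal sketch of the difference between a list of code points and a list of UTF-8 bytes, plus what flattening an iolist looks like. The module name xmerl_utf8_demo and its function names are made up for illustration; only unicode:characters_to_binary/1, binary_to_list/1, iolist_to_binary/1 and xmerl_scan:string/1 are real OTP calls. The text is written as explicit code points so the example does not depend on the source file's encoding.

```erlang
-module(xmerl_utf8_demo).
-export([to_utf8_bytes/1, scan_unicode/1, flatten_iolist/1]).

%% Hypothetical helper: turn a list of Unicode code points into a flat
%% list of UTF-8 bytes, which is what Patrik's one-liner does inline.
to_utf8_bytes(Chars) ->
    binary_to_list(unicode:characters_to_binary(Chars)).

%% Hypothetical helper: encode first, then hand the bytes to xmerl.
scan_unicode(Chars) ->
    xmerl_scan:string(to_utf8_bytes(Chars)).

%% An iolist may nest lists and may end in a binary tail;
%% iolist_to_binary/1 flattens it into a single binary.
flatten_iolist(IoList) ->
    iolist_to_binary(IoList).
```

With the text node from Ali's second call, xmerl_utf8_demo:to_utf8_bytes([20320,22909,32,66,106,246,114,107]) gives [228,189,160,229,165,189,32,66,106,195,182,114,107]: every element now fits in a byte, whereas 20320 and 22909 in the original list do not.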

-- 
Anthony Ramine



