[erlang-questions] [erlang-bugs] xmerl and unicode data

Patrik Nyblom pan@REDACTED
Fri Oct 19 17:18:48 CEST 2012


On 10/19/2012 05:01 PM, Anthony Ramine wrote:
> On 19 Oct 2012, at 15:58, Ali Sabil wrote:
>
>> Hi all,
>>
>> I was wondering if anyone came across the following behaviour?
>>
>>
>> Erlang R15B02 (erts-5.9.2) [source] [64-bit] [smp:4:4]
>> [async-threads:0] [hipe] [kernel-poll:false] [dtrace]
>>
>> Eshell V5.9.2  (abort with ^G)
>> 1>  xmerl_scan:string("<?xml version=\"1.0\"
>> encoding=\"utf-8\"?><test>你好 Björk</test>").
>> 3414- fatal: {error,{wfc_Legal_Character,{error,{bad_character,20320}}}}
>> ** exception exit:
>> {fatal,{{error,{wfc_Legal_Character,{error,{bad_character,20320}}}},
>>                            {file,file_name_unknown},
>>                            {line,1},
>>                            {col,47}}}
>>      in function  xmerl_scan:fatal/2 (xmerl_scan.erl, line 4102)
>>      in call from xmerl_scan:scan_char_data/5 (xmerl_scan.erl, line 2703)
>>      in call from xmerl_scan:scan_content/11 (xmerl_scan.erl, line 2615)
>>      in call from xmerl_scan:scan_element/12 (xmerl_scan.erl, line 2128)
>>      in call from xmerl_scan:scan_document/2 (xmerl_scan.erl, line 570)
>>      in call from xmerl_scan:string/2 (xmerl_scan.erl, line 286)
>> 2>
>> 2>  xmerl_scan:string("<?xml version=\"1.0\"
>> encoding=\"utf-8\"?><test>你好 Björk</test>", [{encoding, latin1}]).
>> {{xmlElement,test,test,[],
>>              {xmlNamespace,[],[]},
>>              [],1,[],
>>              [{xmlText,[{test,1}],
>>                        1,[],
>>                        [20320,22909,32,66,106,246,114,107],
>>                        text}],
>>              [],"/Users/asabil/test",
>>              undeclared},
>> []}
>> 3>
>> 3>  io:getopts().
>> [{expand_fun,#Fun<group.0.129081181>},
>> {echo,true},
>> {binary,false},
>> {encoding,unicode}]
>>
>>
>> Thanks,
>> Ali
> Hi,
>
> From what I vaguely remember of xmerl's internals, I'm pretty sure it happens
> because xmerl_scan:string expects a list of bytes and does not check whether
> a given byte is valid latin1.
>
> Regards,
>
That's right. Or rather: the XML declaration says the input is UTF-8
encoded, but what you actually pass is a Unicode string (a list of Unicode
code points), not UTF-8 bytes. Converting it to a list of UTF-8 bytes
would do the trick:

  xmerl_scan:string(binary_to_list(unicode:characters_to_binary(
      "<?xml version=\"1.0\" encoding=\"utf-8\"?><test>你好 Björk</test>"))).

xmerl_scan:string actually takes a list of bytes (in this case the bytes of
the UTF-8 encoded characters), which is not the same as a Unicode string in
Erlang, where each list element is a whole code point.
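
To make the difference concrete, here is a minimal sketch of a wrapper that does the conversion before scanning. The module name xmerl_utf8_demo and both function names are made up for illustration; only unicode:characters_to_binary/1, binary_to_list/1 and xmerl_scan:string/1 come from OTP. It assumes the input is a list of code points (or other valid chardata).

```erlang
-module(xmerl_utf8_demo).
-export([to_utf8_bytes/1, scan_unicode/1]).

%% Hypothetical helper: turn a list of Unicode code points into the
%% list of UTF-8 bytes that encodes it.
%% unicode:characters_to_binary/1 returns a tuple instead of a binary
%% when the input is invalid or incomplete, so that case is raised as
%% an error rather than passed on.
to_utf8_bytes(Chars) ->
    case unicode:characters_to_binary(Chars) of
        Bin when is_binary(Bin) -> binary_to_list(Bin);
        Other -> erlang:error({invalid_unicode, Other})
    end.

%% Hypothetical wrapper: scan an XML document given as a Unicode
%% string whose declaration says encoding="utf-8".
scan_unicode(Xml) ->
    xmerl_scan:string(to_utf8_bytes(Xml)).
```

For example, the two code points of "你好" (20320 and 22909) become six bytes, while plain ASCII is unchanged:

  1> c(xmerl_utf8_demo).
  2> xmerl_utf8_demo:to_utf8_bytes([20320,22909]).
  [228,189,160,229,165,189]
  3> xmerl_utf8_demo:to_utf8_bytes("test").
  "test"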


