[erlang-questions] [erlang-bugs] xmerl and unicode data

Patrik Nyblom pan@REDACTED
Mon Oct 22 16:27:46 CEST 2012


On 10/20/2012 04:13 PM, Ali Sabil wrote:
> On Fri, Oct 19, 2012 at 5:18 PM, Patrik Nyblom<pan@REDACTED>  wrote:
>> On 10/19/2012 05:01 PM, Anthony Ramine wrote:
>>> On 19 Oct 2012 at 15:58, Ali Sabil wrote:
>>>
>>>> Hi all,
>>>>
>>>> I was wondering if anyone came across the following behaviour?
>>>>
>>>>
>>>> Erlang R15B02 (erts-5.9.2) [source] [64-bit] [smp:4:4]
>>>> [async-threads:0] [hipe] [kernel-poll:false] [dtrace]
>>>>
>>>> Eshell V5.9.2  (abort with ^G)
>>>> 1>   xmerl_scan:string("<?xml version=\"1.0\" encoding=\"utf-8\"?><test>你好 Björk</test>").
>>>> 3414- fatal: {error,{wfc_Legal_Character,{error,{bad_character,20320}}}}
>>>> ** exception exit:
>>>> {fatal,{{error,{wfc_Legal_Character,{error,{bad_character,20320}}}},
>>>>                             {file,file_name_unknown},
>>>>                             {line,1},
>>>>                             {col,47}}}
>>>>       in function  xmerl_scan:fatal/2 (xmerl_scan.erl, line 4102)
>>>>       in call from xmerl_scan:scan_char_data/5 (xmerl_scan.erl, line 2703)
>>>>       in call from xmerl_scan:scan_content/11 (xmerl_scan.erl, line 2615)
>>>>       in call from xmerl_scan:scan_element/12 (xmerl_scan.erl, line 2128)
>>>>       in call from xmerl_scan:scan_document/2 (xmerl_scan.erl, line 570)
>>>>       in call from xmerl_scan:string/2 (xmerl_scan.erl, line 286)
>>>> 2>
>>>> 2>   xmerl_scan:string("<?xml version=\"1.0\" encoding=\"utf-8\"?><test>你好 Björk</test>", [{encoding, latin1}]).
>>>> {{xmlElement,test,test,[],
>>>>               {xmlNamespace,[],[]},
>>>>               [],1,[],
>>>>               [{xmlText,[{test,1}],
>>>>                         1,[],
>>>>                         [20320,22909,32,66,106,246,114,107],
>>>>                         text}],
>>>>               [],"/Users/asabil/test",
>>>>               undeclared},
>>>> []}
>>>> 3>
>>>> 3>   io:getopts().
>>>> [{expand_fun,#Fun<group.0.129081181>},
>>>> {echo,true},
>>>> {binary,false},
>>>> {encoding,unicode}]
>>>>
>>>>
>>>> Thanks,
>>>> Ali
>>> Hi,
>>>
>>> From my vague recollection of xmerl's internals, I'm pretty sure this
>>> happens because xmerl_scan:string expects a list of bytes and does not
>>> check whether a given byte is valid latin1.
>>>
>>> Regards,
>>>
>> That's right. Or rather, the XML declaration says the data is UTF-8
>> encoded, but what you pass in is a Unicode string (a list of Unicode code
>> points). Converting it to a list of UTF-8 bytes would do the trick:
>>
>>   xmerl_scan:string(binary_to_list(unicode:characters_to_binary("<?xml version=\"1.0\" encoding=\"utf-8\"?><test>你好 Björk</test>"))).
>>
>> xmerl_scan:string actually takes a list of bytes (in this case UTF-8
>> encoded characters), which is not the same as a Unicode string in Erlang...
> Thank you very much. However, I find it weird that the output is then a
> Unicode string instead of a list of UTF-8 encoded bytes:
Me too, it's really weird. Or rather, the input format is weird: lists of
UTF-8 encoded bytes are usually considered "broken". I would wrap this in a
function taking a proper Unicode list as a parameter...
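
A minimal sketch of such a wrapper, under some assumptions: the module name
xmerl_unicode and the function scan_string/1,2 are hypothetical (they are not
part of xmerl), and the document is assumed to declare encoding="utf-8" or no
encoding at all. It only uses unicode:characters_to_binary/1 and
xmerl_scan:string/2:

```erlang
%% Hypothetical helper module, not part of xmerl.
-module(xmerl_unicode).
-export([scan_string/1, scan_string/2]).

%% Scan an XML document given as a Unicode string (a list of code points).
scan_string(Chars) ->
    scan_string(Chars, []).

%% Options are passed through unchanged to xmerl_scan:string/2.
scan_string(Chars, Options) ->
    case unicode:characters_to_binary(Chars) of
        Utf8 when is_binary(Utf8) ->
            %% xmerl_scan:string/2 wants a list of bytes, so turn the
            %% UTF-8 binary back into a list of integers in 0..255.
            xmerl_scan:string(binary_to_list(Utf8), Options);
        {error, _Converted, _Rest} = Error ->
            %% The input contained something that is not a valid code point.
            erlang:error({bad_unicode, Error});
        {incomplete, _Converted, _Rest} = Error ->
            %% The input ended in the middle of a character.
            erlang:error({bad_unicode, Error})
    end.
```

With that in place the caller never sees the byte list. The text below is
built from the code points in the output you posted, so the example does not
depend on the encoding of the source file:

```erlang
Text = [20320,22909,32,66,106,246,114,107],
Xml = "<?xml version=\"1.0\" encoding=\"utf-8\"?><test>" ++ Text ++ "</test>",
{Doc, _Rest} = xmerl_unicode:scan_string(Xml).
```",
{Doc, _Rest} = xmerl_unicode:scan_string(Xml),
xmlElement = element(1, Doc),
test = element(2, Doc).
</test>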

>
> 4>   xmerl_scan:string(binary_to_list(unicode:characters_to_binary("<?xml version=\"1.0\" encoding=\"utf-8\"?><test>你好 Björk</test>"))).
> {{xmlElement,test,test,[],
>               {xmlNamespace,[],[]},
>               [],1,[],
>               [{xmlText,[{test,1}],
>                         1,[],
>                         [20320,22909,32,66,106,246,114,107],
>                         text}],
>               [],"/Users/asabil/test",
>               undeclared},
>   []}
>
> Unless I am mistaken, the output is
> [20320,22909,32,66,106,246,114,107], which is a list of code points, i.e.
> a Unicode string.
>
> Thanks again,
> Ali
Cheers,
/Patrik


