[erlang-questions] [erlang-bugs] xmerl and unicode data

Ali Sabil ali.sabil@REDACTED
Sat Oct 20 16:13:07 CEST 2012


On Fri, Oct 19, 2012 at 5:18 PM, Patrik Nyblom <pan@REDACTED> wrote:
> On 10/19/2012 05:01 PM, Anthony Ramine wrote:
>>
>> On 19 Oct 2012, at 15:58, Ali Sabil wrote:
>>
>>> Hi all,
>>>
>>> I was wondering whether anyone has come across the following behaviour?
>>>
>>>
>>> Erlang R15B02 (erts-5.9.2) [source] [64-bit] [smp:4:4]
>>> [async-threads:0] [hipe] [kernel-poll:false] [dtrace]
>>>
>>> Eshell V5.9.2  (abort with ^G)
>>> 1>  xmerl_scan:string("<?xml version=\"1.0\"
>>> encoding=\"utf-8\"?><test>你好 Björk</test>").
>>> 3414- fatal: {error,{wfc_Legal_Character,{error,{bad_character,20320}}}}
>>> ** exception exit:
>>> {fatal,{{error,{wfc_Legal_Character,{error,{bad_character,20320}}}},
>>>                            {file,file_name_unknown},
>>>                            {line,1},
>>>                            {col,47}}}
>>>      in function  xmerl_scan:fatal/2 (xmerl_scan.erl, line 4102)
>>>      in call from xmerl_scan:scan_char_data/5 (xmerl_scan.erl, line 2703)
>>>      in call from xmerl_scan:scan_content/11 (xmerl_scan.erl, line 2615)
>>>      in call from xmerl_scan:scan_element/12 (xmerl_scan.erl, line 2128)
>>>      in call from xmerl_scan:scan_document/2 (xmerl_scan.erl, line 570)
>>>      in call from xmerl_scan:string/2 (xmerl_scan.erl, line 286)
>>> 2>
>>> 2>  xmerl_scan:string("<?xml version=\"1.0\"
>>> encoding=\"utf-8\"?><test>你好 Björk</test>", [{encoding, latin1}]).
>>> {{xmlElement,test,test,[],
>>>              {xmlNamespace,[],[]},
>>>              [],1,[],
>>>              [{xmlText,[{test,1}],
>>>                        1,[],
>>>                        [20320,22909,32,66,106,246,114,107],
>>>                        text}],
>>>              [],"/Users/asabil/test",
>>>              undeclared},
>>> []}
>>> 3>
>>> 3>  io:getopts().
>>> [{expand_fun,#Fun<group.0.129081181>},
>>> {echo,true},
>>> {binary,false},
>>> {encoding,unicode}]
>>>
>>>
>>> Thanks,
>>> Ali
>>
>> Hi,
>>
>> From my vague memories of xmerl's innards, I'm pretty sure it happens
>> because xmerl_scan:string expects a list of bytes and does not check
>> whether a given byte is valid latin1.
>>
>> Regards,
>>
> That's right. Or rather: the XML declaration says the data is UTF-8 encoded,
> but what you pass in is a Unicode string (a list of Unicode code points).
> Converting it to a list of UTF-8 bytes would do the trick:
>
>  xmerl_scan:string(binary_to_list(unicode:characters_to_binary("<?xml
> version=\"1.0\" encoding=\"utf-8\"?><test>你好 Björk</test>"))).
>
> xmerl_scan:string actually takes a list of bytes (in this case UTF-8 encoded
> characters), which is not the same as a Unicode string in Erlang...

Thank you very much. However, I find it odd that the output is then a
Unicode string rather than a list of UTF-8 encoded bytes:

4>  xmerl_scan:string(binary_to_list(unicode:characters_to_binary("<?xml
version=\"1.0\" encoding=\"utf-8\"?><test>你好 Björk</test>"))).
{{xmlElement,test,test,[],
             {xmlNamespace,[],[]},
             [],1,[],
             [{xmlText,[{test,1}],
                       1,[],
                       [20320,22909,32,66,106,246,114,107],
                       text}],
             [],"/Users/asabil/test",
             undeclared},
 []}

Unless I am mistaken, the output is
[20320,22909,32,66,106,246,114,107], which is a list of code points, i.e.
a Unicode string.
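
To make the difference between the two representations concrete, here is a
minimal sketch using the standard unicode module. The module name
unicode_demo and its function names are made up for illustration and are not
part of xmerl. The code points are the ones from the xmlText value above,
written as integers so the example does not depend on the source file's
encoding.

```erlang
-module(unicode_demo).
-export([to_utf8_bytes/1, from_utf8_bytes/1, demo/0]).

%% Encode a list of Unicode code points as a list of UTF-8 bytes.
%% This is the same conversion as in the suggested workaround.
to_utf8_bytes(CodePoints) ->
    binary_to_list(unicode:characters_to_binary(CodePoints)).

%% Decode a list of UTF-8 bytes back into a list of code points.
from_utf8_bytes(Bytes) ->
    unicode:characters_to_list(list_to_binary(Bytes), utf8).

demo() ->
    %% Code points copied from the xmlText value in the shell output.
    CodePoints = [20320,22909,32,66,106,246,114,107],
    Bytes = to_utf8_bytes(CodePoints),
    %% Every element of Bytes fits in one byte; 20320 and 22909 do not.
    true = lists:all(fun(B) -> B < 256 end, Bytes),
    %% Decoding the bytes gives back the original code points.
    CodePoints = from_utf8_bytes(Bytes),
    {CodePoints, Bytes}.
```

Calling unicode_demo:demo() returns both lists side by side: 8 code points
against 13 bytes, because each of the two Chinese characters takes three
bytes in UTF-8 and the ö takes two.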

Thanks again,
Ali


