[erlang-questions] Fail to parse utf-8 encoded XML

Fri Mar 27 11:46:34 CET 2015

Hi Loïc,

It is a valid Unicode code point (http://en.wikipedia.org/wiki/%C3%9C). If
it weren't, I would have expected either the output of line 1 (which says it is
a UTF-8 encoded binary) to show something different, or at least the
"characters_to_list" call to fail.
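
To make the code point versus encoded byte distinction concrete, here is a minimal sketch. The module name umlaut_demo and its function names are made up for illustration. The scan_bytes/1 helper rests on an assumption that is not confirmed anywhere in this thread: that xmerl_scan:string/1 decodes its input itself according to the declared encoding, and therefore wants the raw bytes of the file rather than a list of already-decoded code points.

```erlang
-module(umlaut_demo).
-export([code_point/0, utf8_bytes/0, scan_bytes/1]).

%% Decoding the UTF-8 byte pair 16#C3 16#BC gives a single code point,
%% U+00FC ("ü"), which is 252 in decimal.
code_point() ->
    [CodePoint] = unicode:characters_to_list(<<195,188>>, utf8),
    CodePoint.

%% Going the other way: code point 252 is encoded in UTF-8 as two bytes.
%% The single byte 252 on its own is not a valid UTF-8 sequence.
utf8_bytes() ->
    unicode:characters_to_binary([252], unicode, utf8).

%% Assumption (not confirmed in this thread): xmerl decodes the input
%% itself, so pass the file contents as a list of bytes and do not
%% convert them to code points first.
scan_bytes(Bin) when is_binary(Bin) ->
    {Doc, _Rest} = xmerl_scan:string(binary_to_list(Bin)),
    Doc.
```

If that assumption holds, calling umlaut_demo:scan_bytes(S) with the binary returned by file:read_file/1 would be the variant to try in place of the characters_to_list call.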

//Daniel

On Fri, Mar 27, 2015 at 11:27 AM, Loïc Hoguin <essen@REDACTED> wrote:

> Hello,
>
> 252 (what your "ü" gives you) is not a valid Unicode code point.
>
> See https://en.wikipedia.org/wiki/UTF-8#Description
>
> "One-byte codes are used only for the ASCII values 0 through 127."
>
> Guessing the file you read is actually latin1 and not UTF-8.
>
>
> On 03/27/2015 10:52 AM, Daniel Abrahamsson wrote:
>
>> Hi,
>>
>> I'm a bit confused about the behaviour of xmerl_scan when dealing with
>> utf-8 data. In short, the XML parser chokes when it encounters a "ü",
>> but not if I specify the encoding as "latin1". Other parsers in other
>> languages (e.g. nokogiri in Ruby) seem to handle this just fine. I've
>> also run the sample XML through various web validators, and they all say
>> it is valid.
>>
>> Is this a bug in xmerl or am I missing something obvious?
>>
>> Example session below:
>>
>> danabr@REDACTED ~> echo -n "<?xml version=\"1.0\"
>> encoding=\"UTF-8\"?><root>ümlaut</root>" > /tmp/test.xml
>> danabr@REDACTED ~> erl
>> Erlang/OTP 17 [erts-6.3] [source] [64-bit] [smp:4:4] [async-threads:10]
>> [hipe] [kernel-poll:false]
>>
>> Eshell V6.3  (abort with ^G)
>> 1> {ok, S} = file:read_file("/tmp/test.xml").
>> {ok,<<"<?xml version=\"1.0\"
>> encoding=\"UTF-8\"?><root>ümlaut</root>"/utf8>>}
>> 2> xmerl_scan:string(unicode:characters_to_list(S)).
>> 3414- fatal: {error,{wfc_Legal_Character,{error,{bad_character,252}}}}
>> ** exception exit:
>> {fatal,{{error,{wfc_Legal_Character,{error,{bad_character,252}}}},
>>                             {file,file_name_unknown},
>>                             {line,1},
>>                             {col,47}}}
>>       in function  xmerl_scan:fatal/2 (xmerl_scan.erl, line 4102)
>>       in call from xmerl_scan:scan_char_data/5 (xmerl_scan.erl, line 2703)
>>       in call from xmerl_scan:scan_content/11 (xmerl_scan.erl, line 2615)
>>       in call from xmerl_scan:scan_element/12 (xmerl_scan.erl, line 2128)
>>       in call from xmerl_scan:scan_document/2 (xmerl_scan.erl, line 570)
>>       in call from xmerl_scan:string/2 (xmerl_scan.erl, line 286)
>> 3> xmerl_scan:string(unicode:characters_to_list(S), [{encoding,
>> "latin1"}]).
>> {{xmlElement,root,root,[],
>>               {xmlNamespace,[],[]},
>>               [],1,[],
>>               [{xmlText,[{root,1}],1,[],"ümlaut",text}],
>>               [],"/home/danabr",undeclared},
>>   []}
>>
>> //Daniel Abrahamsson
>>
> --
> Loïc Hoguin
> http://ninenines.eu
>