[erlang-questions] Fail to parse utf-8 encoded XML

Loïc Hoguin essen@REDACTED
Fri Mar 27 11:50:52 CET 2015


I meant that 252 is not a valid byte value in UTF-8. But you're right, the
issue is not there, sorry.
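
To make the distinction concrete, here is a minimal sketch (the module name
xmerl_utf8_demo is made up for illustration). U+00FC is a valid code point
whose UTF-8 encoding is the two bytes 195,188; a lone byte 252 never occurs in
UTF-8. The sketch also assumes that xmerl_scan:string/1 does its own decoding
based on the XML declaration and therefore wants the raw bytes
(binary_to_list/1) rather than an already-decoded list of code points. That
would be consistent with the bad_character,252 error in the session quoted
below, but treat it as an assumption, not a confirmed diagnosis.

```erlang
-module(xmerl_utf8_demo).
-export([run/0]).

run() ->
    %% U+00FC (252) is one code point but two bytes in UTF-8.
    <<195,188>> = unicode:characters_to_binary([252]),
    [252] = unicode:characters_to_list(<<195,188>>),

    %% Same document as in the quoted session, with "ü" spelled out as bytes.
    Bin = <<"<?xml version=\"1.0\" encoding=\"UTF-8\"?><root>",
            195,188, "mlaut</root>">>,

    %% Assumption: hand xmerl the undecoded bytes and let it decode them
    %% according to the encoding named in the XML declaration.
    {Root, _Rest} = xmerl_scan:string(binary_to_list(Bin)),
    Root.
```

Compile with erlc and call xmerl_utf8_demo:run(). from the shell; the two
pattern matches at the top crash with badmatch if the encoding claims are wrong.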

On 03/27/2015 11:46 AM, Daniel Abrahamsson wrote:
> Hi Loïc,
>
> It is a valid Unicode code point (http://en.wikipedia.org/wiki/%C3%9C).
> If it weren't, I would have expected either the output of line 1 (which says
> it is a utf-8 encoded binary) to show something different, or at least
> the "characters_to_list" call to fail.
>
> //Daniel
>
> On Fri, Mar 27, 2015 at 11:27 AM, Loïc Hoguin <essen@REDACTED> wrote:
>
>     Hello,
>
>     252 (what your "ü" gives you) is not a valid Unicode code point.
>
>     See https://en.wikipedia.org/wiki/UTF-8#Description
>
>     "One-byte codes are used only for the ASCII values 0 through 127."
>
>     Guessing the file you read is actually latin1 and not UTF-8.
>
>
>     On 03/27/2015 10:52 AM, Daniel Abrahamsson wrote:
>
>         Hi,
>
>         I'm a bit confused about the behaviour of xmerl_scan when dealing with
>         utf-8 data. In short, the XML parser chokes when it encounters a "ü",
>         but not if I specify the encoding as "latin1". Other parsers in other
>         languages (e.g. nokogiri in Ruby) seem to handle this just fine. I've
>         also run the sample XML through various web validators, and they all
>         say it is valid.
>
>         Is this a bug in xmerl or am I missing something obvious?
>
>         Example session below:
>
>         danabr@REDACTED ~> echo -n "<?xml version=\"1.0\" encoding=\"UTF-8\"?><root>ümlaut</root>" > /tmp/test.xml
>         danabr@REDACTED ~> erl
>         Erlang/OTP 17 [erts-6.3] [source] [64-bit] [smp:4:4] [async-threads:10] [hipe] [kernel-poll:false]
>
>         Eshell V6.3  (abort with ^G)
>         1> {ok, S} = file:read_file("/tmp/test.xml").
>         {ok,<<"<?xml version=\"1.0\" encoding=\"UTF-8\"?><root>ümlaut</root>"/utf8>>}
>         2> xmerl_scan:string(unicode:characters_to_list(S)).
>         3414- fatal: {error,{wfc_Legal_Character,{error,{bad_character,252}}}}
>         ** exception exit:
>         {fatal,{{error,{wfc_Legal_Character,{error,{bad_character,252}}}},
>                                      {file,file_name_unknown},
>                                      {line,1},
>                                      {col,47}}}
>                in function  xmerl_scan:fatal/2 (xmerl_scan.erl, line 4102)
>                in call from xmerl_scan:scan_char_data/5 (xmerl_scan.erl, line 2703)
>                in call from xmerl_scan:scan_content/11 (xmerl_scan.erl, line 2615)
>                in call from xmerl_scan:scan_element/12 (xmerl_scan.erl, line 2128)
>                in call from xmerl_scan:scan_document/2 (xmerl_scan.erl, line 570)
>                in call from xmerl_scan:string/2 (xmerl_scan.erl, line 286)
>         3> xmerl_scan:string(unicode:characters_to_list(S), [{encoding, "latin1"}]).
>         {{xmlElement,root,root,[],
>                        {xmlNamespace,[],[]},
>                        [],1,[],
>                        [{xmlText,[{root,1}],1,[],"ümlaut",text}],
>                        [],"/home/danabr",undeclared},
>            []}
>
>         //Daniel Abrahamsson
>
>
>
>     --
>     Loïc Hoguin
>     http://ninenines.eu
>
>

-- 
Loïc Hoguin
http://ninenines.eu


