[erlang-questions] Fail to parse utf-8 encoded XML
Loïc Hoguin
essen@REDACTED
Fri Mar 27 11:50:52 CET 2015
I meant that 252 is not a valid single-byte value in UTF-8. But you're
right, the issue is not there, sorry.
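
To make the distinction between code points and UTF-8 bytes concrete, here is a minimal Erlang sketch. The module name umlaut_demo is hypothetical; only calls from the standard unicode module are used. It shows that code point 252 (U+00FC, "ü") is valid Unicode and takes two bytes in UTF-8, whereas a lone byte 252 does not decode as UTF-8:

```erlang
-module(umlaut_demo).
-export([run/0]).

%% Hypothetical demo module: contrasts the code point 252 with the byte 252.
run() ->
    CodePoint = 16#FC,                                  % U+00FC, "ü"
    %% Encoding the code point as UTF-8 yields two bytes, 0xC3 0xBC.
    Utf8 = unicode:characters_to_binary([CodePoint]),
    <<16#C3, 16#BC>> = Utf8,
    %% Decoding those two bytes gives back the single code point 252.
    [CodePoint] = unicode:characters_to_list(Utf8),
    %% A lone byte 252 is not a valid UTF-8 sequence, so decoding it
    %% returns an error tuple instead of a list.
    Lone = unicode:characters_to_list(<<CodePoint>>, utf8),
    {Utf8, Lone}.
```

An assumption worth testing, not something confirmed in this thread: the bad_character,252 error is consistent with xmerl_scan decoding UTF-8 itself when the declaration says UTF-8. In that case it would want the raw bytes, e.g. binary_to_list(S), rather than the already-decoded list of code points.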
On 03/27/2015 11:46 AM, Daniel Abrahamsson wrote:
> Hi Loïc,
>
> It is a valid Unicode code point (http://en.wikipedia.org/wiki/%C3%9C).
> If it weren't, I would have expected either the output of line 1 (which says
> it is a UTF-8 encoded binary) to show something different, or at least
> the "characters_to_list" call to fail.
>
> //Daniel
>
> On Fri, Mar 27, 2015 at 11:27 AM, Loïc Hoguin <essen@REDACTED> wrote:
>
> Hello,
>
> 252 (what your "ü" gives you) is not a valid Unicode code point.
>
> See https://en.wikipedia.org/wiki/UTF-8#Description
>
> "One-byte codes are used only for the ASCII values 0 through 127."
>
> Guessing the file you read is actually latin1 and not UTF-8.
>
>
> On 03/27/2015 10:52 AM, Daniel Abrahamsson wrote:
>
> Hi,
>
> I'm a bit confused about the behaviour of xmerl_scan when dealing with
> UTF-8 data. In short, the XML parser chokes when it encounters a "ü",
> but not if I specify the encoding as "latin1". Other parsers in other
> languages (e.g. nokogiri in Ruby) seem to handle this just fine. I've
> also run the sample XML through various web validators, and they all
> say it is valid.
>
> Is this a bug in xmerl or am I missing something obvious?
>
> Example session below:
>
> danabr@REDACTED ~> echo -n "<?xml version=\"1.0\" encoding=\"UTF-8\"?><root>ümlaut</root>" > /tmp/test.xml
> danabr@REDACTED ~> erl
> Erlang/OTP 17 [erts-6.3] [source] [64-bit] [smp:4:4] [async-threads:10] [hipe] [kernel-poll:false]
>
> Eshell V6.3 (abort with ^G)
> 1> {ok, S} = file:read_file("/tmp/test.xml").
> {ok,<<"<?xml version=\"1.0\" encoding=\"UTF-8\"?><root>ümlaut</root>"/utf8>>}
> 2> xmerl_scan:string(unicode:characters_to_list(S)).
> 3414- fatal: {error,{wfc_Legal_Character,{error,{bad_character,252}}}}
> ** exception exit: {fatal,{{error,{wfc_Legal_Character,{error,{bad_character,252}}}},
> {file,file_name_unknown},
> {line,1},
> {col,47}}}
> in function xmerl_scan:fatal/2 (xmerl_scan.erl, line 4102)
> in call from xmerl_scan:scan_char_data/5 (xmerl_scan.erl, line 2703)
> in call from xmerl_scan:scan_content/11 (xmerl_scan.erl, line 2615)
> in call from xmerl_scan:scan_element/12 (xmerl_scan.erl, line 2128)
> in call from xmerl_scan:scan_document/2 (xmerl_scan.erl, line 570)
> in call from xmerl_scan:string/2 (xmerl_scan.erl, line 286)
> 3> xmerl_scan:string(unicode:characters_to_list(S), [{encoding, "latin1"}]).
> {{xmlElement,root,root,[],
> {xmlNamespace,[],[]},
> [],1,[],
> [{xmlText,[{root,1}],1,[],"ümlaut",text}],
> [],"/home/danabr",undeclared},
> []}
>
> //Daniel Abrahamsson
>
>
>
> --
> Loïc Hoguin
> http://ninenines.eu
>
>
--
Loïc Hoguin
http://ninenines.eu