[erlang-questions] Fail to parse utf-8 encoded XML

Loïc Hoguin essen@REDACTED
Fri Mar 27 11:27:25 CET 2015


Hello,

252 (what your "ü" gives you) is not a valid one-byte code in UTF-8.

See https://en.wikipedia.org/wiki/UTF-8#Description

"One-byte codes are used only for the ASCII values 0 through 127."
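
To illustrate the quoted rule, here is a small sketch for the Erlang shell (not part of the original session; the variable name Utf8 is just for illustration). It shows that the code point 252 takes two bytes in UTF-8, and that a lone byte 252 does not decode as UTF-8:

```erlang
%% The code point for "ü" is 252 (16#FC).
Utf8 = unicode:characters_to_binary([252]),

%% In UTF-8 it is encoded as two bytes, not one.
<<195,188>> = Utf8,

%% Decoding those two bytes gives back the single code point.
[252] = unicode:characters_to_list(Utf8),

%% A lone byte 252 is not valid UTF-8: the result is a tuple
%% describing the failure rather than a plain list.
false = is_list(unicode:characters_to_list(<<252>>, utf8)).
```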

My guess is that the file you read is actually latin1 and not UTF-8.
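
One way to check that guess is sketched below, with the document from your session written out as explicit bytes (195,188 being the UTF-8 encoding of "ü"). The names Bin, IsUtf8 and Root are only for illustration. The last step rests on an assumption: that xmerl_scan decodes its input itself according to the declared encoding, and so wants the undecoded bytes rather than a list of code points.

```erlang
%% Same document as in the session, with "ü" given as its UTF-8 bytes.
Bin = <<"<?xml version=\"1.0\" encoding=\"UTF-8\"?><root>",
        195,188, "mlaut</root>">>,

%% A plain list means the bytes are valid UTF-8; an {error,...} or
%% {incomplete,...} tuple means they are not (e.g. latin1 data).
IsUtf8 = is_list(unicode:characters_to_list(Bin, utf8)),

%% Assumption: xmerl_scan applies the declared encoding itself, so pass
%% the raw bytes instead of the result of unicode:characters_to_list/1.
{Root, _Rest} = xmerl_scan:string(binary_to_list(Bin)),
xmlElement = element(1, Root).
```

If IsUtf8 comes out false for your file, the data is not UTF-8 and the declared encoding in the prolog is wrong.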

On 03/27/2015 10:52 AM, Daniel Abrahamsson wrote:
> Hi,
>
> I'm a bit confused about the behaviour of xmerl_scan when dealing with
> utf-8 data. In short, the XML parser chokes when it encounters a "ü",
> but not if I specify the encoding as "latin1". Other parsers in other
> languages (e.g. nokogiri in Ruby) seem to handle this just fine. I've
> also run the sample XML through various web validators, and they all say
> it is valid.
>
> Is this a bug in xmerl or am I missing something obvious?
>
> Example session below:
>
> danabr@REDACTED ~> echo -n "<?xml version=\"1.0\"
> encoding=\"UTF-8\"?><root>ümlaut</root>" > /tmp/test.xml
> danabr@REDACTED ~> erl
> Erlang/OTP 17 [erts-6.3] [source] [64-bit] [smp:4:4] [async-threads:10]
> [hipe] [kernel-poll:false]
>
> Eshell V6.3  (abort with ^G)
> 1> {ok, S} = file:read_file("/tmp/test.xml").
> {ok,<<"<?xml version=\"1.0\"
> encoding=\"UTF-8\"?><root>ümlaut</root>"/utf8>>}
> 2> xmerl_scan:string(unicode:characters_to_list(S)).
> 3414- fatal: {error,{wfc_Legal_Character,{error,{bad_character,252}}}}
> ** exception exit:
> {fatal,{{error,{wfc_Legal_Character,{error,{bad_character,252}}}},
>                             {file,file_name_unknown},
>                             {line,1},
>                             {col,47}}}
>       in function  xmerl_scan:fatal/2 (xmerl_scan.erl, line 4102)
>       in call from xmerl_scan:scan_char_data/5 (xmerl_scan.erl, line 2703)
>       in call from xmerl_scan:scan_content/11 (xmerl_scan.erl, line 2615)
>       in call from xmerl_scan:scan_element/12 (xmerl_scan.erl, line 2128)
>       in call from xmerl_scan:scan_document/2 (xmerl_scan.erl, line 570)
>       in call from xmerl_scan:string/2 (xmerl_scan.erl, line 286)
> 3> xmerl_scan:string(unicode:characters_to_list(S), [{encoding, "latin1"}]).
> {{xmlElement,root,root,[],
>               {xmlNamespace,[],[]},
>               [],1,[],
>               [{xmlText,[{root,1}],1,[],"ümlaut",text}],
>               [],"/home/danabr",undeclared},
>   []}
>
> //Daniel Abrahamsson

-- 
Loïc Hoguin
http://ninenines.eu
