[erlang-questions] Fail to parse utf-8 encoded XML

Daniel Abrahamsson daniel.abrahamsson@REDACTED
Fri Mar 27 10:52:46 CET 2015


I'm a bit confused about the behaviour of xmerl_scan when dealing with utf-8
data. In short, the XML parser chokes when it encounters a "ü", but not if
I specify the encoding as "latin1". Other parsers in other languages (e.g.
nokogiri in Ruby) seem to handle this just fine. I've also run the sample
XML through various web validators, and they all say it is valid.

Is this a bug in xmerl or am I missing something obvious?

Example session below:

danabr@REDACTED ~> echo -n "<?xml version=\"1.0\" encoding=\"UTF-8\"?><root>ümlaut</root>" > /tmp/test.xml
danabr@REDACTED ~> erl
Erlang/OTP 17 [erts-6.3] [source] [64-bit] [smp:4:4] [async-threads:10] [hipe] [kernel-poll:false]

Eshell V6.3  (abort with ^G)
1> {ok, S} = file:read_file("/tmp/test.xml").
{ok,<<"<?xml version=\"1.0\"
2> xmerl_scan:string(unicode:characters_to_list(S)).
3414- fatal: {error,{wfc_Legal_Character,{error,{bad_character,252}}}}
** exception exit:
     in function  xmerl_scan:fatal/2 (xmerl_scan.erl, line 4102)
     in call from xmerl_scan:scan_char_data/5 (xmerl_scan.erl, line 2703)
     in call from xmerl_scan:scan_content/11 (xmerl_scan.erl, line 2615)
     in call from xmerl_scan:scan_element/12 (xmerl_scan.erl, line 2128)
     in call from xmerl_scan:scan_document/2 (xmerl_scan.erl, line 570)
     in call from xmerl_scan:string/2 (xmerl_scan.erl, line 286)
3> xmerl_scan:string(unicode:characters_to_list(S), [{encoding, "latin1"}]).
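
For comparison, here is a minimal sketch of what the two possible conversions hand to the parser. It is not part of the session above. It builds the same document as an inline binary instead of reading /tmp/test.xml, and it rests on an assumption: that xmerl_scan:string/1 expects a list of raw bytes and decodes them itself according to the encoding declared in the prolog. If that holds, decoding with unicode:characters_to_list/1 beforehand would turn "ü" into the single code point 252, which is not a valid UTF-8 byte sequence, and that would match the bad_character error.

```erlang
%% UTF-8 bytes of the test document; "ü" is the byte pair 195,188.
Bin = <<"<?xml version=\"1.0\" encoding=\"UTF-8\"?><root>",
        195, 188, "mlaut</root>">>.

%% Decoding first yields code points: "ü" becomes the single integer 252.
CodePoints = unicode:characters_to_list(Bin).
true = lists:member(252, CodePoints).

%% Keeping the raw bytes leaves "ü" as 195,188, so 252 never appears.
Bytes = binary_to_list(Bin).
false = lists:member(252, Bytes).

%% Assumption (see above): xmerl_scan:string/1 wants the raw bytes and
%% applies the declared UTF-8 encoding itself.
{Doc, _Rest} = xmerl_scan:string(Bytes).
```

If the assumption is right, the last call returns a parsed document instead of exiting. That would also suggest why passing {encoding, "latin1"} avoids the error for this input, since 252 is "ü" in Latin-1.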

//Daniel Abrahamsson