Handling UTF-8 data when parsing XML using xmerl
Seth Falcon
seth@REDACTED
Wed Aug 19 18:16:36 CEST 2009
Hi all,
I'm using xmerl to parse Atom feed data and have encountered some
surprising behavior with respect to how UTF-8 encoded data is handled.
The problem I started to solve is as follows:
Consider this XML:
<entry>
<content type='xhtml'>
<a href='/blah'>blah</a>
</content>
</entry>
The goal is to extract the contents of the <content> node as a
single string. So parse_content(Xml) should return "<a
href='/blah'>blah</a>".
The approach I took was to use xmerl to parse the entire document, and
then use xmerl:export_simple/2 on the children of <content> to
recapture the text. But in testing with UTF-8 data, I'm finding that
while xmerl will parse UTF-8 data, xmerl_scan:string/1 cannot re-parse
the output that xmerl:export_simple/2 produces from the parsed document.
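To make the approach concrete, here is a minimal sketch of what I mean by
parse_content/1. The module name and the XPath lookup are only for
illustration, and the sketch swaps in xmerl:export_simple_content/2 in
place of xmerl:export_simple/2 so that no XML prolog is prepended to the
result. As far as I can tell the exporter writes attributes with double
quotes, so the output will not match the input byte for byte.

```erlang
-module(content_sketch).
-export([parse_content/1]).

%% Record definitions (#xmlElement{} and friends) ship with xmerl.
-include_lib("xmerl/include/xmerl.hrl").

%% Hypothetical helper: return the markup inside the first <content>
%% element as a flat list of characters.
parse_content(Xml) ->
    %% Parse the whole document.
    {Doc, _Rest} = xmerl_scan:string(Xml),
    %% Take the children of the first <content> element found.
    [#xmlElement{content = Kids} | _] = xmerl_xpath:string("//content", Doc),
    %% Serialize only those children (swapped in for export_simple/2,
    %% which would also emit an XML prolog).
    lists:flatten(xmerl:export_simple_content(Kids, xmerl_xml)).
```

Any whitespace-only text nodes between <content> and its children are
exported as well, so the caller may want to trim the result.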
Here's an example of what I'm seeing:
First, here's the contents of file simple.xml (pasting the UTF-8
below, crossing fingers that it comes across in email). The body of
the title tag can be reproduced in an Erlang session as:
HiThere = [72,105,32,8230,32,116,104,101,114,101].
%% simple.xml:
<?xml version="1.0" encoding="UTF-8"?>
<title>Hi … there</title>
%% now here's what I see:
2> {Xml, _} = xmerl_scan:file("simple.xml").
{{xmlElement,title,title,[],
{xmlNamespace,[],[]},
[],1,[],
[{xmlText,[{title,1}],
1,[],
[72,105,32,8230,32,116,104,101,114,101],
text}],
[],".",undeclared},
[]}
3> Exported = lists:flatten(xmerl:export_simple([Xml], xmerl_xml)).
[60,63,120,109,108,32,118,101,114,115,105,111,110,61,34,49,
46,48,34,63,62,60,116,105,116,108,101,62,72|...]
4> xmerl_scan:string(Exported).
3265- fatal: {error,{wfc_Legal_Character,{error,{bad_character,8230}}}}
** exception exit: {fatal,
{{error,{wfc_Legal_Character,{error,{bad_character,8230}}}},
{file,file_name_unknown},
{line,1},
{col,34}}}
in function xmerl_scan:fatal/2
in call from xmerl_scan:scan_char_data/5
in call from xmerl_scan:scan_content/11
in call from xmerl_scan:scan_element/12
in call from xmerl_scan:scan_document/2
in call from xmerl_scan:string/2
%% If I make the following transformation, things work again:
5> xmerl_scan:string(binary_to_list(unicode:characters_to_binary(Exported))).
{{xmlElement,title,title,[],
{xmlNamespace,[],[]},
[],1,[],
[{xmlText,[{title,1}],
1,[],
[72,105,32,8230,32,116,104,101,114,101],
text}],
[],"/opt/seth/EVRI/sg/GIT/zgst/rods",undeclared},
[]}
%% and strangely, given that I think I do have valid UTF-8, this also
%% works:
6> xmerl_scan:string(Exported, [{encoding, latin1}]).
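For reference, the transformation from step 5 can be wrapped up as a small
helper. This is only a sketch: the module and function names are made up,
and it assumes (my reading of the session above, not something I have
confirmed in the xmerl source) that xmerl_scan:string/1 wants a list of
bytes in the document's encoding, whereas xmerl:export_simple/2 hands back
a list of Unicode code points.

```erlang
-module(xml_roundtrip).
-export([reparse/1]).

%% Hypothetical helper: re-parse the output of xmerl:export_simple/2.
%% Exported is a (possibly deep) list of Unicode code points.
reparse(Exported) ->
    %% Encode the code points as UTF-8; a code point such as 8230
    %% becomes a multi-byte sequence.
    Utf8 = unicode:characters_to_binary(Exported),
    %% Hand the scanner a plain list of bytes, as in step 5 above.
    xmerl_scan:string(binary_to_list(Utf8)).
```

Calling xml_roundtrip:reparse(Exported) in the session above should be
equivalent to the expression in step 5.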
Questions:
* Is this the expected behavior?
* Suggestions for a better way of doing the parsing or handling the
UTF-8 strings?
Thanks,
+ seth