Handling UTF-8 data when parsing XML using xmerl

Seth Falcon seth@REDACTED
Wed Aug 19 18:16:36 CEST 2009


Hi all,

I'm using xmerl to parse Atom feed data and have encountered some
surprising behavior with respect to how UTF-8 encoded data is handled.

The problem I started to solve is as follows:

   Consider this XML:

      <entry>
        <content type='xhtml'>
          <a href='/blah'>blah</a>
        </content>
      </entry>

   The goal is to extract the contents of the <content> node as a
   single string.  So parse_content(Xml) should return "<a
   href='/blah'>blah</a>".

The approach I took was to use xmerl to parse the entire document, and
then use xmerl:export_simple/2 on the children of <content> to
recapture the text.  But in testing with UTF-8 data, I'm finding that
while xmerl_scan will parse UTF-8 data, it cannot later re-parse the
string that xmerl:export_simple/2 produces from the parsed result.
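
For concreteness, the approach described above might look roughly like
the sketch below.  This is not the original code: the module name
content_sketch, the XPath expression "//content", and the choice of
xmerl:export_simple_content/2 (used here instead of export_simple/2 so
that no XML prolog is emitted) are all assumptions.  Because xmerl
re-serializes the parsed nodes, the result need not be byte-identical
to the input (surrounding whitespace text nodes are kept, and attribute
quoting may differ).

```erlang
-module(content_sketch).
-export([parse_content/1]).

%% Record definitions for #xmlElement{} and friends.
-include_lib("xmerl/include/xmerl.hrl").

%% Hypothetical helper: return the children of the first <content>
%% element, re-serialized as a flat string.
parse_content(XmlString) ->
    {Doc, _Rest} = xmerl_scan:string(XmlString),
    %% Assumes the document has at least one <content> element.
    [#xmlElement{content = Children} | _] =
        xmerl_xpath:string("//content", Doc),
    lists:flatten(xmerl:export_simple_content(Children, xmerl_xml)).
```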

Here's an example of what I'm seeing:

First, here are the contents of the file simple.xml (pasting the UTF-8
below, crossing fingers that it comes across in email).  The body of
the title tag can be reproduced in an Erlang session as:

  HiThere = [72,105,32,8230,32,116,104,101,114,101].


%% simple.xml:
<?xml version="1.0" encoding="UTF-8"?>
<title>Hi … there</title>
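
In case the non-ASCII character does not survive email, the file can
also be regenerated from the code-point list.  This is only a sketch
for reproducing the input; the variable name Doc is an assumption, and
the file is written to the current directory.

```erlang
%% Code points for the title body, including 8230 (U+2026, ellipsis).
HiThere = [72,105,32,8230,32,116,104,101,114,101].

%% Build the document as chardata and write it out as UTF-8 bytes.
Doc = ["<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n",
       "<title>", HiThere, "</title>\n"].
ok = file:write_file("simple.xml", unicode:characters_to_binary(Doc)).
```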

%% now here's what I see:

2> {Xml, _} = xmerl_scan:file("simple.xml").
{{xmlElement,title,title,[],
             {xmlNamespace,[],[]},
             [],1,[],
             [{xmlText,[{title,1}],
                       1,[],
                       [72,105,32,8230,32,116,104,101,114,101],
                       text}],
             [],".",undeclared},
 []}

3> Exported = lists:flatten(xmerl:export_simple([Xml], xmerl_xml)).
[60,63,120,109,108,32,118,101,114,115,105,111,110,61,34,49,
 46,48,34,63,62,60,116,105,116,108,101,62,72|...]

4> xmerl_scan:string(Exported).
3265- fatal: {error,{wfc_Legal_Character,{error,{bad_character,8230}}}}
** exception exit: {fatal,
                       {{error,{wfc_Legal_Character,{error,{bad_character,8230}}}},
                        {file,file_name_unknown},
                        {line,1},
                        {col,34}}}
     in function  xmerl_scan:fatal/2
     in call from xmerl_scan:scan_char_data/5
     in call from xmerl_scan:scan_content/11
     in call from xmerl_scan:scan_element/12
     in call from xmerl_scan:scan_document/2
     in call from xmerl_scan:string/2


%% If I make the following transformation, things work again:

5> xmerl_scan:string(binary_to_list(unicode:characters_to_binary(Exported))).
{{xmlElement,title,title,[],
             {xmlNamespace,[],[]},
             [],1,[],
             [{xmlText,[{title,1}],
                       1,[],
                       [72,105,32,8230,32,116,104,101,114,101],
                       text}],
             [],"/opt/seth/EVRI/sg/GIT/zgst/rods",undeclared},
 []}
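
%% The transformation in step 5 turns the list of code points into a
%% list of UTF-8 bytes.  One possible reading of the failure in step 4
%% is that the scanner expects bytes and 8230 is not a byte; that
%% interpretation is an assumption, not something confirmed here.  The
%% name Reparse below is hypothetical and not part of xmerl.

```erlang
%% Code point 8230 (U+2026) encodes to three bytes in UTF-8.
<<226,128,166>> = unicode:characters_to_binary([8230]).

%% Hypothetical helper: encode a code-point list as UTF-8 bytes
%% before handing it back to the scanner.
Reparse = fun(Chars) ->
    Utf8Bytes = binary_to_list(unicode:characters_to_binary(Chars)),
    xmerl_scan:string(Utf8Bytes)
end.
```

%% With Exported bound as in step 3, this would be called as
%% Reparse(Exported).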


%% and strangely, given that I think I do have valid UTF-8, this also
%% works:

6> xmerl_scan:string(Exported, [{encoding, latin1}]).


Questions:

* Is this the expected behavior?

* Are there suggestions for a better way of doing the parsing or
  handling the UTF-8 strings?

Thanks,

+ seth

