[erlang-questions] Bug in xmerl

Thu Jun 26 16:09:44 CEST 2008

It seems there is a bug in xmerl when parsing element content in which a
numeric character reference is followed by multi-byte UTF-8 characters.

Example: é, then the character reference &#xD;, then é

1> element(1, xmerl_scan:string("<a>\303\251&#xD;\303\251</a>",
                                [{encoding, 'utf-8'}])).
{xmlElement,a,a,[],
            {xmlNamespace,[],[]},
            [],1,[],
            [{xmlText,[{a,1}],1,[],"\303\251",text},
             {xmlText,[{a,1}],2,[],[10,195,131,194,169],text}],
            [],"/",undeclared}

xmerl splits the text into two xmlText nodes at the character reference
(strange, but OK). However, the first node holds the correct bytes for é
(195,169), while in the second node the é comes out garbled as
195,131,194,169!
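
One possible reading of the garbled bytes, offered as a guess about the
symptom rather than a diagnosis of xmerl's internals: 195,131,194,169 is
exactly what you get when the UTF-8 bytes of é (195,169) are treated as
Latin-1 code points and encoded to UTF-8 a second time. A minimal sketch to
check that arithmetic follows. The module and function names are made up for
illustration, and it assumes the stdlib unicode module is available, which
may not be the case on older releases.

```erlang
-module(double_encode_demo).
-export([reencode/1]).

%% Hypothetical helper, not part of xmerl.
%% Treats every element of Bytes as a Latin-1 code point (0..255) and
%% encodes the list to UTF-8, returning the result as a list of bytes.
%% Feeding it bytes that are already UTF-8 reproduces double encoding.
reencode(Bytes) when is_list(Bytes) ->
    binary_to_list(unicode:characters_to_binary(Bytes, latin1, utf8)).
```

Calling double_encode_demo:reencode("\303\251") returns [195,131,194,169],
which matches the tail of the second xmlText node above.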

It's worth noting that the same value in an attribute is handled correctly:

2> element(1, xmerl_scan:string("<a b=\"\303\251&#xD;\303\251\"/>",
                                [{encoding, 'utf-8'}])).
{xmlElement,a,a,[],
            {xmlNamespace,[],[]},
            [],1,
            [{xmlAttribute,b,[],[],[],[],1,[],"\303\251 \303\251",false}],
            [],[],"/",undeclared}

Can someone confirm whether this is a bug?

- Mikkel