[erlang-bugs] Bug in xmerl

Fri Jun 27 14:57:26 CEST 2008

It seems there is a bug in xmerl when parsing elements whose text contains a
numeric character reference followed by non-ASCII UTF-8 characters.

Example: é, the character reference &#xD;, é

1> element(1, xmerl_scan:string("<a>\303\251&#xD;\303\251</a>",
                                [{encoding, 'utf-8'}])).
{xmlElement,a,a,[],
            {xmlNamespace,[],[]},
            [],1,[],
            [{xmlText,[{a,1}],1,[],"\303\251",text},
             {xmlText,[{a,1}],2,[],[10,195,131,194,169],text}],
            [],"/",undeclared}

Xmerl splits the text around the newline character (strange, but OK).
However, the first part is encoded correctly, while the second part is
garbled: the two bytes of é (195,169) come back as 195,131,194,169.
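
For what it's worth, the garbled bytes look like a second round of UTF-8
encoding. Below is a minimal sketch for the Erlang shell, assuming each input
byte is treated as a Latin-1 code point and encoded again. That is a guess at
the symptom, not a confirmed diagnosis of xmerl. It uses the unicode module,
which may not be available in older OTP releases.

```erlang
%% The two bytes of "é" in UTF-8, as written in the input ("\303\251").
Utf8 = [195,169],

%% Assumption: every byte is read as a Latin-1 code point and then
%% encoded to UTF-8 a second time.
Twice = binary_to_list(unicode:characters_to_binary(Utf8, latin1, utf8)),

%% This matches the tail of the second xmlText value in the output above.
[195,131,194,169] = Twice.
```

The match succeeds, so the second text node is at least consistent with the
bytes after the character reference having been encoded twice.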

It's worth noting that attribute values are encoded correctly:

2> element(1, xmerl_scan:string("<a b=\"\303\251&#xD;\303\251\"/>",
                                [{encoding, 'utf-8'}])).
{xmlElement,a,a,[],
            {xmlNamespace,[],[]},
            [],1,
            [{xmlAttribute,b,[],[],[],[],1,[],"\303\251 \303\251",false}],
            [],[],"/",undeclared}

- Mikkel