[erlang-questions] Bug in xmerl

Mikkel Jensen mj@REDACTED
Thu Jun 26 16:09:44 CEST 2008


It seems there is a bug in xmerl when parsing element content that contains a
numeric character reference followed by UTF-8 characters.

Example: é, then the character reference &#xD; (carriage return), then é

1> element(1, xmerl_scan:string("<a>\303\251&#xD;\303\251</a>", [{encoding,
'utf-8'}])).
{xmlElement,a,a,[],
            {xmlNamespace,[],[]},
            [],1,[],
            [{xmlText,[{a,1}],1,[],"\303\251",text},
             {xmlText,[{a,1}],2,[],[10,195,131,194,169],text}],
            [],"/",undeclared}

xmerl splits the parsed value into two xmlText nodes around the newline
character (strange but ok). However, the first node is encoded correctly
(195,169), while in the second the same two bytes come back garbled as four
(195,131,194,169)!
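
The garbled bytes look like a double encoding: 195,131 is the UTF-8 form of
code point 195 and 194,169 is the UTF-8 form of code point 169, so the two
bytes of é appear to have been treated as two separate Latin-1 characters and
encoded to UTF-8 a second time. The sketch below only reproduces that byte
pattern; it is not taken from xmerl and does not show where in xmerl this
happens. The module and function names are hypothetical, and it assumes the
`unicode` module from OTP is available.

```erlang
%% Hypothetical helper, not part of xmerl: UTF-8-encode a list of bytes
%% as if every byte were a Latin-1 character.
-module(double_encode).
-export([garble/1]).

%% Utf8Bytes is a list of bytes that is already UTF-8 encoded,
%% for example [195,169] for "é".
garble(Utf8Bytes) ->
    %% Reading the input as latin1 makes each byte its own code point,
    %% so the conversion to utf8 encodes the data a second time.
    binary_to_list(unicode:characters_to_binary(Utf8Bytes, latin1, utf8)).
```

Calling `double_encode:garble([195,169])` returns `[195,131,194,169]`, the
same four bytes that follow the 10 in the second xmlText node above.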

It's worth noticing that attribute values are encoded correctly:

2> element(1, xmerl_scan:string("<a b=\"\303\251&#xD;\303\251\"/>",
[{encoding, 'utf-8'}])).
{xmlElement,a,a,[],
            {xmlNamespace,[],[]},
            [],1,
            [{xmlAttribute,b,[],[],[],[],1,[],"\303\251 \303\251",false}],
            [],[],"/",undeclared}

Can someone confirm whether this is a bug?

- Mikkel