[bug & patch] xmerl_scan doesn't decode &#x refs properly

Mon Jun 7 18:17:47 CEST 2010

Hello,

There is a bug in xmerl_scan: it does not decode hexadecimal character references (&#x...;) properly.

Consider the following code:

{UTF8Output, []} = xmerl_scan:string("<?xml version=\"1\" ?>\n<text>" ++ [229, 145, 156] ++ "</text>"),
#xmlElement{content = [#xmlText{value = UTF8Text}]} = UTF8Output,
{DecEntityOutput, []} = xmerl_scan:string("<?xml version=\"1\" ?>\n<text>&#21596;</text>"),
#xmlElement{content = [#xmlText{value = DecEntityText}]} = DecEntityOutput,
{HexEntityOutput, []} = xmerl_scan:string("<?xml version=\"1\" ?>\n<text>&#x545C;</text>"),
#xmlElement{content = [#xmlText{value = HexEntityText}]} = HexEntityOutput,

UTF8Text and DecEntityText are equal and have the expected value, [16#545C].
HexEntityText is, incorrectly, a list of the three UTF-8 bytes [229, 145, 156], whereas it should also be equal to [16#545C].
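
For reference, the relationship between the two values can be checked with the standard unicode module. This is only a minimal sketch: the module name hex_ref_check and the helpers utf8_bytes/1 and codepoints/1 are made up for illustration and are not part of xmerl or of the patch.

```erlang
-module(hex_ref_check).
-export([utf8_bytes/1, codepoints/1]).

%% Hypothetical helper: encode a list of Unicode code points
%% as a list of UTF-8 bytes.
utf8_bytes(CodePoints) ->
    binary_to_list(unicode:characters_to_binary(CodePoints, unicode, utf8)).

%% Hypothetical helper: decode a list of UTF-8 bytes back into
%% a list of Unicode code points.
codepoints(Bytes) ->
    unicode:characters_to_list(list_to_binary(Bytes), utf8).
```

With these helpers, hex_ref_check:utf8_bytes([16#545C]) gives the byte list currently returned for the hexadecimal reference, and hex_ref_check:codepoints([229, 145, 156]) gives the single code point that the scanner should return instead.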

A patch with a test case can be found here:

git fetch git://github.com/pguyot/otp.git pg/xmerl_scan_hex_entities

Regards,

Paul
-- 
Semiocast                    http://semiocast.com/
+33.175000290 - 62 bis rue Gay-Lussac, 75005 Paris