Re: [erlang-questions] Handling UTF-8 data when parsing XML using xmerl

Roessner, Silvester silvester.roessner@REDACTED
Fri Sep 4 11:33:36 CEST 2009


Hi Seth,

I also had problems with Unicode support in xmerl.

My solution is to convert the list of Unicode code points
(which, in my case, I get from .NET)
into a list of UTF-8 bytes that xmerl can handle.

%% Convert a list of Unicode code points into a list of UTF-8 bytes.
fix_unicode(XmlString) ->
    Binary = unicode:characters_to_binary(XmlString, unicode),
    binary_to_list(Binary).
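
A slightly more defensive variant is sketched below. It assumes the input really is a list of code points (not bytes that are already UTF-8 encoded, which would end up encoded twice). The module name xml_unicode and the parse/1 helper are made up for illustration; only unicode:characters_to_binary/3 and xmerl_scan:string/1 are library calls.

```erlang
-module(xml_unicode).
-export([fix_unicode/1, parse/1]).

%% Encode a list of Unicode code points as a list of UTF-8 bytes.
%% unicode:characters_to_binary/3 can return an error or incomplete
%% tuple instead of a binary, so those cases are turned into explicit
%% errors rather than being passed on to binary_to_list/1.
fix_unicode(XmlString) ->
    case unicode:characters_to_binary(XmlString, unicode, utf8) of
        Binary when is_binary(Binary) ->
            binary_to_list(Binary);
        {error, _Encoded, Rest} ->
            erlang:error({invalid_unicode, Rest});
        {incomplete, _Encoded, Rest} ->
            erlang:error({incomplete_unicode, Rest})
    end.

%% Hypothetical helper: re-encode the code-point string, then hand the
%% resulting byte list to xmerl and return the root element.
parse(XmlString) ->
    {Root, _Rest} = xmerl_scan:string(fix_unicode(XmlString)),
    Root.
```

For example, xml_unicode:fix_unicode([72,105,32,8230]) gives [72,105,32,226,128,166], because code point 8230 (U+2026) is encoded in UTF-8 as the three bytes 226, 128, 166.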

Since I only parse XML data with xmerl
and use my own functions to output XML,
I can't tell you whether this fixes your problem.
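
One way to check would be a round trip like the one in the quoted message below: parse, export, re-encode, parse again, and compare the text. This is only a sketch. The module name xmerl_roundtrip and both functions are hypothetical, it assumes the root element contains only text, and the result for non-ASCII input may depend on the xmerl version.

```erlang
-module(xmerl_roundtrip).
-export([roundtrip_text/1]).
-include_lib("xmerl/include/xmerl.hrl").

%% Hypothetical helper: parse Utf8Xml (a list of UTF-8 bytes), export the
%% result, re-encode the exported code points as UTF-8 bytes, parse again,
%% and return the root element's text from both parses for comparison.
%% Crashes if the exported list cannot be encoded as UTF-8.
roundtrip_text(Utf8Xml) ->
    {Xml1, _} = xmerl_scan:string(Utf8Xml),
    Exported = lists:flatten(xmerl:export_simple([Xml1], xmerl_xml)),
    Utf8Again = binary_to_list(
                  unicode:characters_to_binary(Exported, unicode, utf8)),
    {Xml2, _} = xmerl_scan:string(Utf8Again),
    {text_of(Xml1), text_of(Xml2)}.

%% Concatenate the values of the text nodes directly under the element.
text_of(#xmlElement{content = Content}) ->
    lists:append([Value || #xmlText{value = Value} <- Content]).
```

If both elements of the returned tuple are equal, the re-encoding step preserved the text through export and re-parse.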

Hope my answer isn't too late 
- and that it is really related to your problem ;-)

Silvester

____________________
 
Carl Zeiss Vision GmbH
 
S i l v e s t e r   R ö ß n e r
Corporate IT Solutions Center
 
Team Head Calculation Engine / VI - IS5
 
-----Original Message-----
> From: erlang-questions@REDACTED
> [mailto:erlang-questions@REDACTED] On behalf of Seth Falcon
> Sent: Wednesday, 19 August 2009 18:17
> To: Erlang Questions
> Subject: [erlang-questions] Handling UTF-8 data when parsing
> XML using xmerl
> 
> Hi all,
> 
> I'm using xmerl to parse Atom feed data and have encountered 
> some surprising behavior with respect to how UTF-8 encoded 
> data is handled.
> 
> The problem I started to solve is as follows:
> 
>    Consider this XML:
> 
>       <entry>
>         <content type='xhtml'>
>           <a href='/blah'>blah</a>
>         </content>
>       </entry>
> 
>    The goal is to extract the contents of the <content> node as a
>    single string.  So parse_content(Xml) should return "<a
>    href='/blah'>blah</a>".
> 
> The approach I took was to use xmerl to parse the entire 
> document, and then use xmerl:export_simple/2 on the children 
> of <content> to recapture the text.  But in testing with 
> UTF-8 data, I'm finding that while xmerl will parse UTF-8 
> data, it cannot later handle the representation it creates 
> when calling xmerl:export_simple.
> 
> Here's an example of what I'm seeing:
> 
> First, here's the contents of file simple.xml (pasting the 
> UTF-8 below, crossing fingers that it comes across in email). 
>  The body of the title tag can be reproduced in an Erlang session as:
> 
>   HiThere = [72,105,32,8230,32,116,104,101,114,101].
> 
> 
> %% simple.xml:
> <?xml version="1.0" encoding="UTF-8"?>
> <title>Hi … there</title>
> 
> %% now here's what I see:
> 
> 2> {Xml, _} = xmerl_scan:file("simple.xml").
> {{xmlElement,title,title,[],
>              {xmlNamespace,[],[]},
>              [],1,[],
>              [{xmlText,[{title,1}],
>                        1,[],
>                        [72,105,32,8230,32,116,104,101,114,101],
>                        text}],
>              [],".",undeclared},
>  []}
> 
> 3> Exported = lists:flatten(xmerl:export_simple([Xml], xmerl_xml)).
> [60,63,120,109,108,32,118,101,114,115,105,111,110,61,34,49,
>  46,48,34,63,62,60,116,105,116,108,101,62,72|...]
> 
> 4> xmerl_scan:string(Exported).
> 3265- fatal: {error,{wfc_Legal_Character,{error,{bad_character,8230}}}}
> ** exception exit: {fatal,
>                        {{error,{wfc_Legal_Character,{error,{bad_character,8230}}}},
>                         {file,file_name_unknown},
>                         {line,1},
>                         {col,34}}}
>      in function  xmerl_scan:fatal/2
>      in call from xmerl_scan:scan_char_data/5
>      in call from xmerl_scan:scan_content/11
>      in call from xmerl_scan:scan_element/12
>      in call from xmerl_scan:scan_document/2
>      in call from xmerl_scan:string/2
> 
> 
> %% If I make the following transformation, things work again:
> 
> 5> xmerl_scan:string(binary_to_list(unicode:characters_to_binary(Exported))).
> {{xmlElement,title,title,[],
>              {xmlNamespace,[],[]},
>              [],1,[],
>              [{xmlText,[{title,1}],
>                        1,[],
>                        [72,105,32,8230,32,116,104,101,114,101],
>                        text}],
>              [],"/opt/seth/EVRI/sg/GIT/zgst/rods",undeclared},
>  []}
> 
> 
> %% and strangely, given that I think I do have valid UTF-8, this also
> %% works:
> 
> 6> xmerl_scan:string(Exported, [{encoding, latin1}]).
> 
> 
> Questions:
> 
> * Is this the expected behavior?
> 
> * Suggestions for a better way of doing the parsing or handling the
>   UTF-8 strings?
> 
> Thanks,
> 
> + seth
> 