[erlang-questions] utf-8 and xmerl

Tony Rogvall tony@REDACTED
Tue Aug 18 19:09:55 CEST 2015


Then I suggest you either use another JSON module that supports Unicode, or
scan the XML structure and convert the text values to UTF-8 while
you are converting the XML into your internal JSON structure.
Must be super simple!
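
A minimal sketch of the second option, assuming the xmerl records quoted further down in this thread. The module name xml_utf8 and the functions convert/1 and demo/0 are made up for illustration, and the nested-list output is only a stand-in for whatever structure your JSON encoder expects:

```erlang
-module(xml_utf8).
-export([convert/1, demo/0]).

%% Record definitions for #xmlElement{} and #xmlText{}.
-include_lib("xmerl/include/xmerl.hrl").

%% Walk the parsed tree. Element names are kept as atoms; text values,
%% which xmerl holds as lists of Unicode code points, become UTF-8 binaries.
convert(#xmlElement{name = Name, content = Content}) ->
    [{Name, lists:flatmap(fun convert/1, Content)}];
convert(#xmlText{value = Value}) ->
    %% e.g. [82,101,110,233] -> <<82,101,110,195,169>>
    [unicode:characters_to_binary(Value)];
convert(_Other) ->
    %% Comments, processing instructions etc. are dropped in this sketch.
    [].

%% Same document as in the original question; 195,169 is the UTF-8
%% encoding of U+00E9.
demo() ->
    Body = <<"<?xml version=\"1.0\" encoding=\"UTF-8\"?>",
             "<response><value>Ren", 195, 169, "</value></response>">>,
    {Root, _Rest} = xmerl_scan:string(binary_to_list(Body)),
    convert(Root).
```

The point of doing it this way is that the text stays as code points inside the xmerl tree and is only encoded at the moment it is handed to the JSON side.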

/Tony

> On 18 aug 2015, at 17:29, Alexander Turkin <snowwlex@REDACTED> wrote:
> 
> > Why do you want to have utf8 in the values? unicode is much more
> > generic and you select the encoding on the way out!
> 
> The thing is that I need to convert it to JSON, and mochijson (and jsx as well) don't understand this iso-10646 representation. When mochijson gets something that is not utf-8, it throws a `{ucs,{bad_utf8_character_code}}` error.
> 
> 
> 
> 
> On 17 August 2015 at 16:26, Tony Rogvall <tony@REDACTED> wrote:
> Ok I see.
> 
> So you expected to find utf8 in the text value instead of unicode (233 is the same in latin1 and unicode, btw).
> But that is not how xmerl works. It represents the characters as unicode (iso-10646) code points.
> 
> Here is an example to get you a utf8 output. Bin is your binary.
> 
> Term = binary_to_term(Bin).
> {Content,_} = xmerl_scan:string(binary_to_list(Term)).
> UnicodeChars = xmerl:export([Content], xmerl_xml).
> Utf8Bin = unicode:characters_to_binary(UnicodeChars).
> 
> I guess you could scan the xml structure (Content) and convert the text values
> to utf8 strings. But that would complicate the process when you want
> to format the output.
> 
> Why do you want to have utf8 in the values? unicode is much more
> generic and you select the encoding on the way out!
> 
> /Tony
> 
> 
> > On 17 aug 2015, at 15:56, Alexander Turkin <snowwlex@REDACTED> wrote:
> >
> > Hey Tony,
> >
> > Yes, this binary is in utf8, and it is what is being fed to the xmerl library, which for some reason returns it in another encoding.
> >
> > On 17 August 2015 at 13:55, Tony Rogvall <tony@REDACTED> wrote:
> > Sorry for the empty message :-)
> >
> > But the encoding you are looking for is already in your binary.
> >
> >  <<131,109,0,0,0,79,60,63,120,109,108,32,118,101,114,115,
> >    105,111,110,61,34,49,46,48,34,32,101,110,99,111,100,105,
> >    110,103,61,34,85,84,70,45,56,34,63,62,60,114,101,115,
> >    112,111,110,115,101,62,60,118,97,108,117,101,62,82,101,
> >    110,195,169,60,47,118,97,108,117,101,62,60,47,114,101,
> >    115,112,111,110,115,101,62>>
> >
> > 195,169  = c3 a9
> >
> > /Tony
> >
> >
> >> On 17 aug 2015, at 14:39, Alexander Turkin <snowwlex@REDACTED> wrote:
> >>
> >> Hi Hynek,
> >>
> >> On 15 August 2015 at 08:30, Hynek Vychodil <vychodil.hynek@REDACTED> wrote:
> >> The same result is in R18, and it is the correct result. The letter é has unicode code point 233, see http://unicode-table.com/en/#00E9
> >>
> >> Yeah, it has U+00E9 (= 233) code point number, but it is coded in 2 bytes in utf8: c3 a9
> >>
> >> U+00E9       é       c3 a9   LATIN SMALL LETTER E WITH ACUTE
> >>
> >>
> >> (http://www.utf8-chartable.de/)
> >>
> >> On Fri, Aug 14, 2015 at 6:23 PM, Éric Pailleau <eric.pailleau@REDACTED> wrote:
> >> Hello,
> >> Please specify which Erlang release you are using. Utf8 support came late to Erlang.
> >> Regards
> >>
> >> On 14 August 2015 at 13:49, Alexander Turkin <snowwlex@REDACTED> wrote:
> >> >
> >> > Dear list,
> >> >
> >> >
> >> > I've got a problem with unicode & xmerl library.
> >> >
> >> > Input data for xmerl is utf-8 encoded xml, and what I get as the result is encoded as latin1. But I need utf8!
> >> >
> >> >
> >> > EXAMPLES
> >> >
> >> > Body = <<"<?xml version=\"1.0\" encoding=\"UTF-8\"?><response><value>René</value></response>"/utf8>>.
> >> >
> >> > For the sake of portability, here is term_to_binary(Body):
> >> >
> >> > <<131,109,0,0,0,79,60,63,120,109,108,32,118,101,114,115,
> >> >   105,111,110,61,34,49,46,48,34,32,101,110,99,111,100,105,
> >> >   110,103,61,34,85,84,70,45,56,34,63,62,60,114,101,115,
> >> >   112,111,110,115,101,62,60,118,97,108,117,101,62,82,101,
> >> >   110,195,169,60,47,118,97,108,117,101,62,60,47,114,101,
> >> >   115,112,111,110,115,101,62>>
> >> >
> >> >
> >> >
> >> > (1):
> >> >
> >> > When I do
> >> >
> >> > xmerl_scan:string(binary_to_list(Body)).
> >> >
> >> > it returns
> >> >
> >> > {#xmlElement{name = response,expanded_name = response,
> >> >              nsinfo = [],
> >> >              namespace = #xmlNamespace{default = [],nodes = []},
> >> >              parents = [],pos = 1,attributes = [],
> >> >              content = [#xmlElement{name = value,expanded_name = value,
> >> >                                     nsinfo = [],
> >> >                                     namespace = #xmlNamespace{default = [],nodes = []},
> >> >                                     parents = [{response,1}],
> >> >                                     pos = 1,attributes = [],
> >> >                                     content = [#xmlText{parents = [{value,1},{response,1}],
> >> >                                                         pos = 1,language = [],
> >> >
> >> >
> >> >                                                         value = "René",
> >> >
> >> >
> >> >                                                         type = text}],
> >> >                                     language = [],xmlbase = "/Users/aturkin/ws/",
> >> >                                     elementdef = undeclared}],
> >> >              language = [],xmlbase = "/Users/aturkin/ws/",
> >> >              elementdef = undeclared},
> >> >  []}
> >> >
> >> >
> >> > So, note the `value = "René"` string: it uses the [233] symbol, which is latin1.
> >> >
> >> >
> >> >
> >> >
> >> > (2):
> >> >
> >> > xmerl_scan:string(xmerl_ucs:to_utf8(binary_to_list(Body)))
> >> >
> >> > returns
> >> >
> >> > {#xmlElement{name = response,expanded_name = response,
> >> >              nsinfo = [],
> >> >              namespace = #xmlNamespace{default = [],nodes = []},
> >> >              parents = [],pos = 1,attributes = [],
> >> >              content = [#xmlElement{name = value,expanded_name = value,
> >> >                                     nsinfo = [],
> >> >                                     namespace = #xmlNamespace{default = [],nodes = []},
> >> >                                     parents = [{response,1}],
> >> >                                     pos = 1,attributes = [],
> >> >                                     content = [#xmlText{parents = [{value,1},{response,1}],
> >> >                                                         pos = 1,language = [],
> >> >
> >> >
> >> >                                                         value = "René",
> >> >
> >> >
> >> >                                                         type = text}],
> >> >                                     language = [],xmlbase = "/Users/aturkin/ws/",
> >> >                                     elementdef = undeclared}],
> >> >              language = [],xmlbase = "/Users/aturkin/ws/",
> >> >              elementdef = undeclared},
> >> >  []}
> >> >
> >> > Now `value = "René"` ends in [195,169], so 2 bytes are used to code this symbol, and this is utf-8.
> >> >
> >> > So in (2) I get what I need, but why do I need to force that conversion for xmerl?
> >> >
> >> >
> >> >
> >> >
> >> > QUESTIONS
> >> >
> >> > 1. I don't understand why xmerl_scan allows you to set the input encoding, but there seems to be no way to set the output encoding. Is there any way to make xmerl_scan return utf8 instead of latin1?
> >> >
> >> > 2. How does it happen that in (1) it converts utf-8 -> latin1, while in (2) the result is utf-8?
> >> >
> >> >
> >> >
> >> >
> >> > --
> >> > Best Regards,
> >> > Alex Turkin
> >> _______________________________________________
> >> erlang-questions mailing list
> >> erlang-questions@REDACTED
> >> http://erlang.org/mailman/listinfo/erlang-questions
> >>
> >>
> >>
> >>
> >> --
> >> Best Regards,
> >> Alex Turkin
> >
> >
> >
> >
> > --
> > Best Regards,
> > Alex Turkin
> 
> 
> 
> 
> --
> Best Regards,
> Alex Turkin
