[erlang-questions] utf-8 and xmerl

Tue Aug 18 17:29:22 CEST 2015

>
> Why do you want to have utf8 in the values ? unicode is much more
> generic and you select the encoding on the way out!

The thing is that I need to convert it to JSON, and mochijson (and jsx as
well) doesn't understand this iso-10646 representation. When mochijson gets
something that is not utf-8, it throws a `{ucs,{bad_utf8_character_code}}` error.
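
A minimal sketch of the conversion step needed before the JSON encoder,
assuming only the text of the <value> elements matters. The module and
function names are hypothetical; the xmerl_scan, xmerl_xpath and unicode
calls are the standard OTP ones. Each text node's list of code points is
encoded to a UTF-8 binary with unicode:characters_to_binary/1:

```erlang
-module(xml_text_utf8).
-export([values/1]).

%% Record definitions for #xmlText{} and friends.
-include_lib("xmerl/include/xmerl.hrl").

%% Hypothetical helper: parse a UTF-8 XML binary and return the text of
%% every <value> element as a UTF-8 encoded binary.
values(XmlBin) when is_binary(XmlBin) ->
    %% xmerl_scan returns text as lists of Unicode code points (é = 233).
    {Doc, _Rest} = xmerl_scan:string(binary_to_list(XmlBin)),
    %% Select the text nodes and encode each code-point list as UTF-8.
    [unicode:characters_to_binary(T#xmlText.value)
     || T <- xmerl_xpath:string("//value/text()", Doc)].
```

The resulting binaries contain c3 a9 for é, which should be what a
UTF-8-only JSON encoder expects.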

On 17 August 2015 at 16:26, Tony Rogvall <tony@REDACTED> wrote:

> Ok I see.
>
> So you expected to find utf8 in the text value instead of the unicode (
> 233 is the same in latin1 and unicode btw )
> But that is not how the xmerl works. It represents the characters in
> unicode iso-10646.
>
> Here is an example to get you a utf8 output. Bin is your binary.
>
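> % Decode the posted term_to_binary/1 output back into the UTF-8 XML binary.
> Term = binary_to_term(Bin).
> % Parse it; xmerl keeps text as lists of Unicode code points.
> {Content,_} = xmerl_scan:string(binary_to_list(Term)).
> % Serialise the tree back to XML as Unicode character data.
> UnicodeChars = xmerl:export([Content], xmerl_xml).
> % Pick the encoding (UTF-8) only on the way out.
> Utf8Bin = unicode:characters_to_binary(UnicodeChars).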
>
> I guess you could scan the xml structure (Content) and convert the text
> values
> to utf8 strings. But that would complicate the process when you want
> to format the output.
>
> Why do you want to have utf8 in the values ? unicode is much more
> generic and you select the encoding on the way out!
>
> /Tony
>
>
> > On 17 aug 2015, at 15:56, Alexander Turkin <snowwlex@REDACTED> wrote:
> >
> > Hey Tony,
> >
> > Yes, this binary is in utf8, and it is what is being fed to the xmerl
> > library, which returns it in another encoding for some reason.
> >
> > On 17 August 2015 at 13:55, Tony Rogvall <tony@REDACTED> wrote:
> > Sorry for the empty message :-)
> >
> > But the coding you are looking for is already in your binary.
> >
> >  <<131,109,0,0,0,79,60,63,120,109,108,32,118,101,114,115,
> >    105,111,110,61,34,49,46,48,34,32,101,110,99,111,100,105,
> >    110,103,61,34,85,84,70,45,56,34,63,62,60,114,101,115,
> >    112,111,110,115,101,62,60,118,97,108,117,101,62,82,101,
> >    110,195,169,60,47,118,97,108,117,101,62,60,47,114,101,
> >    115,112,111,110,115,101,62>>
> >
> > 195,169  = c3 a9
> >
> > /Tony
> >
> >
> >> On 17 aug 2015, at 14:39, Alexander Turkin <snowwlex@REDACTED> wrote:
> >>
> >> Hi Hynek,
> >>
> >> On 15 August 2015 at 08:30, Hynek Vychodil <vychodil.hynek@REDACTED>
> wrote:
> >> The same result is in R18, and it is the correct result. The letter é has
> >> Unicode code point 233, see http://unicode-table.com/en/#00E9
> >>
> >> Yeah, its code point is U+00E9 (= 233), but it is encoded as 2
> >> bytes in utf8: c3 a9
> >>
> >> U+00E9       é       c3 a9   LATIN SMALL LETTER E WITH ACUTE
> >>
> >>
> >> (http://www.utf8-chartable.de/)
> >>
> >> On Fri, Aug 14, 2015 at 6:23 PM, Éric Pailleau <
> eric.pailleau@REDACTED> wrote:
> >> Hello,
> >> Please specify which Erlang release you are using. UTF-8 support came
> >> late to Erlang.
> >> Regards
> >>
> >> On 14 August 2015 at 13:49, Alexander Turkin <snowwlex@REDACTED> wrote:
> >> >
> >> > Dear list,
> >> >
> >> >
> >> > I've got a problem with unicode & xmerl library.
> >> >
> >> > Input data for xmerl is utf-8 encoded xml, and what I get as the
> >> > result is encoded in latin1. But I need utf8!
> >> >
> >> >
> >> > EXAMPLES
> >> >
> >> > Body = <<"<?xml version=\"1.0\" encoding=\"UTF-8\"?><response><value>René</value></response>"/utf8>>.
> >> >
> >> > For the sake of portability, here is term_to_binary(Body):
> >> >
> >> > <<131,109,0,0,0,79,60,63,120,109,108,32,118,101,114,115,
> >> >   105,111,110,61,34,49,46,48,34,32,101,110,99,111,100,105,
> >> >   110,103,61,34,85,84,70,45,56,34,63,62,60,114,101,115,
> >> >   112,111,110,115,101,62,60,118,97,108,117,101,62,82,101,
> >> >   110,195,169,60,47,118,97,108,117,101,62,60,47,114,101,
> >> >   115,112,111,110,115,101,62>>
> >> >
> >> >
> >> >
> >> > (1):
> >> >
> >> > When I do
> >> >
> >> > xmerl_scan:string(binary_to_list(Body)).
> >> >
> >> > it returns
> >> >
> >> > {#xmlElement{name = response,expanded_name = response,
> >> >              nsinfo = [],
> >> >              namespace = #xmlNamespace{default = [],nodes = []},
> >> >              parents = [],pos = 1,attributes = [],
> >> >              content = [#xmlElement{name = value,expanded_name = value,
> >> >                                     nsinfo = [],
> >> >                                     namespace = #xmlNamespace{default = [],nodes = []},
> >> >                                     parents = [{response,1}],
> >> >                                     pos = 1,attributes = [],
> >> >                                     content = [#xmlText{parents = [{value,1},{response,1}],
> >> >                                                         pos = 1,language = [],
> >> >                                                         value = "René",
> >> >                                                         type = text}],
> >> >                                     language = [],xmlbase = "/Users/aturkin/ws/",
> >> >                                     elementdef = undeclared}],
> >> >              language = [],xmlbase = "/Users/aturkin/ws/",
> >> >              elementdef = undeclared},
> >> >  []}
> >> >
> >> >
> >> > Note the `value = "René"` string: é is the single element [233],
> >> > which looks like latin1.
> >> >
> >> >
> >> >
> >> >
> >> > (2):
> >> >
> >> > xmerl_scan:string(xmerl_ucs:to_utf8(binary_to_list(Body)))
> >> >
> >> > returns
> >> >
> >> > {#xmlElement{name = response,expanded_name = response,
> >> >              nsinfo = [],
> >> >              namespace = #xmlNamespace{default = [],nodes = []},
> >> >              parents = [],pos = 1,attributes = [],
> >> >              content = [#xmlElement{name = value,expanded_name = value,
> >> >                                     nsinfo = [],
> >> >                                     namespace = #xmlNamespace{default = [],nodes = []},
> >> >                                     parents = [{response,1}],
> >> >                                     pos = 1,attributes = [],
> >> >                                     content = [#xmlText{parents = [{value,1},{response,1}],
> >> >                                                         pos = 1,language = [],
> >> >                                                         value = "RenÃ©",
> >> >                                                         type = text}],
> >> >                                     language = [],xmlbase = "/Users/aturkin/ws/",
> >> >                                     elementdef = undeclared}],
> >> >              language = [],xmlbase = "/Users/aturkin/ws/",
> >> >              elementdef = undeclared},
> >> >  []}
> >> >
> >> > Now `value = "RenÃ©"`, so 2 bytes are used to encode this symbol, and
> >> > this is utf-8.
> >> >
> >> > So in (2) I get what I need, but why do I need to force that conversion
> >> > for xmerl?
> >> >
> >> >
> >> >
> >> >
> >> > QUESTIONS
> >> >
> >> > 1. I don't understand why xmerl_scan allows you to set the input
> >> > encoding, but there seems to be no way to set the output encoding. Is
> >> > there any way to make xmerl_scan return utf8 instead of latin1?
> >> >
> >> > 2. How does it happen that in (1) it does the conversion utf-8 ->
> >> > latin1, and in (2) the result is utf-8?
> >> >
> >> >
> >> >
> >> >
> >> > --
> >> > Best Regards,
> >> > Alex Turkin
> >> _______________________________________________
> >> erlang-questions mailing list
> >> erlang-questions@REDACTED
> >> http://erlang.org/mailman/listinfo/erlang-questions
>
>
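
On question 2 from the original post, a small sketch that may explain the
difference between (1) and (2). It assumes xmerl_scan decodes its input as
UTF-8, as the XML declaration says and as the output in (1) suggests; the
module name is hypothetical. In (2), xmerl_ucs:to_utf8/1 treats every input
byte as a code point and encodes it again, so the scanner's decoding step
only undoes that second encoding and the original bytes come back:

```erlang
-module(utf8_roundtrip).
-export([demo/0]).

%% Hypothetical demo module; unicode and xmerl_ucs are standard OTP modules.
demo() ->
    %% é as UTF-8 bytes, as found in the posted binary (c3 a9).
    Utf8Bytes = [195,169],
    %% Case (1): decoding the bytes as UTF-8 yields one code point, 233.
    [233] = unicode:characters_to_list(list_to_binary(Utf8Bytes), utf8),
    %% Case (2): each byte is treated as a code point and encoded again.
    Double = xmerl_ucs:to_utf8(Utf8Bytes),
    [195,131,194,169] = Double,
    %% Decoding the double-encoded bytes gives back the original byte values.
    Utf8Bytes = unicode:characters_to_list(list_to_binary(Double), utf8),
    ok.
```

So the "utf-8" seen in (2) would be a side effect of double encoding, which
happens to leave one byte per list element.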

-- 
Best Regards,
Alex Turkin