[erlang-questions] utf-8 and xmerl
Tony Rogvall
tony@REDACTED
Mon Aug 17 14:52:36 CEST 2015
> On 17 aug 2015, at 14:39, Alexander Turkin <snowwlex@REDACTED> wrote:
>
> Hi Hynek,
>
> On 15 August 2015 at 08:30, Hynek Vychodil <vychodil.hynek@REDACTED> wrote:
> The result is the same in R18, and it is the correct result. The letter é has Unicode code point 233; see http://unicode-table.com/en/#00E9
>
> Yeah, its code point is U+00E9 (= 233), but in UTF-8 it is encoded as 2 bytes: c3 a9
>
> U+00E9 é c3 a9 LATIN SMALL LETTER E WITH ACUTE
>
>
> (http://www.utf8-chartable.de/)
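
To make the code point versus encoding distinction concrete, here is a small sketch for the Erlang shell. It uses only the standard `unicode` module; the variable names are chosen for illustration only.

```erlang
%% U+00E9 as a code point: a single integer in an Erlang string (list).
CodePoint = 16#E9.

%% Encoding that code point as UTF-8 yields the two bytes c3 a9.
Utf8 = unicode:characters_to_binary([CodePoint], unicode, utf8).
%% Utf8 =:= <<195,169>>

%% Decoding the two bytes gives the single code point back
%% (the match fails if the round trip does not return 233).
[CodePoint] = unicode:characters_to_list(Utf8, utf8).
```

So 233 and c3 a9 do not contradict each other: the first is the character's number, the second is one particular byte encoding of that number.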
>
> On Fri, Aug 14, 2015 at 6:23 PM, Éric Pailleau <eric.pailleau@REDACTED> wrote:
> Hello,
> Please specify which Erlang release you are using. UTF-8 support came late to Erlang.
> Regards
>
> On 14 August 2015 at 13:49, Alexander Turkin <snowwlex@REDACTED> wrote:
> >
> > Dear list,
> >
> >
> > I've got a problem with Unicode and the xmerl library.
> >
> > The input data for xmerl is UTF-8 encoded XML, but the result I get is encoded in latin1. I need UTF-8!
> >
> >
> > EXAMPLES
> >
> > Body = <<"<?xml version=\"1.0\" encoding=\"UTF-8\"?><response><value>René</value></response>"/utf8>>.
> >
> > (for the sake of portability here is term_to_binary(Body):
> >
> > <<131,109,0,0,0,79,60,63,120,109,108,32,118,101,114,115,
> > 105,111,110,61,34,49,46,48,34,32,101,110,99,111,100,105,
> > 110,103,61,34,85,84,70,45,56,34,63,62,60,114,101,115,
> > 112,111,110,115,101,62,60,118,97,108,117,101,62,82,101,
> > 110,195,169,60,47,118,97,108,117,101,62,60,47,114,101,
> > 115,112,111,110,115,101,62>>)
> >
> >
> >
> > (1):
> >
> > When I do
> >
> > xmerl_scan:string(binary_to_list(Body)).
> >
> > it returns
> >
> > {#xmlElement{name = response,expanded_name = response,
> >              nsinfo = [],
> >              namespace = #xmlNamespace{default = [],nodes = []},
> >              parents = [],pos = 1,attributes = [],
> >              content = [#xmlElement{name = value,expanded_name = value,
> >                                     nsinfo = [],
> >                                     namespace = #xmlNamespace{default = [],nodes = []},
> >                                     parents = [{response,1}],
> >                                     pos = 1,attributes = [],
> >                                     content = [#xmlText{parents = [{value,1},{response,1}],
> >                                                         pos = 1,language = [],
> >
> >
> >                                                         value = "René",
> >
> >
> >                                                         type = text}],
> >                                     language = [],xmlbase = "/Users/aturkin/ws/",
> >                                     elementdef = undeclared}],
> >              language = [],xmlbase = "/Users/aturkin/ws/",
> >              elementdef = undeclared},
> >  []}
> >
> >
> > Note the `value = "René"` string: é is represented by the single element [233], which is latin1.
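
One possible reading of (1), offered as an assumption rather than a statement about xmerl internals: the value is a list of Unicode code points, and 233 happens to be both the code point of é and its latin1 byte, because Latin-1 covers the first 256 code points. Under that assumption, a UTF-8 binary can be obtained with the standard `unicode` module. The list below is the value from (1) written out as integers; the variable names are illustrative.

```erlang
%% The text value from (1), as a list of code points: "René".
Value = [82,101,110,233].

%% characters_to_binary/1 encodes a code point list as a UTF-8 binary.
Utf8 = unicode:characters_to_binary(Value).
%% Utf8 =:= <<82,101,110,195,169>>

%% If a byte list is needed instead of a binary:
Utf8Bytes = binary_to_list(Utf8).
%% Utf8Bytes =:= [82,101,110,195,169]
```

This converts at the point where UTF-8 bytes are actually needed (writing to a socket or file), and keeps the parsed text as code points everywhere else.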
> >
> >
> >
> >
> > (2):
> >
> > xmerl_scan:string(xmerl_ucs:to_utf8(binary_to_list(Body)))
> >
> > returns
> >
> > {#xmlElement{name = response,expanded_name = response,
> >              nsinfo = [],
> >              namespace = #xmlNamespace{default = [],nodes = []},
> >              parents = [],pos = 1,attributes = [],
> >              content = [#xmlElement{name = value,expanded_name = value,
> >                                     nsinfo = [],
> >                                     namespace = #xmlNamespace{default = [],nodes = []},
> >                                     parents = [{response,1}],
> >                                     pos = 1,attributes = [],
> >                                     content = [#xmlText{parents = [{value,1},{response,1}],
> >                                                         pos = 1,language = [],
> >
> >
> >                                                         value = "René",
> >
> >
> >                                                         type = text}],
> >                                     language = [],xmlbase = "/Users/aturkin/ws/",
> >                                     elementdef = undeclared}],
> >              language = [],xmlbase = "/Users/aturkin/ws/",
> >              elementdef = undeclared},
> >  []}
> >
> > Now `value = "René"`, where 2 bytes are used to encode this symbol, which is UTF-8.
> >
> > So in (2) I get what I need, but why do I need to force that conversion for xmerl?
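
A sketch of what may be happening in (2), under the assumption that xmerl decodes the declared UTF-8 input exactly once: `binary_to_list(Body)` already yields UTF-8 bytes, and `xmerl_ucs:to_utf8/1` treats each of those bytes as a code point and encodes it again. Decoding once then leaves the original bytes behind as if they were characters. Only the two bytes of é are used here to keep the example short; the variable names are illustrative.

```erlang
%% UTF-8 bytes of "é", as binary_to_list/1 returns them for the UTF-8 body.
Bytes = [195,169].

%% Each element is taken as a code point, so each byte is encoded a second time.
Twice = xmerl_ucs:to_utf8(Bytes).
%% Twice =:= [195,131,194,169]

%% Decoding once, as a UTF-8 parser would, gives back the original bytes,
%% now as two separate code points (the characters Ã and ©), not the single é.
Decoded = unicode:characters_to_list(list_to_binary(Twice), utf8).
%% Decoded =:= [195,169]
```

The list in (2) therefore only coincides with the UTF-8 bytes; as text it holds two characters where (1) holds one.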
> >
> >
> >
> >
> > QUESTIONS
> >
> > 1. I don't understand why xmerl_scan lets you set the input encoding, while there seems to be no way to set the output encoding. Is there any way to make xmerl_scan return UTF-8 instead of latin1?
> >
> > 2. How does it happen that in (1) it converts UTF-8 to latin1, while in (2) the result is UTF-8?
> >
> >
> >
> >
> > --
> > Best Regards,
> > Alex Turkin
> _______________________________________________
> erlang-questions mailing list
> erlang-questions@REDACTED
> http://erlang.org/mailman/listinfo/erlang-questions
>
>
>
>
> --
> Best Regards,
> Alex Turkin