[erlang-questions] utf-8 and xmerl

Alexander Turkin snowwlex@REDACTED
Mon Aug 17 15:56:56 CEST 2015


Hey Tony,

Yes, this binary is in UTF-8, and it is exactly what is being fed to the
xmerl library, which for some reason returns it in a different encoding.
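
For reference, here is a minimal sketch of one way to get UTF-8 bytes back
out of the parsed document. It assumes, as the output quoted below suggests,
that xmerl_scan hands text back as a list of Unicode code points (é = 233),
which only looks like latin1 because the first 256 code points coincide with
it. The module and function names are made up for illustration, and the
pattern match only fits a document shaped like the example in this thread.

```erlang
-module(xmerl_utf8_sketch).
-export([value_as_utf8/1]).

%% Record definitions for #xmlElement{} and #xmlText{}.
-include_lib("xmerl/include/xmerl.hrl").

%% Hypothetical helper: Body is a UTF-8 encoded XML binary of the form
%% <response><value>...</value></response>, as in the example below.
value_as_utf8(Body) when is_binary(Body) ->
    %% xmerl_scan:string/1 takes a list of bytes and decodes it according
    %% to the encoding declared in the XML prolog.
    {Doc, _Rest} = xmerl_scan:string(binary_to_list(Body)),
    #xmlElement{content = [#xmlElement{content = [#xmlText{value = Chars}]}]} = Doc,
    %% Chars is a list of code points, e.g. [82,101,110,233] for "René".
    %% Encode those code points back into a UTF-8 binary.
    unicode:characters_to_binary(Chars, unicode, utf8).
```

With the Body from the original message, the result should end in the bytes
195,169 (c3 a9) for é rather than the single value 233.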

On 17 August 2015 at 13:55, Tony Rogvall <tony@REDACTED> wrote:

> Sorry for the empty message :-)
>
> But the coding you are looking for is already in your binary.
>
>  <<131,109,0,0,0,79,60,63,120,109,108,32,118,101,114,115,
>    105,111,110,61,34,49,46,48,34,32,101,110,99,111,100,105,
>    110,103,61,34,85,84,70,45,56,34,63,62,60,114,101,115,
>    112,111,110,115,101,62,60,118,97,108,117,101,62,82,101,
>    110,195,169,60,47,118,97,108,117,101,62,60,47,114,101,
>    115,112,111,110,115,101,62>>
>
> 195,169  = c3 a9
>
> /Tony
>
>
> On 17 aug 2015, at 14:39, Alexander Turkin <snowwlex@REDACTED> wrote:
>
> Hi Hynek,
>
> On 15 August 2015 at 08:30, Hynek Vychodil <vychodil.hynek@REDACTED>
> wrote:
>
>> The result is the same in R18, and it is the correct result. The letter é
>> has Unicode code point 233; see http://unicode-table.com/en/#00E9
>>
>
> Yeah, its code point is U+00E9 (= 233), but in utf8 it is encoded as
> 2 bytes: c3 a9
>
> U+00E9    é    c3 a9    LATIN SMALL LETTER E WITH ACUTE
>
>
> (http://www.utf8-chartable.de/)
>
>>
>> On Fri, Aug 14, 2015 at 6:23 PM, Éric Pailleau <eric.pailleau@REDACTED>
>> wrote:
>>
>>> Hello,
>>> Please specify which Erlang release you are using. UTF-8 support came
>>> late to Erlang.
>>> Regards
>>>
>>> Le 14 août 2015 13:49, Alexander Turkin <snowwlex@REDACTED> a écrit :
>>> >
>>> > Dear list,
>>> >
>>> >
>>> > I've got a problem with unicode and the xmerl library.
>>> >
>>> > The input data for xmerl is utf-8 encoded xml, but the result I get
>>> > back is encoded in latin1. But I need utf8!
>>> >
>>> >
>>> > EXAMPLES
>>> >
>>> > Body = <<"<?xml version=\"1.0\" encoding=\"UTF-8\"?><response><value>René</value></response>"/utf8>>.
>>> >
>>> > For the sake of portability, here is term_to_binary(Body):
>>> >
>>> > <<131,109,0,0,0,79,60,63,120,109,108,32,118,101,114,115,
>>> >   105,111,110,61,34,49,46,48,34,32,101,110,99,111,100,105,
>>> >   110,103,61,34,85,84,70,45,56,34,63,62,60,114,101,115,
>>> >   112,111,110,115,101,62,60,118,97,108,117,101,62,82,101,
>>> >   110,195,169,60,47,118,97,108,117,101,62,60,47,114,101,
>>> >   115,112,111,110,115,101,62>>
>>> >
>>> >
>>> >
>>> > (1):
>>> >
>>> > When I do
>>> >
>>> > xmerl_scan:string(binary_to_list(Body)).
>>> >
>>> > it returns
>>> >
>>> > {#xmlElement{name = response,expanded_name = response,
>>> >              nsinfo = [],
>>> >              namespace = #xmlNamespace{default = [],nodes = []},
>>> >              parents = [],pos = 1,attributes = [],
>>> >              content = [#xmlElement{name = value,expanded_name = value,
>>> >                                     nsinfo = [],
>>> >                                     namespace = #xmlNamespace{default = [],nodes = []},
>>> >                                     parents = [{response,1}],
>>> >                                     pos = 1,attributes = [],
>>> >                                     content = [#xmlText{parents = [{value,1},{response,1}],
>>> >                                                         pos = 1,language = [],
>>> >                                                         value = "René",
>>> >                                                         type = text}],
>>> >                                     language = [],xmlbase = "/Users/aturkin/ws/",
>>> >                                     elementdef = undeclared}],
>>> >              language = [],xmlbase = "/Users/aturkin/ws/",
>>> >              elementdef = undeclared},
>>> >  []}
>>> >
>>> >
>>> > Note the `value = "René"` string: the é is a single element, [233],
>>> > which is latin1.
>>> >
>>> >
>>> >
>>> >
>>> > (2):
>>> >
>>> > xmerl_scan:string(xmerl_ucs:to_utf8(binary_to_list(Body)))
>>> >
>>> > returns
>>> >
>>> > {#xmlElement{name = response,expanded_name = response,
>>> >              nsinfo = [],
>>> >              namespace = #xmlNamespace{default = [],nodes = []},
>>> >              parents = [],pos = 1,attributes = [],
>>> >              content = [#xmlElement{name = value,expanded_name = value,
>>> >                                     nsinfo = [],
>>> >                                     namespace = #xmlNamespace{default = [],nodes = []},
>>> >                                     parents = [{response,1}],
>>> >                                     pos = 1,attributes = [],
>>> >                                     content = [#xmlText{parents = [{value,1},{response,1}],
>>> >                                                         pos = 1,language = [],
>>> >                                                         value = "René",
>>> >                                                         type = text}],
>>> >                                     language = [],xmlbase = "/Users/aturkin/ws/",
>>> >                                     elementdef = undeclared}],
>>> >              language = [],xmlbase = "/Users/aturkin/ws/",
>>> >              elementdef = undeclared},
>>> >  []}
>>> >
>>> > Now `value = "René"`, so 2 bytes are used to encode this symbol, and
>>> > this is utf-8.
>>> >
>>> > So in (2) I get what I need, but why do I need to force that
>>> > conversion for xmerl?
>>> >
>>> >
>>> >
>>> >
>>> > QUESTIONS
>>> >
>>> > 1. I don't understand why xmerl_scan lets you set the input encoding
>>> > but apparently offers no way to set the output encoding. Is there any
>>> > way to make xmerl_scan return utf8 instead of latin1?
>>> >
>>> > 2. How does it happen that in (1) it converts utf-8 -> latin1, while
>>> > in (2) the result is utf-8?
>>> >
>>> >
>>> >
>>> >
>>> > --
>>> > Best Regards,
>>> > Alex Turkin
>>> _______________________________________________
>>> erlang-questions mailing list
>>> erlang-questions@REDACTED
>>> http://erlang.org/mailman/listinfo/erlang-questions
>>>
>>
>>
>
>
> --
> Best Regards,
> Alex Turkin


-- 
Best Regards,
Alex Turkin