[erlang-questions] utf-8 and xmerl

Mon Aug 17 14:39:44 CEST 2015

Hi Hynek,

On 15 August 2015 at 08:30, Hynek Vychodil <vychodil.hynek@REDACTED> wrote:

> The same result is in R18 and it is the correct result. The letter é has
> Unicode code point 233, see http://unicode-table.com/en/#00E9
>

Yes, its code point is U+00E9 (= 233), but in UTF-8 it is encoded as
2 bytes: c3 a9

U+00E9    é    c3 a9    LATIN SMALL LETTER E WITH ACUTE

(http://www.utf8-chartable.de/)
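
To make the distinction concrete, here is a minimal sketch of code point versus UTF-8 bytes using the standard unicode module. The module name utf8_demo is only an illustration, not something from this thread.

```erlang
-module(utf8_demo).
-export([run/0]).

run() ->
    %% The code point of é is a single integer, 16#E9 = 233.
    CodePoint = 16#E9,
    %% Encoding that code point as UTF-8 yields two bytes: 16#C3, 16#A9.
    Utf8 = unicode:characters_to_binary([CodePoint], unicode, utf8),
    <<16#C3, 16#A9>> = Utf8,
    %% Decoding goes the other way, back to a one-element code point list.
    [CodePoint] = unicode:characters_to_list(Utf8, utf8),
    {CodePoint, Utf8}.
```

So the same character is 233 as a code point and c3 a9 as UTF-8 bytes, depending on which representation you are looking at.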

>
> On Fri, Aug 14, 2015 at 6:23 PM, Éric Pailleau <eric.pailleau@REDACTED>
> wrote:
>
>> Hello,
>> Please specify which Erlang release you are using. UTF-8 support came
>> late to Erlang.
>> Regards
>>
>> On 14 August 2015 13:49, Alexander Turkin <snowwlex@REDACTED> wrote:
>> >
>> > Dear list,
>> >
>> >
>> > I've got a problem with Unicode and the xmerl library.
>> >
>> > The input data for xmerl is UTF-8 encoded XML, but the result I get
>> > back is encoded as latin1. But I need UTF-8!
>> >
>> >
>> > EXAMPLES
>> >
>> > Body = <<"<?xml version=\"1.0\" encoding=\"UTF-8\"?><response><value>René</value></response>"/utf8>>.
>> >
>> > For the sake of portability, here is term_to_binary(Body):
>> >
>> > <<131,109,0,0,0,79,60,63,120,109,108,32,118,101,114,115,
>> >   105,111,110,61,34,49,46,48,34,32,101,110,99,111,100,105,
>> >   110,103,61,34,85,84,70,45,56,34,63,62,60,114,101,115,
>> >   112,111,110,115,101,62,60,118,97,108,117,101,62,82,101,
>> >   110,195,169,60,47,118,97,108,117,101,62,60,47,114,101,
>> >   115,112,111,110,115,101,62>>
>> >
>> >
>> >
>> > (1):
>> >
>> > When I do
>> >
>> > xmerl_scan:string(binary_to_list(Body)).
>> >
>> > it returns
>> >
>> > {#xmlElement{name = response,expanded_name = response,
>> >              nsinfo = [],
>> >              namespace = #xmlNamespace{default = [],nodes = []},
>> >              parents = [],pos = 1,attributes = [],
>> >              content = [#xmlElement{name = value,expanded_name = value,
>> >                                     nsinfo = [],
>> >                                     namespace = #xmlNamespace{default = [],nodes = []},
>> >                                     parents = [{response,1}],
>> >                                     pos = 1,attributes = [],
>> >                                     content = [#xmlText{parents = [{value,1},{response,1}],
>> >                                                         pos = 1,language = [],
>> >                                                         value = "René",
>> >                                                         type = text}],
>> >                                     language = [],xmlbase = "/Users/aturkin/ws/",
>> >                                     elementdef = undeclared}],
>> >              language = [],xmlbase = "/Users/aturkin/ws/",
>> >              elementdef = undeclared},
>> >  []}
>> >
>> >
>> > Note the `value = "René"` string: it uses the single symbol [233],
>> > which is latin1.
>> >
>> >
>> >
>> >
>> > (2):
>> >
>> > xmerl_scan:string(xmerl_ucs:to_utf8(binary_to_list(Body)))
>> >
>> > returns
>> >
>> > {#xmlElement{name = response,expanded_name = response,
>> >              nsinfo = [],
>> >              namespace = #xmlNamespace{default = [],nodes = []},
>> >              parents = [],pos = 1,attributes = [],
>> >              content = [#xmlElement{name = value,expanded_name = value,
>> >                                     nsinfo = [],
>> >                                     namespace = #xmlNamespace{default = [],nodes = []},
>> >                                     parents = [{response,1}],
>> >                                     pos = 1,attributes = [],
>> >                                     content = [#xmlText{parents = [{value,1},{response,1}],
>> >                                                         pos = 1,language = [],
>> >                                                         value = "RenÃ©",
>> >                                                         type = text}],
>> >                                     language = [],xmlbase = "/Users/aturkin/ws/",
>> >                                     elementdef = undeclared}],
>> >              language = [],xmlbase = "/Users/aturkin/ws/",
>> >              elementdef = undeclared},
>> >  []}
>> >
>> > Now `value = "RenÃ©"`, so 2 bytes are used to encode this symbol, and
>> > this is utf-8.
>> >
>> > So in (2) I get what I need, but why do I need to force that
>> > conversion for xmerl?
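
A hedged sketch of what seems to happen in (2), assuming xmerl_ucs:to_utf8/1 treats every element of the byte list as a code point and UTF-8-encodes it again. The sketch uses the standard unicode module to reproduce that effect; the module name double_encode_demo is hypothetical.

```erlang
-module(double_encode_demo).
-export([run/0]).

run() ->
    %% "René" as UTF-8 bytes, i.e. what binary_to_list(Body) contains
    %% for the text node.
    Utf8Bytes = [82,101,110,195,169],
    %% Treating each byte as a latin1 character and encoding to UTF-8
    %% a second time turns 195 into 195,131 and 169 into 194,169.
    Twice = unicode:characters_to_binary(Utf8Bytes, latin1, utf8),
    <<82,101,110,195,131,194,169>> = Twice,
    %% Decoding UTF-8 once (as a parser would) gives back the original
    %% bytes as "characters", which print as "RenÃ©".
    Utf8Bytes = unicode:characters_to_list(Twice, utf8),
    Utf8Bytes.
```

Under that assumption the input in (2) is encoded twice and decoded once, which would explain why the bytes 195,169 survive in the result.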
>> >
>> >
>> >
>> >
>> > QUESTIONS
>> >
>> > 1. I don't understand why xmerl_scan allows you to set the input
>> > encoding, but there seems to be no way to set the output encoding. Is
>> > there any way to make xmerl_scan return utf8 instead of latin1?
>> >
>> > 2. How does it happen that in (1) it converts utf-8 -> latin1, and in
>> > (2) the result is utf-8?
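
One possible way to get UTF-8 bytes out of the result in (1), as a minimal sketch: it assumes the text value returned by xmerl_scan is a list of code points, as the output in (1) suggests, and converts it with unicode:characters_to_binary/3. The module and function names are hypothetical, and the pattern match only fits the small example document above.

```erlang
-module(xmerl_utf8_demo).
-export([value_as_utf8/1]).
-include_lib("xmerl/include/xmerl.hrl").

%% Parse the example document and return the text of <value> as a
%% UTF-8 encoded binary.
value_as_utf8(Body) when is_binary(Body) ->
    {Root, _Rest} = xmerl_scan:string(binary_to_list(Body)),
    %% <response> holds one <value> element, which holds one text node.
    [#xmlElement{content = [#xmlText{value = Chars}]}] =
        Root#xmlElement.content,
    %% Chars is a list of code points, e.g. [82,101,110,233].
    unicode:characters_to_binary(Chars, unicode, utf8).
```

Called with the Body from the example, this should give <<82,101,110,195,169>>, i.e. é as the two bytes c3 a9.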
>> >
>> >
>> >
>> >
>> > --
>> > Best Regards,
>> > Alex Turkin
>
>

-- 
Best Regards,
Alex Turkin