[erlang-questions] utf-8 and xmerl
Hynek Vychodil
vychodil.hynek@REDACTED
Tue Aug 18 18:17:30 CEST 2015
So then the problem is in jsx and mochijson. The result returned by
xmerl is of type unicode:chardata(), which is the new standard for io
operations (see the io module for an example). For old, incompatible
modules you can convert it with unicode:characters_to_list/1, and file
a bug report or feature request against them.
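
A minimal sketch of that conversion follows. The module name
xmerl_utf8_demo and the helpers text_values/1 and to_utf8/1 are made up
for illustration. It assumes the JSON encoder wants UTF-8 binaries, which
is why it uses unicode:characters_to_binary/1 instead of
unicode:characters_to_list/1.

```erlang
-module(xmerl_utf8_demo).
-export([text_values/1, to_utf8/1, demo/0]).

%% Record definitions for #xmlElement{} and #xmlText{}.
-include_lib("xmerl/include/xmerl.hrl").

%% Collect the value of every #xmlText{} node in document order.
%% xmerl stores each value as a list of Unicode code points,
%% not as UTF-8 bytes.
text_values(#xmlElement{content = Content}) ->
    lists:flatmap(fun text_values/1, Content);
text_values(#xmlText{value = Value}) ->
    [Value];
text_values(_Other) ->
    [].

%% Pick the encoding "on the way out": code points -> UTF-8 binary.
to_utf8(Chars) ->
    unicode:characters_to_binary(Chars).

%% Hypothetical driver using the document from this thread.
%% 233 is the code point of the letter e with acute accent.
demo() ->
    Body = unicode:characters_to_binary(
             ["<?xml version=\"1.0\" encoding=\"UTF-8\"?>",
              "<response><value>Ren", 233, "</value></response>"]),
    {Doc, _Rest} = xmerl_scan:string(binary_to_list(Body)),
    [Value] = text_values(Doc),
    {Value, to_utf8(Value)}.
```

Calling xmerl_utf8_demo:demo() should return the code point list together
with its UTF-8 encoding, so the accented letter shows up as 233 in the
first element and as the two bytes 195,169 in the second.
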
On Tue, Aug 18, 2015 at 5:29 PM, Alexander Turkin <snowwlex@REDACTED>
wrote:
>> Why do you want to have utf8 in the values ? unicode is much more
>> generic and you select the encoding on the way out!
>
>
> The thing is that I need to convert it to json, and mochijson (and jsx as
> well) doesn't understand this iso-10646. And when mochijson gets
> something that is not utf-8, it throws a `{ucs,{bad_utf8_character_code}}` error.
>
>
>
>
> On 17 August 2015 at 16:26, Tony Rogvall <tony@REDACTED> wrote:
>
>> Ok I see.
>>
>> So you expected to find utf8 in the text value instead of the unicode
>> code point (233 is the same in latin1 and unicode, btw).
>> But that is not how xmerl works. It represents the characters as
>> unicode (iso-10646) code points.
>>
>> Here is an example to get you a utf8 output. Bin is your binary.
>>
>> Term = binary_to_term(Bin).
>> {Content,_} = xmerl_scan:string(binary_to_list(Term)).
>> UnicodeChars = xmerl:export([Content], xmerl_xml).
>> Utf8Bin = unicode:characters_to_binary(UnicodeChars).
>>
>> I guess you could scan the xml structure (Content) and convert the text
>> values
>> to utf8 strings. But that would complicate the process when you want
>> to format the output.
>>
>> Why do you want to have utf8 in the values ? unicode is much more
>> generic and you select the encoding on the way out!
>>
>> /Tony
>>
>>
>> > On 17 aug 2015, at 15:56, Alexander Turkin <snowwlex@REDACTED> wrote:
>> >
>> > Hey Tony,
>> >
>> > Yes, this binary is in utf8, and that is what is being fed to the xmerl
>> > library, which for some reason returns it in a different encoding.
>> >
>> > On 17 August 2015 at 13:55, Tony Rogvall <tony@REDACTED> wrote:
>> > Sorry for the empty message :-)
>> >
>> > But the coding you are looking for is already in your binary.
>> >
>> > <<131,109,0,0,0,79,60,63,120,109,108,32,118,101,114,115,
>> > 105,111,110,61,34,49,46,48,34,32,101,110,99,111,100,105,
>> > 110,103,61,34,85,84,70,45,56,34,63,62,60,114,101,115,
>> > 112,111,110,115,101,62,60,118,97,108,117,101,62,82,101,
>> > 110,195,169,60,47,118,97,108,117,101,62,60,47,114,101,
>> > 115,112,111,110,115,101,62>>
>> >
>> > 195,169 = c3 a9
>> >
>> > /Tony
>> >
>> >
>> >> On 17 aug 2015, at 14:39, Alexander Turkin <snowwlex@REDACTED> wrote:
>> >>
>> >> Hi Hynek,
>> >>
>> >> On 15 August 2015 at 08:30, Hynek Vychodil <vychodil.hynek@REDACTED>
>> wrote:
>> >> The same result is in R18 and it is the correct result. The letter é has
>> >> unicode code point 233, see http://unicode-table.com/en/#00E9
>> >>
>> >> Yeah, it has code point U+00E9 (= 233), but it is encoded as 2
>> >> bytes in utf8: c3 a9
>> >>
>> >> U+00E9 é c3 a9 LATIN SMALL LETTER E WITH ACUTE
>> >>
>> >>
>> >> (http://www.utf8-chartable.de/)
>> >>
>> >> On Fri, Aug 14, 2015 at 6:23 PM, Éric Pailleau <
>> eric.pailleau@REDACTED> wrote:
>> >> Hello,
>> >> Please specify which Erlang release you are using. Utf8 support arrived
>> >> late in Erlang.
>> >> Regards
>> >>
>> >> On 14 August 2015 at 13:49, Alexander Turkin <snowwlex@REDACTED> wrote:
>> >> >
>> >> > Dear list,
>> >> >
>> >> >
>> >> > I've got a problem with unicode and the xmerl library.
>> >> >
>> >> > The input data for xmerl is utf-8 encoded xml, but what I get as the
>> >> > result is encoded in latin1. But I need utf8!
>> >> >
>> >> >
>> >> > EXAMPLES
>> >> >
>> >> > Body = <<"<?xml version=\"1.0\" encoding=\"UTF-8\"?><response><value>René</value></response>"/utf8>>.
>> >> >
>> >> > For the sake of portability, here is term_to_binary(Body):
>> >> >
>> >> > <<131,109,0,0,0,79,60,63,120,109,108,32,118,101,114,115,
>> >> > 105,111,110,61,34,49,46,48,34,32,101,110,99,111,100,105,
>> >> > 110,103,61,34,85,84,70,45,56,34,63,62,60,114,101,115,
>> >> > 112,111,110,115,101,62,60,118,97,108,117,101,62,82,101,
>> >> > 110,195,169,60,47,118,97,108,117,101,62,60,47,114,101,
>> >> > 115,112,111,110,115,101,62>>
>> >> >
>> >> >
>> >> >
>> >> > (1):
>> >> >
>> >> > When I do
>> >> >
>> >> > xmerl_scan:string(binary_to_list(Body)).
>> >> >
>> >> > it returns
>> >> >
>> >> > {#xmlElement{name = response,expanded_name = response,
>> >> >              nsinfo = [],
>> >> >              namespace = #xmlNamespace{default = [],nodes = []},
>> >> >              parents = [],pos = 1,attributes = [],
>> >> >              content = [#xmlElement{name = value,expanded_name = value,
>> >> >                                     nsinfo = [],
>> >> >                                     namespace = #xmlNamespace{default = [],nodes = []},
>> >> >                                     parents = [{response,1}],
>> >> >                                     pos = 1,attributes = [],
>> >> >                                     content = [#xmlText{parents = [{value,1},{response,1}],
>> >> >                                                         pos = 1,language = [],
>> >> >                                                         value = "René",
>> >> >                                                         type = text}],
>> >> >                                     language = [],xmlbase = "/Users/aturkin/ws/",
>> >> >                                     elementdef = undeclared}],
>> >> >              language = [],xmlbase = "/Users/aturkin/ws/",
>> >> >              elementdef = undeclared},
>> >> >  []}
>> >> >
>> >> >
>> >> > So, note the `value = "René"` string: it uses the [233] symbol,
>> >> > which is latin1.
>> >> >
>> >> >
>> >> >
>> >> >
>> >> > (2):
>> >> >
>> >> > xmerl_scan:string(xmerl_ucs:to_utf8(binary_to_list(Body)))
>> >> >
>> >> > returns
>> >> >
>> >> > {#xmlElement{name = response,expanded_name = response,
>> >> >              nsinfo = [],
>> >> >              namespace = #xmlNamespace{default = [],nodes = []},
>> >> >              parents = [],pos = 1,attributes = [],
>> >> >              content = [#xmlElement{name = value,expanded_name = value,
>> >> >                                     nsinfo = [],
>> >> >                                     namespace = #xmlNamespace{default = [],nodes = []},
>> >> >                                     parents = [{response,1}],
>> >> >                                     pos = 1,attributes = [],
>> >> >                                     content = [#xmlText{parents = [{value,1},{response,1}],
>> >> >                                                         pos = 1,language = [],
>> >> >                                                         value = "RenÃ©",
>> >> >                                                         type = text}],
>> >> >                                     language = [],xmlbase = "/Users/aturkin/ws/",
>> >> >                                     elementdef = undeclared}],
>> >> >              language = [],xmlbase = "/Users/aturkin/ws/",
>> >> >              elementdef = undeclared},
>> >> >  []}
>> >> >
>> >> > Now `value = "RenÃ©"`, so 2 bytes (195,169) are used to encode this
>> >> > symbol, and this is utf-8.
>> >> >
>> >> > So in (2) I get what I need, but why do I need to force that conversion
>> >> > for xmerl?
>> >> >
>> >> >
>> >> >
>> >> >
>> >> > QUESTIONS
>> >> >
>> >> > 1. I don't understand why xmerl_scan allows you to set the input
>> >> > encoding, but there seems to be no way to set the output encoding. Is
>> >> > there any way to make xmerl_scan return utf8 instead of latin1?
>> >> >
>> >> > 2. How does it happen that in (1) it does the conversion utf-8 ->
>> >> > latin1, and in (2) the result is utf-8?
>> >> >
>> >> >
>> >> >
>> >> >
>> >> > --
>> >> > Best Regards,
>> >> > Alex Turkin
>> >> _______________________________________________
>> >> erlang-questions mailing list
>> >> erlang-questions@REDACTED
>> >> http://erlang.org/mailman/listinfo/erlang-questions