[erlang-questions] utf-8 and xmerl

Tony Rogvall tony@REDACTED
Mon Aug 17 17:26:05 CEST 2015


Ok I see.

So you expected to find UTF-8 in the text value instead of Unicode code points (233 is the same in Latin-1 and Unicode, by the way).
But that is not how xmerl works. It represents characters as Unicode (ISO 10646) code points.
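
The distinction between code points and encoded bytes can be checked in the Erlang shell. This is only a sketch using the standard unicode module and xmerl_ucs; the variable names are made up for illustration. The last expression shows what happens when UTF-8 bytes are mistaken for code points and encoded a second time, which is what case (2) in the quoted message does before scanning.

```erlang
%% "René" as UTF-8 bytes: é is the two bytes 195,169 (c3 a9).
Utf8Bytes = <<82,101,110,195,169>>.

%% Decoding UTF-8 gives one code point per character: [82,101,110,233].
CodePoints = unicode:characters_to_list(Utf8Bytes, utf8).

%% Decoding the Latin-1 bytes for "René" gives the same list, because
%% Latin-1 byte values coincide with the first 256 Unicode code points.
Latin1Points = unicode:characters_to_list(<<82,101,110,233>>, latin1).
CodePoints =:= Latin1Points.

%% Treating the UTF-8 bytes as code points and encoding them again
%% double-encodes é: 195 -> 195,131 and 169 -> 194,169.
DoubleEncoded = xmerl_ucs:to_utf8(binary_to_list(Utf8Bytes)).
```

So a 233 in the text value does not by itself mean "latin1"; it is the code point U+00E9, and no byte encoding has been chosen yet.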

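Here is an example that gets you UTF-8 output. Bin is your binary.

Term = binary_to_term(Bin).                              % the UTF-8 encoded XML binary
{Content,_} = xmerl_scan:string(binary_to_list(Term)).   % text values become code point lists
UnicodeChars = xmerl:export([Content], xmerl_xml).       % deep list of Unicode characters
Utf8Bin = unicode:characters_to_binary(UnicodeChars).    % encode as UTF-8 on the way out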
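
To try the same steps without the term_to_binary round trip, the document from the thread can be written out directly. This is a sketch for the shell; Body and Needle are illustrative names, and the é is spelled as explicit bytes so the result does not depend on the terminal encoding.

```erlang
%% The XML document from the thread, with é written as the bytes 195,169.
Body = <<"<?xml version=\"1.0\" encoding=\"UTF-8\"?>",
         "<response><value>Ren", 195, 169, "</value></response>">>.

%% Scan the bytes, export the tree, and encode the characters as UTF-8.
{Content, _Rest} = xmerl_scan:string(binary_to_list(Body)).
Utf8Bin = unicode:characters_to_binary(xmerl:export([Content], xmerl_xml)).

%% Look for the UTF-8 form of "René" in the output; anything other
%% than nomatch means é was written as the two bytes c3 a9.
Needle = <<"Ren", 195, 169, "</value>">>.
binary:match(Utf8Bin, Needle) =/= nomatch.
```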

I guess you could scan the XML structure (Content) and convert the text
values to UTF-8 strings. But that would complicate things when you want
to format the output.
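
A minimal sketch of such a scan, assuming only element content and text nodes need handling. The module name xml_text_utf8 and the function to_utf8/1 are hypothetical; the record definitions come from xmerl's own header file.

```erlang
-module(xml_text_utf8).
-export([to_utf8/1]).

%% Record definitions for #xmlElement{} and #xmlText{}.
-include_lib("xmerl/include/xmerl.hrl").

%% Walk the parsed tree and replace every text value (a list of
%% Unicode code points) with a UTF-8 encoded binary.
to_utf8(#xmlElement{content = Content} = Element) ->
    Element#xmlElement{content = [to_utf8(Node) || Node <- Content]};
to_utf8(#xmlText{value = Value} = Text) ->
    Text#xmlText{value = unicode:characters_to_binary(Value)};
to_utf8(Other) ->
    %% Comments, processing instructions and so on are left as they are.
    Other.
```

Calling xml_text_utf8:to_utf8(Content) returns the same tree with binaries in the text nodes. Attribute values are not touched by this sketch, and the thread does not confirm that the export callbacks accept binary text values, so the converted tree may no longer be suitable for xmerl:export.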

Why do you want UTF-8 in the values? Unicode is much more
generic, and you select the encoding on the way out!
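
As an illustration of choosing the encoding on the way out, the same list of code points can be encoded in several ways with unicode:characters_to_binary/3. The variable names here are only for the example.

```erlang
%% "René" as Unicode code points, the form xmerl puts in text values.
Chars = [82,101,110,233].

%% UTF-8: é becomes two bytes, <<82,101,110,195,169>>.
Utf8 = unicode:characters_to_binary(Chars, unicode, utf8).

%% Latin-1: é stays a single byte, <<82,101,110,233>>.
Latin1 = unicode:characters_to_binary(Chars, unicode, latin1).

%% UTF-16 big endian: two bytes per character, <<0,82,0,101,0,110,0,233>>.
Utf16 = unicode:characters_to_binary(Chars, unicode, {utf16, big}).
```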

/Tony


> On 17 aug 2015, at 15:56, Alexander Turkin <snowwlex@REDACTED> wrote:
> 
> Hey Tony,
> 
> Yes, this binary is in utf8, and it is what is being fed to the xmerl library, which for some reason returns it in another encoding.
> 
> On 17 August 2015 at 13:55, Tony Rogvall <tony@REDACTED> wrote:
> Sorry for the empty message :-)
> 
> But the encoding you are looking for is already in your binary.
> 
>  <<131,109,0,0,0,79,60,63,120,109,108,32,118,101,114,115,
>    105,111,110,61,34,49,46,48,34,32,101,110,99,111,100,105,
>    110,103,61,34,85,84,70,45,56,34,63,62,60,114,101,115,
>    112,111,110,115,101,62,60,118,97,108,117,101,62,82,101,
>    110,195,169,60,47,118,97,108,117,101,62,60,47,114,101,
>    115,112,111,110,115,101,62>>
> 
> 195,169  = c3 a9
> 
> /Tony
> 
> 
>> On 17 aug 2015, at 14:39, Alexander Turkin <snowwlex@REDACTED> wrote:
>> 
>> Hi Hynek,
>> 
>> On 15 August 2015 at 08:30, Hynek Vychodil <vychodil.hynek@REDACTED> wrote:
>> The result is the same in R18, and it is the correct result. The letter é has Unicode code point 233; see http://unicode-table.com/en/#00E9
>> 
>> Yeah, its code point is U+00E9 (= 233), but in utf8 it is encoded as 2 bytes: c3 a9
>> 
>> U+00E9	é	c3 a9	LATIN SMALL LETTER E WITH ACUTE
>> 
>> 
>> (http://www.utf8-chartable.de/)
>> 
>> On Fri, Aug 14, 2015 at 6:23 PM, Éric Pailleau <eric.pailleau@REDACTED> wrote:
>> Hello,
>> Please specify which Erlang release you are using. UTF-8 support came late to Erlang.
>> Regards
>> 
>> On 14 August 2015 at 13:49, Alexander Turkin <snowwlex@REDACTED> wrote:
>> >
>> > Dear list,
>> >
>> >
>> > I've got a problem with unicode & xmerl library.
>> >
>> > The input data for xmerl is utf-8 encoded XML, but the result I get is encoded in latin1. I need utf8!
>> >
>> >
>> > EXAMPLES
>> >
>> > Body = <<"<?xml version=\"1.0\" encoding=\"UTF-8\"?><response><value>René</value></response>"/utf8>>.
>> >
>> > For the sake of portability, here is term_to_binary(Body):
>> >
>> > <<131,109,0,0,0,79,60,63,120,109,108,32,118,101,114,115,
>> >   105,111,110,61,34,49,46,48,34,32,101,110,99,111,100,105,
>> >   110,103,61,34,85,84,70,45,56,34,63,62,60,114,101,115,
>> >   112,111,110,115,101,62,60,118,97,108,117,101,62,82,101,
>> >   110,195,169,60,47,118,97,108,117,101,62,60,47,114,101,
>> >   115,112,111,110,115,101,62>>
>> >
>> >
>> >
>> > (1):
>> >
>> > When I do
>> >
>> > xmerl_scan:string(binary_to_list(Body)).
>> >
>> > it returns
>> >
>> > {#xmlElement{name = response,expanded_name = response,
>> >              nsinfo = [],
>> >              namespace = #xmlNamespace{default = [],nodes = []},
>> >              parents = [],pos = 1,attributes = [],
>> >              content = [#xmlElement{name = value,expanded_name = value,
>> >                                     nsinfo = [],
>> >                                     namespace = #xmlNamespace{default = [],nodes = []},
>> >                                     parents = [{response,1}],
>> >                                     pos = 1,attributes = [],
>> >                                     content = [#xmlText{parents = [{value,1},{response,1}],
>> >                                                         pos = 1,language = [],
>> >
>> >
>> >                                                         value = "René",
>> >
>> >
>> >                                                         type = text}],
>> >                                     language = [],xmlbase = "/Users/aturkin/ws/",
>> >                                     elementdef = undeclared}],
>> >              language = [],xmlbase = "/Users/aturkin/ws/",
>> >              elementdef = undeclared},
>> >  []}
>> >
>> >
>> > So, note the `value = "René"` string: it uses the [233] symbol, which is latin1.
>> >
>> >
>> >
>> >
>> > (2):
>> >
>> > xmerl_scan:string(xmerl_ucs:to_utf8(binary_to_list(Body)))
>> >
>> > returns
>> >
>> > {#xmlElement{name = response,expanded_name = response,
>> >              nsinfo = [],
>> >              namespace = #xmlNamespace{default = [],nodes = []},
>> >              parents = [],pos = 1,attributes = [],
>> >              content = [#xmlElement{name = value,expanded_name = value,
>> >                                     nsinfo = [],
>> >                                     namespace = #xmlNamespace{default = [],nodes = []},
>> >                                     parents = [{response,1}],
>> >                                     pos = 1,attributes = [],
>> >                                     content = [#xmlText{parents = [{value,1},{response,1}],
>> >                                                         pos = 1,language = [],
>> >
>> >
>> >                                                         value = "René",
>> >
>> >
>> >                                                         type = text}],
>> >                                     language = [],xmlbase = "/Users/aturkin/ws/",
>> >                                     elementdef = undeclared}],
>> >              language = [],xmlbase = "/Users/aturkin/ws/",
>> >              elementdef = undeclared},
>> >  []}
>> >
>> > Now `value = "René"`, so 2 bytes are used to encode this symbol, and this is utf-8.
>> >
>> > So in (2) I get what I need, but why do I need to force that conversion for xmerl?
>> >
>> >
>> >
>> >
>> > QUESTIONS
>> >
>> > 1. I don't understand why xmerl_scan lets you set the input encoding, but there seems to be no way to set the output encoding. Is there any way to make xmerl_scan return utf8 instead of latin1?
>> >
>> > 2. How does it happen that in (1) it converts utf-8 -> latin1, while in (2) the result is utf-8?
>> >
>> >
>> >
>> >
>> > --
>> > Best Regards,
>> > Alex Turkin
>> _______________________________________________
>> erlang-questions mailing list
>> erlang-questions@REDACTED
>> http://erlang.org/mailman/listinfo/erlang-questions
