<div dir="ltr"><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex"><span style="color:rgb(0,0,0);font-size:12.8000001907349px">Why do you want to have utf8 in the values ? unicode is much more<br></span><span style="color:rgb(0,0,0);font-size:12.8000001907349px">generic and you select the encoding on the way out!</span></blockquote><div><br></div><div>The thing is that I need to convert it to json, and mochijson (and jsx as well) don't understand this<span style="color:rgb(0,0,0);font-size:12.8000001907349px"><b> iso-10646 </b>representation. When mochijson gets something that is not utf-8, it throws a `</span><font color="#000000"><span style="font-size:12.8000001907349px">{ucs,{bad_utf8_character_code}}}` error.</span></font></div><div><font color="#000000"><span style="font-size:12.8000001907349px"><br></span></font></div><div><span style="color:rgb(0,0,0);font-size:12.8000001907349px"><b><br></b></span></div><div><span style="color:rgb(0,0,0);font-size:12.8000001907349px"><b><br></b></span></div></div><div class="gmail_extra"><br><div class="gmail_quote">On 17 August 2015 at 16:26, Tony Rogvall <span dir="ltr"><<a href="mailto:tony@rogvall.se" target="_blank">tony@rogvall.se</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">Ok I see.<br>

<br>

So you expected to find utf8 in the text value instead of unicode code points (233 is the same in latin1 and unicode, btw)<br>

But that is not how xmerl works. It represents characters as unicode (iso-10646) code points.<br>
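<br>
To illustrate the difference between a code point and its utf8 bytes, here is a minimal sketch that uses only the standard unicode module. The module name codepoints_demo and its two function names are made up for this example.<br>

```erlang
-module(codepoints_demo).
-export([utf8_bytes/1, code_points/1]).

%% Encode a list of Unicode code points as a UTF-8 binary.
utf8_bytes(CodePoints) ->
    unicode:characters_to_binary(CodePoints, unicode, utf8).

%% Decode a UTF-8 binary back into a list of Unicode code points.
code_points(Utf8Bin) ->
    unicode:characters_to_list(Utf8Bin, utf8).
```

With this, codepoints_demo:utf8_bytes([82,101,110,233]) gives a binary ending in 195,169: the single code point 233 only becomes two bytes at encoding time.<br>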

<br>

Here is an example to get you a utf8 output. Bin is your binary.<br>

<br>

Term = binary_to_term(Bin).<br>

{Content,_} = xmerl_scan:string(binary_to_list(Term)).<br>

UnicodeChars = xmerl:export([Content], xmerl_xml).<br>

Utf8Bin = unicode:characters_to_binary(UnicodeChars).<br>

<br>

I guess you could scan the xml structure (Content) and convert the text values<br>

to utf8 strings. But that would complicate the process when you want<br>

to format the output.<br>
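<br>
A rough sketch of that scan-and-convert idea, assuming only text nodes need converting (attribute values are left alone). The module name xmerl_utf8 and the function text_to_utf8/1 are hypothetical and not part of xmerl. Once the values are binaries, the tree may no longer export cleanly through xmerl:export.<br>

```erlang
-module(xmerl_utf8).
-export([text_to_utf8/1]).
-include_lib("xmerl/include/xmerl.hrl").

%% Walk an xmerl tree and replace every #xmlText value (a list of
%% Unicode code points) with a UTF-8 encoded binary.
text_to_utf8(#xmlElement{content = Content} = El) ->
    El#xmlElement{content = [text_to_utf8(C) || C <- Content]};
text_to_utf8(#xmlText{value = Value} = Text) ->
    Text#xmlText{value = unicode:characters_to_binary(Value, unicode, utf8)};
text_to_utf8(Other) ->
    %% Comments, processing instructions, etc. are left untouched.
    Other.
```

It would be called on the first element of the tuple returned by xmerl_scan:string/1, and the resulting binaries could then be handed to a JSON encoder that expects utf8.<br>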

<br>

Why do you want to have utf8 in the values ? unicode is much more<br>

generic and you select the encoding on the way out!<br>

<span class="HOEnZb"><font color="#888888"><br>

/Tony<br>

</font></span><div class="HOEnZb"><div class="h5"><br>

<br>

> On 17 aug 2015, at 15:56, Alexander Turkin <<a href="mailto:snowwlex@gmail.com">snowwlex@gmail.com</a>> wrote:<br>

><br>

> Hey Tony,<br>

><br>

> Yes, this binary is in utf8, and it is what is being fed to the xmerl library, which returns it in another encoding for some reason.<br>

><br>

> On 17 August 2015 at 13:55, Tony Rogvall <<a href="mailto:tony@rogvall.se">tony@rogvall.se</a>> wrote:<br>

> Sorry for the empty message :-)<br>

><br>

> But the coding you are looking for is already in your binary.<br>

><br>

>  <<131,109,0,0,0,79,60,63,120,109,108,32,118,101,114,115,<br>

>    105,111,110,61,34,49,46,48,34,32,101,110,99,111,100,105,<br>

>    110,103,61,34,85,84,70,45,56,34,63,62,60,114,101,115,<br>

>    112,111,110,115,101,62,60,118,97,108,117,101,62,82,101,<br>

>    110,195,169,60,47,118,97,108,117,101,62,60,47,114,101,<br>

>    115,112,111,110,115,101,62>><br>

><br>

> 195,169  = c3 a9<br>

><br>

> /Tony<br>

><br>

><br>

>> On 17 aug 2015, at 14:39, Alexander Turkin <<a href="mailto:snowwlex@gmail.com">snowwlex@gmail.com</a>> wrote:<br>

>><br>

>> Hi Hynek,<br>

>><br>

>> On 15 August 2015 at 08:30, Hynek Vychodil <<a href="mailto:vychodil.hynek@gmail.com">vychodil.hynek@gmail.com</a>> wrote:<br>

>> The same result is in R18 and it is the correct result. The letter é has unicode code point 233, see <a href="http://unicode-table.com/en/#00E9" rel="noreferrer" target="_blank">http://unicode-table.com/en/#00E9</a><br>

>><br>

>> Yeah, it has U+00E9 (= 233) code point number, but it is coded in 2 bytes in utf8: c3 a9<br>

>><br>

>> U+00E9       é       c3 a9   LATIN SMALL LETTER E WITH ACUTE<br>

>><br>

>><br>

>> (<a href="http://www.utf8-chartable.de/" rel="noreferrer" target="_blank">http://www.utf8-chartable.de/</a>)<br>

>><br>

>> On Fri, Aug 14, 2015 at 6:23 PM, Éric Pailleau <<a href="mailto:eric.pailleau@wanadoo.fr">eric.pailleau@wanadoo.fr</a>> wrote:<br>

>> Hello,<br>

>> Please specify which Erlang release you are using. Utf8 support came late to Erlang.<br>

>> Regards<br>

>><br>

>> Le 14 août 2015 13:49, Alexander Turkin <<a href="mailto:snowwlex@gmail.com">snowwlex@gmail.com</a>> a écrit :<br>

>> ><br>

>> > Dear list,<br>

>> ><br>

>> ><br>

>> > I've got a problem with unicode & xmerl library.<br>

>> ><br>

>> > Input data for xmerl is utf-8 encoded xml, and what I get as the result is encoded in latin1. But I need utf8!<br>

>> ><br>

>> ><br>

>> > EXAMPLES<br>

>> ><br>

>> > Body = <<"<?xml version=\"1.0\" encoding=\"UTF-8\"?><response><value>René</value></response>"/utf8>>.<br>

>> ><br>

>> > (for the sake of portability here is term_to_binary(Body):<br>

>> ><br>

>> > <<131,109,0,0,0,79,60,63,120,109,108,32,118,101,114,115,<br>

>> >   105,111,110,61,34,49,46,48,34,32,101,110,99,111,100,105,<br>

>> >   110,103,61,34,85,84,70,45,56,34,63,62,60,114,101,115,<br>

>> >   112,111,110,115,101,62,60,118,97,108,117,101,62,82,101,<br>

>> >   110,195,169,60,47,118,97,108,117,101,62,60,47,114,101,<br>

>> >   115,112,111,110,115,101,62>><br>

>> ><br>

>> ><br>

>> ><br>

>> > (1):<br>

>> ><br>

>> > When I do<br>

>> ><br>

>> > xmerl_scan:string(binary_to_list(Body)).<br>

>> ><br>

>> > it returns<br>

>> ><br>

>> > {#xmlElement{name = response,expanded_name = response,<br>

>> >              nsinfo = [],<br>

>> >              namespace = #xmlNamespace{default = [],nodes = []},<br>

>> >              parents = [],pos = 1,attributes = [],<br>

>> >              content = [#xmlElement{name = value,expanded_name = value,<br>

>> >                                     nsinfo = [],<br>

>> >                                     namespace = #xmlNamespace{default = [],nodes = []},<br>

>> >                                     parents = [{response,1}],<br>

>> >                                     pos = 1,attributes = [],<br>

>> >                                     content = [#xmlText{parents = [{value,1},{response,1}],<br>

>> >                                                         pos = 1,language = [],<br>

>> ><br>

>> ><br>

>> >                                                         value = "René",<br>

>> ><br>

>> ><br>

>> >                                                         type = text}],<br>

>> >                                     language = [],xmlbase = "/Users/aturkin/ws/",<br>

>> >                                     elementdef = undeclared}],<br>

>> >              language = [],xmlbase = "/Users/aturkin/ws/",<br>

>> >              elementdef = undeclared},<br>

>> >  []}<br>

>> ><br>

>> ><br>

>> > So, note there is `value = "René"` string, and it uses [233] symbol, which is latin1.<br>

>> ><br>

>> ><br>

>> ><br>

>> ><br>

>> > (2):<br>

>> ><br>

>> > xmerl_scan:string(xmerl_ucs:to_utf8(binary_to_list(Body)))<br>

>> ><br>

>> > returns<br>

>> ><br>

>> > {#xmlElement{name = response,expanded_name = response,<br>

>> >              nsinfo = [],<br>

>> >              namespace = #xmlNamespace{default = [],nodes = []},<br>

>> >              parents = [],pos = 1,attributes = [],<br>

>> >              content = [#xmlElement{name = value,expanded_name = value,<br>

>> >                                     nsinfo = [],<br>

>> >                                     namespace = #xmlNamespace{default = [],nodes = []},<br>

>> >                                     parents = [{response,1}],<br>

>> >                                     pos = 1,attributes = [],<br>

>> >                                     content = [#xmlText{parents = [{value,1},{response,1}],<br>

>> >                                                         pos = 1,language = [],<br>

>> ><br>

>> ><br>

>> >                                                         value = "RenÃ©",<br>

>> ><br>

>> ><br>

>> >                                                         type = text}],<br>

>> >                                     language = [],xmlbase = "/Users/aturkin/ws/",<br>

>> >                                     elementdef = undeclared}],<br>

>> >              language = [],xmlbase = "/Users/aturkin/ws/",<br>

>> >              elementdef = undeclared},<br>

>> >  []}<br>

>> ><br>

>> > Now `value = "RenÃ©"`, so 2 bytes are used to code this symbol, and this is utf-8.<br>

>> ><br>

>> > So in (2) I get what I need, but why do I need to force that conversion for xmerl?<br>

>> ><br>

>> ><br>

>> ><br>

>> ><br>

>> > QUESTIONS<br>

>> ><br>

>> > 1. I don't understand why xmerl_scan allows you to set the input encoding, but there seems to be no way to set the output encoding. Is there any way to make xmerl_scan return utf8 instead of latin1?<br>

>> ><br>

>> > 2. How does it happen that in (1) it does a utf-8 -> latin1 conversion, and in (2) the result is utf-8?<br>

>> ><br>

>> ><br>

>> ><br>

>> ><br>

>> > --<br>

>> > Best Regards,<br>

>> > Alex Turkin<br>

>> _______________________________________________<br>

>> erlang-questions mailing list<br>

>> <a href="mailto:erlang-questions@erlang.org">erlang-questions@erlang.org</a><br>

>> <a href="http://erlang.org/mailman/listinfo/erlang-questions" rel="noreferrer" target="_blank">http://erlang.org/mailman/listinfo/erlang-questions</a><br>

>><br>

>><br>

>><br>

>><br>

>> --<br>

>> Best Regards,<br>

>> Alex Turkin<br>


><br>

><br>

><br>

><br>

> --<br>

> Best Regards,<br>

> Alex Turkin<br>

<br>

</div></div></blockquote></div><br><br clear="all"><div><br></div>-- <br><div class="gmail_signature">Best Regards,<div>Alex Turkin</div></div>

</div>