<html><head><meta http-equiv="Content-Type" content="text/html; charset=utf-8"></head><body style="word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space;" class=""><div class=""><br class=""><div><blockquote type="cite" class=""><div class="">On 17 Aug 2015, at 14:39, Alexander Turkin <<a href="mailto:snowwlex@gmail.com" class="">snowwlex@gmail.com</a>> wrote:</div><br class="Apple-interchange-newline"><div class=""><div dir="ltr" class=""><div class="gmail_extra">Hi Hynek, </div><div class="gmail_extra"><br class=""><div class="gmail_quote">On 15 August 2015 at 08:30, Hynek Vychodil <span dir="ltr" class=""><<a href="mailto:vychodil.hynek@gmail.com" target="_blank" class="">vychodil.hynek@gmail.com</a>></span> wrote:<br class=""><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr" class="">R18 gives the same result, and it is the correct result. The letter é has Unicode code point 233; see <a href="http://unicode-table.com/en/#00E9" target="_blank" class="">http://unicode-table.com/en/#00E9</a></div></blockquote><div class=""><br class=""></div><div class="">Yes, its code point is U+00E9 (= 233), but it is encoded as 2 bytes in UTF-8: c3 a9 </div><div class=""><br class=""></div><div class=""><table class="codetable" style="border-collapse: collapse; font-family: Times; font-size: inherit;"><tbody class=""><tr class="cod" style="background-color:rgb(248,248,248)"><td class="cpt" style="text-align:center;border:1px solid black;padding-left:0.5em;padding-right:0.5em">U+00E9</td><td class="char" style="text-align:center;border:1px solid black;padding-left:0.5em;padding-right:0.5em">é</td><td class="utf8" style="text-align:center;border:1px solid black;padding-left:0.5em;padding-right:0.5em">c3 a9</td><td class="name" style="margin-left:1.5em;border:1px solid black;padding-left:0.5em;padding-right:0.5em">LATIN SMALL LETTER E WITH ACUTE<br class=""></td></tr></tbody></table></div><div 
class=""> </div><div class=""><br class=""></div><div class="">(<a href="http://www.utf8-chartable.de/" class="">http://www.utf8-chartable.de/</a>) </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div class="gmail_extra"><br class=""><div class="gmail_quote"><div class=""><div class="h5">On Fri, Aug 14, 2015 at 6:23 PM, Éric Pailleau <span dir="ltr" class=""><<a href="mailto:eric.pailleau@wanadoo.fr" target="_blank" class="">eric.pailleau@wanadoo.fr</a>></span> wrote:<br class=""></div></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div class=""><div class="h5">Hello,<br class="">
Please specify which Erlang release you are using. UTF-8 support came late to Erlang.<br class="">
Regards<br class="">
<div class=""><div class=""><br class="">
On 14 August 2015 at 13:49, Alexander Turkin <<a href="mailto:snowwlex@gmail.com" target="_blank" class="">snowwlex@gmail.com</a>> wrote:<br class="">
><br class="">
> Dear list,<br class="">
><br class="">
><br class="">
> I've got a problem with Unicode and the xmerl library.<br class="">
><br class="">
> The input to xmerl is UTF-8 encoded XML, but the result I get is encoded as latin1. I need UTF-8!<br class="">
><br class="">
><br class="">
> EXAMPLES<br class="">
><br class="">
> Body = <<"<?xml version=\"1.0\" encoding=\"UTF-8\"?><response><value>René</value></response>"/utf8>>.<br class="">
><br class="">
> For the sake of portability, here is term_to_binary(Body): <br class="">
><br class="">
> <<131,109,0,0,0,79,60,63,120,109,108,32,118,101,114,115,<br class="">
>   105,111,110,61,34,49,46,48,34,32,101,110,99,111,100,105,<br class="">
>   110,103,61,34,85,84,70,45,56,34,63,62,60,114,101,115,<br class="">
>   112,111,110,115,101,62,60,118,97,108,117,101,62,82,101,<br class="">
>   110,195,169,60,47,118,97,108,117,101,62,60,47,114,101,<br class="">
>   115,112,111,110,115,101,62>><br class="">
><br class="">
><br class="">
><br class="">
> (1):<br class="">
><br class="">
> When I do <br class="">
><br class="">
> xmerl_scan:string(binary_to_list(Body)).<br class="">
><br class="">
> it returns <br class="">
><br class="">
> {#xmlElement{name = response,expanded_name = response,<br class="">
>              nsinfo = [],<br class="">
>              namespace = #xmlNamespace{default = [],nodes = []},<br class="">
>              parents = [],pos = 1,attributes = [],<br class="">
>              content = [#xmlElement{name = value,expanded_name = value,<br class="">
>                                     nsinfo = [],<br class="">
>                                     namespace = #xmlNamespace{default = [],nodes = []},<br class="">
>                                     parents = [{response,1}],<br class="">
>                                     pos = 1,attributes = [],<br class="">
>                                     content = [#xmlText{parents = [{value,1},{response,1}],<br class="">
>                                                         pos = 1,language = [],<br class="">
><br class="">
><br class="">
>                                                         value = "René",<br class="">
><br class="">
><br class="">
>                                                         type = text}],<br class="">
>                                     language = [],xmlbase = "/Users/aturkin/ws/",<br class="">
>                                     elementdef = undeclared}],<br class="">
>              language = [],xmlbase = "/Users/aturkin/ws/",<br class="">
>              elementdef = undeclared},<br class="">
>  []}<br class="">
><br class="">
><br class="">
> So note the `value = "René"` string: it uses the character [233], which is latin1.<br class="">
><br class="">
><br class="">
><br class="">
><br class="">
> (2):<br class="">
><br class="">
> xmerl_scan:string(xmerl_ucs:to_utf8(binary_to_list(Body)))<br class="">
><br class="">
> returns <br class="">
><br class="">
> {#xmlElement{name = response,expanded_name = response,<br class="">
>              nsinfo = [],<br class="">
>              namespace = #xmlNamespace{default = [],nodes = []},<br class="">
>              parents = [],pos = 1,attributes = [],<br class="">
>              content = [#xmlElement{name = value,expanded_name = value,<br class="">
>                                     nsinfo = [],<br class="">
>                                     namespace = #xmlNamespace{default = [],nodes = []},<br class="">
>                                     parents = [{response,1}],<br class="">
>                                     pos = 1,attributes = [],<br class="">
>                                     content = [#xmlText{parents = [{value,1},{response,1}],<br class="">
>                                                         pos = 1,language = [],<br class="">
><br class="">
><br class="">
>                                                         value = "René",<br class="">
><br class="">
><br class="">
>                                                         type = text}],<br class="">
>                                     language = [],xmlbase = "/Users/aturkin/ws/",<br class="">
>                                     elementdef = undeclared}],<br class="">
>              language = [],xmlbase = "/Users/aturkin/ws/",<br class="">
>              elementdef = undeclared},<br class="">
>  []}<br class="">
><br class="">
> Now `value = "René"`, so 2 bytes are used to encode this character, and this is UTF-8.<br class="">
><br class="">
> So in (2) I get what I need, but why do I need to force that conversion for xmerl? <br class="">
><br class="">
><br class="">
><br class="">
><br class="">
> QUESTIONS<br class="">
><br class="">
> 1. I don't understand why xmerl_scan lets you set the input encoding but apparently offers no way to set the output encoding. Is there any way to make xmerl_scan return UTF-8 instead of latin1?<br class="">
><br class="">
> 2. How does it happen that (1) converts UTF-8 -> latin1, while (2) yields UTF-8?<br class="">
><br class="">
><br class="">
><br class="">
><br class="">
> --<br class="">
> Best Regards,<br class="">
> Alex Turkin<br class="">
</div></div></div></div>_______________________________________________<br class="">
erlang-questions mailing list<br class="">
<a href="mailto:erlang-questions@erlang.org" target="_blank" class="">erlang-questions@erlang.org</a><br class="">
<a href="http://erlang.org/mailman/listinfo/erlang-questions" rel="noreferrer" target="_blank" class="">http://erlang.org/mailman/listinfo/erlang-questions</a><br class="">
</blockquote></div><br class=""></div>
</blockquote></div><br class=""><br clear="all" class=""><div class=""><br class=""></div>-- <br class=""><div class="gmail_signature">Best Regards,<div class="">Alex Turkin</div></div>
</div></div>
_______________________________________________<br class="">erlang-questions mailing list<br class=""><a href="mailto:erlang-questions@erlang.org" class="">erlang-questions@erlang.org</a><br class="">http://erlang.org/mailman/listinfo/erlang-questions<br class=""></div></blockquote></div><br class=""></div></body></html>