<div dir="ltr"><div>Hi Loïc,</div><div><br></div>It is a valid Unicode code point (<a href="http://en.wikipedia.org/wiki/%C3%9C">http://en.wikipedia.org/wiki/%C3%9C</a>). If it weren't, I would have expected either the output of line 1 (which says it is a UTF-8 encoded binary) to show something different, or at least the "characters_to_list" call to fail.<div><br></div><div>//Daniel</div></div><div class="gmail_extra"><br><div class="gmail_quote">On Fri, Mar 27, 2015 at 11:27 AM, Loïc Hoguin <span dir="ltr"><<a href="mailto:essen@ninenines.eu" target="_blank">essen@ninenines.eu</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">Hello,<br>
<br>
252 (what your "ü" gives you) is not a valid Unicode code point.<br>
<br>
See <a href="https://en.wikipedia.org/wiki/UTF-8#Description" target="_blank">https://en.wikipedia.org/wiki/UTF-8#Description</a><br>
<br>
"One-byte codes are used only for the ASCII values 0 through 127."<br>
<br>
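To make the byte-versus-code-point distinction concrete, here is a minimal Erlang sketch. It assumes that "ü" is stored in UTF-8 as the two bytes 195,188. The module and function names (umlaut_demo, code_points/1, utf8_bytes/1, latin1_bytes/1) are made up for illustration; only the unicode module calls and binary_to_list/1 are standard library functions.<br>
<br>

```erlang
-module(umlaut_demo).
-export([code_points/1, utf8_bytes/1, latin1_bytes/1]).

%% Decode a UTF-8 encoded binary into a list of Unicode code points.
code_points(Utf8Bin) when is_binary(Utf8Bin) ->
    unicode:characters_to_list(Utf8Bin, utf8).

%% Return the raw bytes of a binary, one integer per byte, without decoding.
utf8_bytes(Utf8Bin) when is_binary(Utf8Bin) ->
    binary_to_list(Utf8Bin).

%% Encode a list of code points as Latin-1, which uses one byte per character.
latin1_bytes(CodePoints) when is_list(CodePoints) ->
    binary_to_list(unicode:characters_to_binary(CodePoints, unicode, latin1)).
```

<br>
With these definitions, umlaut_demo:code_points(<<195,188>>) returns [252] while umlaut_demo:utf8_bytes(<<195,188>>) returns [195,188]. So 252 is the code point of "ü", but a lone byte 252 never appears in its UTF-8 encoding; it only appears as a single byte in Latin-1, as umlaut_demo:latin1_bytes([252]) shows.<br>
<br>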
Guessing the file you read is actually latin1 and not UTF-8.<div><div class="h5"><br>
<br>
On 03/27/2015 10:52 AM, Daniel Abrahamsson wrote:<br>
</div></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div><div class="h5">
Hi,<br>
<br>
I'm a bit confused about the behaviour of xmerl_scan when dealing with<br>
utf-8 data. In short, the XML parser chokes when it encounters a "ü",<br>
but not if I specify the encoding as "latin1". Other parsers in other<br>
languages (e.g. nokogiri in Ruby) seem to handle this just fine. I've<br>
also run the sample XML through various web validators, and they all say<br>
it is valid.<br>
<br>
Is this a bug in xmerl or am I missing something obvious?<br>
<br>
Example session below:<br>
<br>
danabr@danabr ~> echo -n "<?xml version=\"1.0\"<br>
encoding=\"UTF-8\"?><root>ümlaut</root>" > /tmp/test.xml<br>
danabr@danabr ~> erl<br>
Erlang/OTP 17 [erts-6.3] [source] [64-bit] [smp:4:4] [async-threads:10]<br>
[hipe] [kernel-poll:false]<br>
<br>
Eshell V6.3 (abort with ^G)<br>
1> {ok, S} = file:read_file("/tmp/test.xml").<br>
{ok,<<"<?xml version=\"1.0\"<br>
encoding=\"UTF-8\"?><root>ümlaut</root>"/utf8>>}<br>
2> xmerl_scan:string(unicode:characters_to_list(S)).<br>
3414- fatal: {error,{wfc_Legal_Character,{error,{bad_character,252}}}}<br>
** exception exit:<br>
{fatal,{{error,{wfc_Legal_Character,{error,{bad_character,252}}}},<br>
{file,file_name_unknown},<br>
{line,1},<br>
{col,47}}}<br>
in function xmerl_scan:fatal/2 (xmerl_scan.erl, line 4102)<br>
in call from xmerl_scan:scan_char_data/5 (xmerl_scan.erl, line 2703)<br>
in call from xmerl_scan:scan_content/11 (xmerl_scan.erl, line 2615)<br>
in call from xmerl_scan:scan_element/12 (xmerl_scan.erl, line 2128)<br>
in call from xmerl_scan:scan_document/2 (xmerl_scan.erl, line 570)<br>
in call from xmerl_scan:string/2 (xmerl_scan.erl, line 286)<br>
3> xmerl_scan:string(unicode:characters_to_list(S), [{encoding, "latin1"}]).<br>
{{xmlElement,root,root,[],<br>
{xmlNamespace,[],[]},<br>
[],1,[],<br>
[{xmlText,[{root,1}],1,[],"ümlaut",text}],<br>
[],"/home/danabr",undeclared},<br>
[]}<br>
<br>
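One possible reading of the session above, which is an assumption and not something confirmed in this thread, is that xmerl_scan:string/1 expects the undecoded bytes of the document and applies the declared encoding itself, so a list that has already been converted to code points gets decoded a second time. A minimal sketch of passing the raw bytes instead follows; the module name xmerl_bytes_demo and the function scan_file/1 are hypothetical, while file:read_file/1, binary_to_list/1 and xmerl_scan:string/1 are existing OTP calls.<br>
<br>

```erlang
-module(xmerl_bytes_demo).
-export([scan_file/1]).

%% Read an XML file and hand xmerl the raw bytes rather than code points.
%% Assumption: xmerl decodes the bytes according to the XML declaration.
scan_file(Path) ->
    {ok, Bin} = file:read_file(Path),
    %% binary_to_list/1 keeps one integer per byte, so a UTF-8 "ü"
    %% stays as the two bytes 195,188 instead of becoming 252.
    xmerl_scan:string(binary_to_list(Bin)).
```

<br>
It could be tried as xmerl_bytes_demo:scan_file("/tmp/test.xml") on the file created above. Whether this is the intended way to feed UTF-8 data to xmerl is best checked against the xmerl documentation.<br>
<br>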
//Daniel Abrahamsson<br>
<br>
<br></div></div>
_______________________________________________<br>
erlang-questions mailing list<br>
<a href="mailto:erlang-questions@erlang.org" target="_blank">erlang-questions@erlang.org</a><br>
<a href="http://erlang.org/mailman/listinfo/erlang-questions" target="_blank">http://erlang.org/mailman/listinfo/erlang-questions</a><br>
<br><span class="HOEnZb"><font color="#888888">
</font></span></blockquote><span class="HOEnZb"><font color="#888888">
<br>
-- <br>
Loïc Hoguin<br>
<a href="http://ninenines.eu" target="_blank">http://ninenines.eu</a><br>
</font></span></blockquote></div><br></div>