<div dir="ltr"><font size="4" color="#000000">Hi,</font><div><font size="4" color="#000000"><br></font></div><div style><font size="4" color="#000000">I am trying to understand the basics of reading data from a UTF-8 file.</font></div>
<div style><font size="4" color="#000000">The docs (<a href="http://www.erlang.org/doc/apps/stdlib/unicode_usage.html#id62290">http://www.erlang.org/doc/apps/stdlib/unicode_usage.html#id62290</a>) say:</font></div>
<div style><br></div><div style><font size="4" color="#000000">"<span style="font-family:Verdana,Arial,Helvetica,sans-serif">It is slightly confusing that a file opened with e.g. </span><span class="" style="font-family:Courier,monospace">file:open(Name,[read,{encoding,utf8}])</span><span style="font-family:Verdana,Arial,Helvetica,sans-serif">, cannot be properly read using </span><span class="" style="font-family:Courier,monospace">file:read(File,N)</span><span style="font-family:Verdana,Arial,Helvetica,sans-serif"> but you have to use the </span><span class="" style="font-family:Courier,monospace">io</span><span style="font-family:Verdana,Arial,Helvetica,sans-serif"> module to retrieve the Unicode data from it.</span>"</font></div>
<div style><br></div><div style><br></div><div style><font size="4">So I tested this by writing some Unicode text to a file:</font></div><div style><div><font face="courier new, monospace">Erlang R16B (erts-5.10.1) [source] [64-bit] [smp:2:2] [async-threads:10] [hipe] [kernel-poll:false]</font></div>
<div><font face="courier new, monospace"><br></font></div><div><font face="courier new, monospace">Eshell V5.10.1 (abort with ^G)</font></div><div><font face="courier new, monospace">1> {ok, InputDevice} = file:open("/tmp/test.utf8", [write, {encoding, unicode}]).</font></div>
<div><font face="courier new, monospace">{ok,<0.35.0>}</font></div><div><font face="courier new, monospace">2> io:put_chars(InputDevice, <<"Юникод"/utf8>>).</font></div><div><font face="courier new, monospace">ok</font></div>
<div><font face="courier new, monospace">3> file:close(InputDevice). </font></div><div><br></div><div style><font size="4" color="#000000">and then read it back using io:get_line/2:</font></div>
<div style><div><font face="courier new, monospace">4> {ok, OutputDevice} = file:open("/tmp/test.utf8", [read, {encoding, unicode}]).</font></div><div><font face="courier new, monospace">{ok,<0.39.0>}</font></div>
<div><font face="courier new, monospace">5> io:get_line(OutputDevice, "").</font></div><div><font face="courier new, monospace">"Юникод"</font></div><div><font face="courier new, monospace">6> file:close(OutputDevice). </font> </div>
<div><br></div></div></div><div style><font size="4">So far so good, but I get the same result when I read the file using file:read_line/1:</font></div><div style><div><font face="courier new, monospace">7> f().</font></div>
<div><font face="courier new, monospace">ok</font></div><div><font face="courier new, monospace">8> {ok, InputDevice} = file:open("/tmp/test.utf8", [read, {encoding, unicode}]).</font></div><div><font face="courier new, monospace">{ok,<0.44.0>}</font></div>
<div><font face="courier new, monospace">9> file:read_line(InputDevice).</font></div><div><font face="courier new, monospace">{ok,"Юникод"}</font></div><div><font face="courier new, monospace">10> file:close(InputDevice). </font> </div>
<div><br></div></div><div style><br></div><div style><font size="4">So is it really wrong to use file:read_line/1? It seems to give the correct result. I suspect that file:read_line/1 is just reading a list of bytes from the file. The Unicode string in the example is represented by the bytes shown here: </font></div>
<div style><div><font face="courier new, monospace">11> unicode:characters_to_binary("Юникод").</font></div><div><font face="courier new, monospace"><<208,174,208,189,208,184,208,186,208,190,208,180>></font></div>
<div style="font-size:large"><br></div><div style="font-size:large">Which (using unicode:characters_to_list) translates to:</div><div><font face="courier new, monospace">[1070,1085,1080,1082,1086,1076] = "Юникод"</font></div>
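<div style="font-size:large"><br></div><div style="font-size:large">A sketch of how that suspicion could be checked, assuming /tmp/test.utf8 still holds the text written above (CodePoints, Utf8Bytes, Dev and Line are names made up for this check), is to compare what file:read_line/1 returns against both candidate lists:</div>
<div><font face="courier new, monospace">%% Hypothetical check; assumes /tmp/test.utf8 was written as in the session above.</font></div>
<div><font face="courier new, monospace">CodePoints = [1070,1085,1080,1082,1086,1076].</font></div>
<div><font face="courier new, monospace">%% The same text as a list of its UTF-8 bytes (12 integers instead of 6).</font></div>
<div><font face="courier new, monospace">Utf8Bytes = binary_to_list(unicode:characters_to_binary(CodePoints)).</font></div>
<div><font face="courier new, monospace">{ok, Dev} = file:open("/tmp/test.utf8", [read, {encoding, unicode}]).</font></div>
<div><font face="courier new, monospace">{ok, Line} = file:read_line(Dev).</font></div>
<div><font face="courier new, monospace">file:close(Dev).</font></div>
<div><font face="courier new, monospace">Line =:= CodePoints. % true would mean the line came back as code points</font></div>
<div><font face="courier new, monospace">Line =:= Utf8Bytes. % true would mean the line came back as undecoded bytes</font></div>
<div style="font-size:large"><br></div><div style="font-size:large">Comparing against explicit integer lists avoids relying on how the shell and the terminal happen to display the string.</div>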
</div><div style><br></div><div style><font size="4">Another reason I ask is that it seems wrong to use io:get_line/2, as it requires a Prompt argument that serves no purpose when reading from a file.</font></div>
<div style><font size="4"><br></font></div><div style><font size="4">Thanks in advance</font></div><div style><font size="4">Philip</font></div></div>