[erlang-questions] reading data from a utf-8 encoded file

Philip Clarke
Thu Jun 20 10:06:40 CEST 2013


I am trying to get the basics of reading data from a utf-8 file.
The docs (http://www.erlang.org/doc/apps/stdlib/unicode_usage.html#id62290)
say:

"It is slightly confusing that a file opened with e.g.
file:open(Name,[read,{encoding,utf8}]), cannot be properly read using
file:read(File,N) but you have to use the io module to retrieve the Unicode
data from it."

So I tested this by writing some Unicode text to a file:
Erlang R16B (erts-5.10.1) [source] [64-bit] [smp:2:2] [async-threads:10]
[hipe] [kernel-poll:false]

Eshell V5.10.1  (abort with ^G)
1> {ok, InputDevice} = file:open("/tmp/test.utf8", [write, {encoding,
2> io:put_chars(InputDevice, <<"Юникод"/utf8>>).
3> file:close(InputDevice).

and then read this back using io:get_line/2
4> {ok, OutputDevice} = file:open("/tmp/test.utf8", [read, {encoding,
5> io:get_line(OutputDevice, "").
6> file:close(OutputDevice).

So far so good, but I also get the same result when I read in the file
using file:read_line/1
7> f().
8> {ok, InputDevice} = file:open("/tmp/test.utf8", [read, {encoding,
9> file:read_line(InputDevice).
10> file:close(InputDevice).

So is it really wrong to use file:read_line/1? It seems to give the
correct result. I suspect that file:read_line/1 is just reading a list
of bytes from the file. The Unicode string in the example is represented
by the bytes shown here:
11> unicode:characters_to_binary("Юникод").

Which (using unicode:characters_to_list) translates to:
[1070,1085,1080,1082,1086,1076] = "Юникод"
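
To make the byte versus code point distinction concrete, here is a minimal sketch. The module and function names are made up for illustration, and the string is written as a list of code points so the result does not depend on the source file or shell encoding.

```erlang
-module(utf8_bytes_demo).
-export([demo/0]).

%% Hypothetical helper: shows the same text as code points and as UTF-8 bytes.
demo() ->
    %% "Юникод" as Unicode code points (the list quoted above).
    CodePoints = [1070,1085,1080,1082,1086,1076],
    %% Encode to UTF-8; each Cyrillic letter here takes two bytes.
    Utf8 = unicode:characters_to_binary(CodePoints),
    12 = byte_size(Utf8),
    %% The raw bytes, which is what a byte-oriented read would hand back.
    Bytes = binary_to_list(Utf8),
    %% Decoding the bytes as UTF-8 recovers the six code points.
    CodePoints = unicode:characters_to_list(Utf8, utf8),
    {Bytes, CodePoints}.
```

Calling utf8_bytes_demo:demo() returns a 12-element byte list next to the 6-element code point list, so a result that looks like the second list is decoded text, not raw bytes.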

Another reason I ask is that io:get_line/2 seems like the wrong tool, since
it requires a Prompt argument that is not used when reading from a file.
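
For reference, a minimal sketch of the round trip through the io module, passing the empty atom '' as the prompt. The module name, function name and Path argument are hypothetical, and the text is again given as code points. It also reads the file back with file:read_file/1, which returns the raw bytes on disk, so the two views can be compared.

```erlang
-module(utf8_file_demo).
-export([roundtrip/1]).

%% Hypothetical helper: writes "Юникод" to Path as UTF-8, then reads it back
%% both as decoded characters and as raw bytes.
roundtrip(Path) ->
    Text = [1070,1085,1080,1082,1086,1076],
    {ok, Out} = file:open(Path, [write, {encoding, utf8}]),
    ok = io:put_chars(Out, Text),
    ok = file:close(Out),
    {ok, In} = file:open(Path, [read, {encoding, utf8}]),
    %% The prompt is required by the API; the empty atom is used as a
    %% placeholder here.
    Line = io:get_line(In, ''),
    ok = file:close(In),
    %% Raw file contents, with no decoding applied.
    {ok, RawBytes} = file:read_file(Path),
    {Line, RawBytes}.
```

Here Line comes back as the list of six code points, while RawBytes is the 12-byte UTF-8 binary.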

Thanks in advance