[erlang-bugs] Misleading docs or implementation of file:read/2 and friends

Fri Jul 4 17:10:31 CEST 2014

Hello everyone,

I have found the documentation or implementation of file:read/2 to be
misleading when working with unicode devices in binary mode. I will use
file:read_line/1 in the examples below, but the issue also applies to
file:read/2, file:pread/3, etc.

$ echo "héllo" > foo

$ erl
1> {ok, F} = file:open("foo", [binary, unicode]).
{ok,<0.34.0>}
2> {ok, Bin} = file:read_line(F).
{ok,<<"héllo\n">>}
3> <<Bin/binary, 0>>.
<<104,233,108,108,111,10,0>>

Note the result is not the one desired: I expected a binary
containing <<"héllo\n"/utf8>>, or more explicitly, I expected it to contain
the bytes <<195, 169>> instead of <<233>>. In other words, I got latin1 as
the result. With char lists, we would get "héllo\n", but the function will
fail for any codepoint > 255.
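
To make the byte-level difference concrete, here is a minimal sketch that
encodes the same characters both ways with unicode:characters_to_binary/3.
The module and function names (enc_demo, utf8_bytes, latin1_bytes) are made
up for illustration and are not part of OTP.

```erlang
-module(enc_demo).
-export([utf8_bytes/0, latin1_bytes/0]).

%% "héllo\n" as a list of code points; 233 is U+00E9 (é).
chars() -> [104, 233, 108, 108, 111, 10].

%% Encoded as UTF-8, é becomes the two bytes 195,169.
utf8_bytes() ->
    unicode:characters_to_binary(chars(), unicode, utf8).

%% Encoded as latin1, é stays the single byte 233.
latin1_bytes() ->
    unicode:characters_to_binary(chars(), unicode, latin1).
```

enc_demo:utf8_bytes() gives <<104,195,169,108,108,111,10>>, which is what I
expected from the file, while enc_demo:latin1_bytes() gives
<<104,233,108,108,111,10>>, which is what file:read_line/1 returned.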

Note this behaviour also happens if I use file:read_line/1 on any other IO
device set to unicode (like standard_io).

The trouble I have with these functions is that they are meant to be
byte-oriented but only really work for latin1 files. If you have a unicode
file, the behaviour seems to be broken for binaries, and it only works for
a limited range of codepoints when using char lists.

It is interesting to notice that those functions use the old
{get_line, Prompt} messages which, according to the I/O protocol guide
<http://www.erlang.org/doc/apps/stdlib/io_protocol.html>, should not exist
beyond R15B (section 1.3). The same section suggests translating
{get_line, Prompt} to {get_line, latin1, Prompt}, which seems to be the root
cause of the issues above: those functions were never meant to work with
unicode devices.
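
As a rough way to compare the two request forms, the sketch below sends a
get_line request straight to a file device using the io_request/io_reply
message shapes from the I/O protocol guide. The module name get_line_demo,
the helper request/2 and the 5 second timeout are my own choices for
illustration; the file is opened with the documented {encoding, unicode}
option.

```erlang
-module(get_line_demo).
-export([read_first_line/2]).

%% Path is assumed to name a UTF-8 encoded file.
%% Enc is the encoding named in the request: latin1 or unicode.
read_first_line(Path, Enc) ->
    {ok, Dev} = file:open(Path, [read, binary, {encoding, unicode}]),
    Reply = request(Dev, {get_line, Enc, ''}),
    ok = file:close(Dev),
    Reply.

%% Send one I/O protocol request and wait for the matching reply.
request(Dev, Request) ->
    ReplyAs = make_ref(),
    Dev ! {io_request, self(), ReplyAs, Request},
    receive
        {io_reply, ReplyAs, Reply} -> Reply
    after 5000 ->
        {error, timeout}
    end.
```

On the foo file above, get_line_demo:read_first_line("foo", latin1) should
reproduce the latin1 bytes shown earlier, while the unicode variant is
expected to return the UTF-8 bytes instead.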

Unless I am terribly wrong, I can think of some ways to fix those functions:

1. Keep their dual aspect of returning bytes and/or characters but fix the
bug when working with unicode. This means {get_line, Prompt} should
instead translate to {get_line, EncodingOfTheIODevice, Prompt}.

2. Make them crash if the encoding of the device is not latin1. This means
we translate {get_line, Prompt} to {get_line, latin1, Prompt} only if the
encoding of the device is latin1.

3. Make them always work at the byte level, regardless of the encoding of
the IO device. This would require assigning a new meaning to the
{get_line, Prompt} message, so I believe it is not going to happen
(although it would be a useful feature in my opinion).
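
Options 1 and 2 could be sketched roughly as the two translation functions
below. This is only an illustration of the idea, not how any IO server is
implemented: the module name legacy_req, both function names and the
no_translation error term are hypothetical, and DevEnc stands for whatever
encoding the device was opened with.

```erlang
-module(legacy_req).
-export([option1/2, option2/2]).

%% Option 1: translate the old request using the device's own encoding.
option1({get_line, Prompt}, DevEnc) ->
    {get_line, DevEnc, Prompt}.

%% Option 2: translate to latin1 only when the device is latin1,
%% and crash otherwise (the error term is an assumption).
option2({get_line, Prompt}, latin1) ->
    {get_line, latin1, Prompt};
option2({get_line, _Prompt}, DevEnc) ->
    erlang:error({no_translation, DevEnc, latin1}).
```

With option 1, a unicode device would see {get_line, unicode, Prompt}; with
option 2, the same call would raise instead of silently returning latin1.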

Given that the original IO messages were never meant to work with unicode,
maybe 2) is the best way to go here. Both 1) and 2) would require a small
amendment to the current I/O protocol advice but I would argue it is
necessary to fix the current limitations/bugs when working with unicode.

*José Valim*
www.plataformatec.com.br
Skype: jv.ptec
Founder and Lead Developer