[erlang-bugs] Misleading docs or implementation of file:read/2 and friends

Thu Jan 29 09:14:46 CET 2015

Pinging this thread for feedback.

*José Valim*
www.plataformatec.com.br
Skype: jv.ptec
Founder and Lead Developer

On Fri, Jul 4, 2014 at 5:10 PM, José Valim <jose.valim@REDACTED>
wrote:

> Hello everyone,
>
> I have found the documentation or implementation of file:read/2 to be
> misleading when working with unicode devices in binary mode. I will use
> file:read_line/1 in the examples below, but the issue applies to
> file:read/2, file:pread/3, etc.
>
> $ echo "héllo" > foo
>
> $ erl
> 1> {ok, F} = file:open("foo", [binary, unicode]).
> {ok,<0.34.0>}
> 2> {ok, Bin} = file:read_line(F).
> {ok,<<"héllo\n">>}
> 3> <<Bin/binary, 0>>.
> <<104,233,108,108,111,10,0>>
>
>
> Note the result is not the one desired: I expected a binary
> containing <<"héllo\n"/utf8>>, or more explicitly, I expected it to contain
> the bytes <<195, 169>> instead of <<233>>. In other words, I got latin1 as
> the result. With char lists, we would get "héllo\n" but the function will fail
> for any codepoint > 255.
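
To make the byte-level difference concrete, here is a minimal sketch that re-encodes the latin1 bytes from the session above into the UTF-8 bytes that were expected. The module and function names are hypothetical; unicode:characters_to_binary/3 is the standard stdlib call.

```erlang
-module(latin1_vs_utf8).
-export([to_utf8/1]).

%% Re-encode a binary that holds latin1 bytes (one byte per character)
%% as UTF-8. Characters above 127 become two bytes, so 233 ("é")
%% turns into 195,169.
to_utf8(Latin1Bin) when is_binary(Latin1Bin) ->
    unicode:characters_to_binary(Latin1Bin, latin1, utf8).
```

Calling latin1_vs_utf8:to_utf8(<<104,233,108,108,111,10>>) gives <<104,195,169,108,108,111,10>>, which is the binary the report expected from file:read_line/1.
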
>
> Note this behaviour also happens if I use file:read_line/1 on any other IO
> device set to unicode (like standard_io).
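
For comparison, a small sketch that reads the same line through both file:read_line/1 and io:get_line/2. It assumes io:get_line/2 behaves as the io documentation describes, delivering UTF-8 when the I/O server is in binary mode; the module name, function name and Path argument are hypothetical, and the result of file:read_line/1 may differ between OTP releases.

```erlang
-module(read_line_compare).
-export([demo/1]).

%% Write "héllo\n" as UTF-8 to Path, then read the line back twice:
%% once via the file module and once via the io module.
demo(Path) ->
    ok = file:write_file(Path, <<"h", 195, 169, "llo\n">>),

    {ok, F1} = file:open(Path, [read, binary, {encoding, utf8}]),
    ViaFile = file:read_line(F1),      % the call discussed in this report
    ok = file:close(F1),

    {ok, F2} = file:open(Path, [read, binary, {encoding, utf8}]),
    ViaIo = io:get_line(F2, ''),       % unicode-aware I/O request
    ok = file:close(F2),

    {ViaFile, ViaIo}.
```

Comparing the two elements of the returned tuple shows whether a given release still hands back latin1 bytes from file:read_line/1.
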
>
> The trouble I have with the function is that it is meant to be byte-oriented
> but it only really works for latin1 files. If you have a unicode file, the
> behaviour seems to be broken for binaries, and it only works for a limited
> range of codepoints when using char lists.
>
> It is interesting to note that those functions use the old {get_line, Prompt}
> messages which, according to the I/O protocol guide
> <http://www.erlang.org/doc/apps/stdlib/io_protocol.html>, should not
> exist beyond R15B (section 1.3). The same section suggests
> translating {get_line, Prompt} to {get_line, latin1, Prompt}, which seems to
> be the root cause of the issues above: those functions were never meant to
> work with unicode devices.
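
The translation described in that section can be sketched roughly as follows. This is an illustration of the guide's advice, not the actual file_io_server code; the module and function names are hypothetical.

```erlang
-module(legacy_io_request).
-export([normalize/1]).

%% Rewrite the pre-unicode input requests into their encoding-tagged
%% form, always assuming latin1 as the I/O protocol guide advises.
normalize({get_line, Prompt}) ->
    {get_line, latin1, Prompt};
normalize({get_chars, Prompt, N}) ->
    {get_chars, latin1, Prompt, N};
normalize({get_until, Prompt, M, F, A}) ->
    {get_until, latin1, Prompt, M, F, A};
%% Requests that already carry an encoding pass through unchanged.
normalize(Request) ->
    Request.
```

Because latin1 is hard-coded here, a legacy request sent to a unicode device can only ever yield latin1 data, which matches the behaviour reported above.
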
>
> Unless I am terribly wrong, I can think of some ways to fix those
> functions:
>
> 1. Keep their dual aspect of returning bytes and/or characters but fix the
> bug when working with unicode. This means {get_line, Prompt} should
> instead translate to {get_line, EncodingOfTheIODevice, Prompt}.
>
> 2. Make them crash if the encoding of the device is not latin1. This means
> we translate {get_line, Prompt} to {get_line, latin1, Prompt} only if the
> encoding of the device is latin1.
>
> 3. Make them always work at the byte level, regardless of the encoding of
> the IO device. This would require assigning new meaning to the {get_line,
> Prompt} message, so I believe it is not going to happen (although it would
> be a useful feature in my opinion).
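
Options 1 and 2 can be sketched as two alternative translation functions. Everything here is hypothetical: the module and function names, the DeviceEncoding argument, and the choice to return {error, request} for option 2 where the proposal says "crash". Option 3 is left out because it would need new semantics for the message itself.

```erlang
-module(legacy_get_line).
-export([option1/2, option2/2]).

%% Option 1: tag the legacy request with the device's own encoding
%% instead of a hard-coded latin1.
option1({get_line, Prompt}, DeviceEncoding) ->
    {ok, {get_line, DeviceEncoding, Prompt}}.

%% Option 2: accept the legacy request only on latin1 devices and
%% reject it everywhere else (an assumed error shape, see above).
option2({get_line, Prompt}, latin1) ->
    {ok, {get_line, latin1, Prompt}};
option2({get_line, _Prompt}, _OtherEncoding) ->
    {error, request}.
```

With option 1 a unicode device would answer a legacy request with unicode data; with option 2 the caller finds out immediately that the legacy call is not supported on that device.
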
>
> Given that the original IO messages were never meant to work with unicode,
> maybe 2) is the best way to go here. Both 1) and 2) would require a small
> amendment to the current I/O protocol advice but I would argue it is
> necessary to fix the current limitations/bugs when working with unicode.
>
> *José Valim*
> www.plataformatec.com.br
> Skype: jv.ptec
> Founder and Lead Developer
>