[erlang-bugs] Misleading docs or implementation of file:read/2 and friends

Vlad Dumitrescu vladdu55@REDACTED
Thu Jan 29 11:04:06 CET 2015


Hi,

I agree that it is difficult to know which functions in 'file' or 'io' to
use and if the file should be opened as unicode or not. Without reading the
docs in all detail, it's easy to assume that since file:open takes an
encoding option, the rest of the functions in the module can handle all
encodings.

The docs say

The rule of thumb is that the file module should be used for files opened
for bytewise access ({encoding,latin1}) and the io module should be used
when accessing files with any other encoding (e.g. {encoding,utf8}).


and using io:get_line/2 works, but maybe the very first step could be to
make the note above stand out more prominently in the docs?
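
For illustration, here is a minimal sketch of what the rule of thumb
amounts to in practice. The module and function names (utf8_line,
first_line/1) are made up for this example; only file:open/2,
io:get_line/2 and file:close/1 are real calls.

```erlang
-module(utf8_line).
-export([first_line/1]).

%% Read the first line of a UTF-8 encoded file through the io module
%% rather than the file module. In binary mode the io functions return
%% UTF-8 encoded binaries. The result is a binary, eof or {error, Reason}.
first_line(Path) ->
    {ok, F} = file:open(Path, [read, binary, {encoding, utf8}]),
    try
        %% The prompt is not used for anything meaningful on a plain file.
        io:get_line(F, "")
    after
        file:close(F)
    end.
```

With the "foo" file from the original report, the returned binary should
contain the bytes 195,169 for the "é" rather than the single byte 233.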

Also, a simple step that can be done right away is to let file:read_line
return the same error message as file:read:
{error,{no_translation,unicode,latin1}}, which describes exactly what the
problem is. {error, collect_line} isn't even a documented error message.
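
Until the two functions agree, a caller has to recognise both error shapes.
The following is only a sketch: the module name read_result, the function
classify/1 and the atom needs_io_module are invented for this example, and
the collect_line clause relies on the behaviour described in this thread
rather than on anything documented.

```erlang
-module(read_result).
-export([classify/1]).

%% Normalise the result of file:read/2 or file:read_line/1 so that both
%% "wrong encoding" errors look the same to the caller.
classify({ok, Data}) ->
    {ok, Data};
classify(eof) ->
    eof;
classify({error, {no_translation, unicode, latin1}}) ->
    %% Documented for file:read/2: the device is not latin1.
    {error, needs_io_module};
classify({error, collect_line}) ->
    %% Undocumented; reported for file:read_line/1 in this thread.
    {error, needs_io_module};
classify({error, Reason}) ->
    {error, Reason}.
```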

best regards,
Vlad


On Thu, Jan 29, 2015 at 9:14 AM, José Valim <jose.valim@REDACTED> wrote:

> Pinging this thread for feedback.
>
>
>
> *José Valim*
> www.plataformatec.com.br
> Skype: jv.ptec
> Founder and Lead Developer
>
> On Fri, Jul 4, 2014 at 5:10 PM, José Valim <
> jose.valim@REDACTED> wrote:
>
>> Hello everyone,
>>
>> I have found the documentation or implementation of file:read/2 to be
>> misleading when working with unicode devices in binary mode. I will use
>> file:read_line/1 in the examples below but the issue applies to
>> file:read/2, file:pread/3, etc.
>>
>> $ echo "héllo" > foo
>>
>> $ erl
>> 1> {ok, F} = file:open("foo", [binary, {encoding, unicode}]).
>> {ok,<0.34.0>}
>> 2> {ok, Bin} = file:read_line(F).
>> {ok,<<"héllo\n">>}
>> 3> <<Bin/binary, 0>>.
>> <<104,233,108,108,111,10,0>>
>>
>>
>> Note that the result is not the desired one: I expected a binary
>> containing <<"héllo\n"/utf8>>, or more explicitly, I expected it to contain
>> the bytes <<195, 169>> instead of <<233>>. In other words, I got latin1 as
>> a result. With char lists, we would get "héllo\n", but the function will
>> fail for any codepoint > 255.
>>
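
To make the byte-level difference concrete, here is a small sketch using
only the standard unicode module. The module name e_acute and the function
demo/0 are made up for the example; every match in it must succeed for
demo/0 to return ok.

```erlang
-module(e_acute).
-export([demo/0]).

demo() ->
    %% "é" (U+00E9) is the single byte 233 in latin1 ...
    Latin1 = <<233>>,
    %% ... and the two bytes 195,169 once encoded as UTF-8.
    Utf8 = unicode:characters_to_binary(Latin1, latin1, utf8),
    <<195, 169>> = Utf8,
    %% The /utf8 segment type encodes the same code point directly.
    <<195, 169>> = <<233/utf8>>,
    %% A code point above 255 (here U+20AC) has no latin1 representation,
    %% so the conversion reports an error instead of returning a binary.
    {error, _Converted, _Rest} =
        unicode:characters_to_binary([16#20AC], unicode, latin1),
    ok.
```
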
>> Note this behaviour also happens if I use file:read_line/1 on any other
>> IO device set to unicode (like standard_io).
>>
>> The trouble I have with the function is that it is meant to be
>> byte-oriented but it only really works for latin1 files. If you have a
>> unicode file, the behaviour seems to be broken for binaries, and it only
>> works for a limited range of codepoints when using char lists.
>>
>> It is interesting to note that those functions use the old {get_line,
>> Prompt} messages which, according to the I/O protocol guide
>> <http://www.erlang.org/doc/apps/stdlib/io_protocol.html>, should not
>> exist beyond R15B (section 1.3). The same section suggests translating
>> {get_line, Prompt} to {get_line, latin1, Prompt}, which seems to be the
>> root cause of the issues above: those functions were never meant to
>> work with unicode devices.
>>
>> Unless I am terribly wrong, I can think of some ways to fix those
>> functions:
>>
>> 1. Keep their dual aspect of returning bytes and/or characters but fix the
>> bug when working with unicode. This means {get_line, Prompt} should
>> instead translate to {get_line, EncodingOfTheIODevice, Prompt}
>>
>> 2. Make them crash if the encoding of the device is not latin1. This
>> means we translate {get_line, Prompt} to {get_line, latin1, Prompt} only if
>> the encoding of the device is latin1.
>>
>> 3. Make them always work at the byte level, regardless of the encoding of
>> the IO device. This would require assigning new meaning to the {get_line,
>> Prompt} message, so I believe it is not going to happen (although it would
>> be a useful feature in my opinion).
>>
>> Given that the original IO messages were never meant to work with
>> unicode, maybe 2) is the best way to go here. Both 1) and 2) would require
>> a small amendment to the current I/O protocol advice but I would argue it
>> is necessary to fix the current limitations/bugs when working with unicode.
>>
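
As a rough sketch of what option 2 could look like inside an I/O server:
the module name legacy_request and the function normalize/2 are
hypothetical, and reusing the no_translation error tuple is an assumption,
not something the I/O protocol guide prescribes.

```erlang
-module(legacy_request).
-export([normalize/2]).

%% Translate a legacy {get_line, Prompt} request given the encoding the
%% device was opened with. Only a latin1 device accepts the old message;
%% any other encoding is rejected instead of silently returning latin1.
normalize({get_line, Prompt}, latin1) ->
    {ok, {get_line, latin1, Prompt}};
normalize({get_line, _Prompt}, DeviceEncoding) ->
    {error, {no_translation, DeviceEncoding, latin1}};
%% Requests that already carry an encoding pass through untouched.
normalize(Request, _DeviceEncoding) ->
    {ok, Request}.
```

Option 1 would differ only in the first two clauses: the legacy request
would become {get_line, DeviceEncoding, Prompt} for every encoding.
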
>> *José Valim*
>> www.plataformatec.com.br
>> Skype: jv.ptec
>> Founder and Lead Developer
>>
>