[erlang-bugs] Misleading docs or implementation of file:read/2 and friends

Tue Mar 3 17:15:03 CET 2015

I took a look at the file:read_line issue, and thought about it :-)

IMHO it should behave as file:read/2 does and as it is documented,
i.e. return {error, {no_translation, unicode, latin1}} for both binary and
list mode when the encoding is set to unicode and the data contains
non-latin1 code points.
(Which is confusing, but that is how it is intended/documented;
 use the io module for translating non-latin1 code points.)
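
For illustration, a minimal sketch of the io-module route, assuming a UTF-8
file opened with {encoding, utf8}. The module name read_utf8 and the function
first_line/1 are made up for this example and are not part of OTP:

```erlang
%% Sketch only: read the first line of a UTF-8 encoded file through the
%% io module instead of file:read_line/1.
%% read_utf8 and first_line/1 are hypothetical names for this example.
-module(read_utf8).
-export([first_line/1]).

first_line(Path) ->
    %% binary + {encoding, utf8}: io:get_line/2 hands back the line as a
    %% UTF-8 encoded binary rather than latin1 bytes.
    {ok, F} = file:open(Path, [read, binary, {encoding, utf8}]),
    try
        %% Returns the line, eof, or {error, Reason}.
        io:get_line(F, "")
    after
        file:close(F)
    end.
```

With a file containing "héllo", the é should come back as the two bytes
195,169 rather than the single byte 233.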

Also file:pread/[2,3], shouldn't they also behave the same as file:read/2
or am I missing something?

/Dan

On Thu, Jan 29, 2015 at 11:04 AM, Vlad Dumitrescu <vladdu55@REDACTED>
wrote:

> Hi,
>
> I agree that it is difficult to know which functions in 'file' or 'io' to
> use and if the file should be opened as unicode or not. Without reading the
> docs in all detail, it's easy to assume that since file:open takes an
> encoding option, the rest of the functions in the module can handle all
> encodings.
>
> The docs say
>
> The rule of thumb is that the file module should be used for files opened
> for bytewise access ({encoding,latin1}) and the io module should be used
> when accessing files with any other encoding (e.g. {encoding,utf8}).
>
>
> and using io:get_line/2 works, but maybe the very first step could be to
> make the note above stand out more prominently in the docs?
>
> Also, a simple step that can be done right away is to let file:read_line
> return the same error message as file:read:
> {error,{no_translation,unicode,latin1}}, which describes exactly what the
> problem is. {error, collect_line} isn't even a documented error message.
>
> best regards,
> Vlad
>
>
> On Thu, Jan 29, 2015 at 9:14 AM, José Valim <
> jose.valim@REDACTED> wrote:
>
>> Pinging this thread for feedback.
>>
>>
>>
>> *José Valim*
>> www.plataformatec.com.br
>> Skype: jv.ptec
>> Founder and Lead Developer
>>
>> On Fri, Jul 4, 2014 at 5:10 PM, José Valim <
>> jose.valim@REDACTED> wrote:
>>
>>> Hello everyone,
>>>
>>> I have found the documentation or implementation of file:read/2 to be
>>> misleading when working with unicode devices in binary mode. I will use
>>> file:read_line/1 in the examples below but the issue applies to
>>> file:read/2, file:pread/2, etc.
>>>
>>> $ echo "héllo" > foo
>>>
>>> $ erl
>>> 1> {ok, F} = file:open("foo", [binary, unicode]).
>>> {ok,<0.34.0>}
>>> 2> {ok, Bin} = file:read_line(F).
>>> {ok,<<"héllo\n">>}
>>> 3> <<Bin/binary, 0>>.
>>> <<104,233,108,108,111,10,0>>
>>>
>>>
>>> Note the result is not the one desired because I expected a binary
>>> containing <<"héllo\n"/utf8>>, or more explicitly, I expected it to contain
>>> the bytes <<195, 169>> instead of <<233>>. In other words, I got latin1 as
>>> result. With char lists, we would get "héllo\n" but the function will fail
>>> for any codepoint > 255.
>>>
>>> Note this behaviour also happens if I use file:read_line/1 on any other
>>> IO device set to unicode (like standard_io).
>>>
>>> The trouble I have with the function is that it is meant to be
>>> byte-oriented but it only really works for latin1 files. If you have a
>>> unicode file, the behaviour seems to be broken for binaries, and it only
>>> works for a limited range of codepoints when using char lists.
>>>
>>> It is interesting to notice those functions use the old {get_line,
>>> Prompt} messages which, according to the I/O protocol guide
>>> <http://www.erlang.org/doc/apps/stdlib/io_protocol.html>, should not
>>> exist beyond R15B (section 1.3). The same section suggests
>>> translating {get_line, Prompt} to {get_line, latin1, Prompt}, which seems to
>>> be the root cause of the issues above: those functions were never meant to
>>> work with unicode devices.
>>>
>>> Unless I am terribly wrong, I can think of some ways to fix those
>>> functions:
>>>
>>> 1. Keep its dual aspect of returning bytes and/or characters but fix the
>>> bug when working with unicode. This means the {get_line, Prompt} should
>>> rather translate to {get_line, EncodingOfTheIODevice, Prompt}
>>>
>>> 2. Make them crash if the encoding of the device is not latin1. This
>>> means we translate {get_line, Prompt} to {get_line, latin1, Prompt} only if
>>> the encoding of the device is latin1.
>>>
>>> 3. Make it always work at the byte level, regardless of the encoding of
>>> the IO device. This would require assigning new meaning to the {get_line,
>>> Prompt} message, so I believe it is not going to happen (although it would
>>> be a useful feature in my opinion).
>>>
>>> Given that the original IO messages were never meant to work with
>>> unicode, maybe 2) is the best way to go here. Both 1) and 2) would require
>>> a small amendment to the current I/O protocol advice but I would argue it
>>> is necessary to fix the current limitations/bugs when working with unicode.
>>>
>>>
>>
>>
>> _______________________________________________
>> erlang-bugs mailing list
>> erlang-bugs@REDACTED
>> http://erlang.org/mailman/listinfo/erlang-bugs
>>
>>
>