[erlang-bugs] Misleading docs or implementation of file:read/2 and friends

Fri Jul 4 20:37:37 CEST 2014

I had a fruitful discussion with Fred on IRC. Fred pointed out that this
works:

$ echo "héllo<0001f603>" > foo
(which is $ echo "héllo😃" > foo with emoji, if the email client supports
it)

$ erl
1> io:format("~w~n",[begin {ok, F} = file:open("foo",
[{encoding,unicode}]), file:read_line(F) end]).
{ok,[104,233,108,108,111,128515,10]}

According to the docs of file:read_line/1, this is not supposed to happen:

"If encoding is set to something else than latin1, the read_line/1 call
will fail if the data contains characters larger than 255, why the io(3)
module is to be preferred when reading such a file."
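
As a sanity check on that output (a minimal sketch for the Erlang shell, not
part of the original session): the list holds plain Unicode code points.
Encoding it as UTF-8 yields the bytes that echo wrote to disk, while a latin1
conversion stops at 128515.

```erlang
%% Code points returned by file:read_line/1 above.
Chars = [104,233,108,108,111,128515,10].

%% Encoded as UTF-8, this is exactly what echo wrote to foo.
Utf8 = unicode:characters_to_binary(Chars).
Utf8 = <<104,195,169,108,108,111,240,159,152,131,10>>.

%% A latin1 target cannot hold 128515, so the conversion stops there
%% and returns an error tuple with the part converted so far.
{error, <<104,233,108,108,111>>, _Rest} =
    unicode:characters_to_binary(Chars, unicode, latin1).
```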

If the IO device is meant to return all unicode codepoints as above, it
means {get_line, Prompt} should translate to {get_line, IODeviceEncoding,
Prompt} and we need to amend the I/O protocol to say so.

However, if the result is meant to be invalid, it means file:read_line/1
does an implicit conversion to latin1, since {get_line, Prompt} translates
to {get_line, latin1, Prompt}. We could document it but I would rather
disallow it by making the requests fail if the IO device encoding is not
latin1.

I tried to sum up the possible solutions, in no particular order, to the
best of my analysis:

1. Make it explicit that file:read_line/1 does a latin1 conversion. This
means we need to fix the code to raise for codepoints > 255 when returning
char lists (but the translation is a confusing behaviour imo)

2. Make it explicit that file:read_line/1 only works if the IO device is
encoded in latin1. This means we need to change the code to fail for
non-latin1 encodings. The old I/O protocol messages would have to be amended
to say (addition in bold):

These should behave as {get_until, latin1, Prompt, Module, Function,
ExtraArgs}, {get_chars, latin1, Prompt, N} and {get_line, latin1, Prompt}
respectively *if the device is encoded in latin1; otherwise they should
return an error* (error to be specified).

3. Make file:read_line/1 return in the encoding of the I/O device. This
means we need to change the code to translate {get_line, Prompt} to
{get_line, IODeviceEncoding, Prompt}. However, this change implies the
function is no longer byte-oriented. In any case, the old I/O protocol
messages would have to be amended to say (addition in bold):

These should behave as {get_until, DeviceEncoding, Prompt, Module,
Function, ExtraArgs}, {get_chars, DeviceEncoding, Prompt, N} and
{get_line, DeviceEncoding, Prompt} respectively*, where DeviceEncoding is
the encoding of the device.*

4. Make file:read_line/1 always read bytes, regardless of the encoding.
This is arguably how those messages behaved before unicode was added.
However, this means the devices would need to implement specific logic for
those messages, as they cannot simply translate {get_line, Prompt} to
{get_line, latin1, Prompt}.
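
Options 2 and 3 differ only in how an I/O server rewrites the old-style
requests. A minimal sketch of that difference follows. The module name, the
function names and the error term are all hypothetical; none of this is taken
from the actual file I/O server.

```erlang
-module(old_io_requests).
-export([strict/2, device_encoding/2]).

%% Option 2 (hypothetical): old-style requests are only accepted on
%% latin1 devices; the error term is a placeholder, to be specified.
strict(latin1, Request) -> {ok, upgrade(latin1, Request)};
strict(_Encoding, _Request) -> {error, encoding_mismatch}.

%% Option 3 (hypothetical): old-style requests read in the device's
%% own encoding.
device_encoding(Encoding, Request) -> {ok, upgrade(Encoding, Request)}.

%% Rewrite an old-style request into its encoding-tagged form.
upgrade(Enc, {get_line, Prompt}) ->
    {get_line, Enc, Prompt};
upgrade(Enc, {get_chars, Prompt, N}) ->
    {get_chars, Enc, Prompt, N};
upgrade(Enc, {get_until, Prompt, Mod, Fun, Args}) ->
    {get_until, Enc, Prompt, Mod, Fun, Args}.
```

With this sketch, old_io_requests:strict(unicode, {get_line, ''}) returns the
placeholder error, while old_io_requests:device_encoding(unicode,
{get_line, ''}) returns {ok, {get_line, unicode, ''}}.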

To me, the best solutions are 2 and 4, because they preserve the
byte-oriented aspect. Regardless, it seems the implementation of
file:read_line/1 has a bug. :)

*José Valim*
www.plataformatec.com.br
Skype: jv.ptec
Founder and Lead Developer

On Fri, Jul 4, 2014 at 7:33 PM, Fred Hebert <mononcqc@REDACTED> wrote:

> On 07/04, José Valim wrote:
> > Hello everyone,
> >
> > I have found the documentation or implementation of file:read/2 to be
> > misleading when working with unicode devices in binary mode. I will use
> > file:read_line/1 in the examples below but the issue applies to
> > file:read/2, file:pread/3, etc.
> >
> > $ echo "héllo" > foo
> >
> > $ erl
> > 1> {ok, F} = file:open("foo", [binary, {encoding, unicode}]).
> > {ok,<0.34.0>}
> > 2> {ok, Bin} = file:read_line(F).
> > {ok,<<"héllo\n">>}
> > 3> <<Bin/binary, 0>>.
> > <<104,233,108,108,111,10, 0>>
> >
> >
> > Note the result is not the one desired: I expected a binary
> > containing <<"héllo\n"/utf8>>, or more explicitly, I expected it to
> > contain the bytes <<195, 169>> instead of <<233>>. In other words, I got
> > latin1 as the result. With char lists, we would get "héllo\n" but the
> > function will fail for any codepoint > 255.
> >
>
> What you got isn't latin1:
>
> 1> f(F),  io:format("~w~n",[begin {ok, F} = file:open("foo", [binary]),
> file:read_line(F) end]).
> {ok,<<104,195,169,108,108,111,10>>}
> 2> f(F),  io:format("~w~n",[begin {ok, F} = file:open("foo", [binary,
> {encoding,latin1}]), file:read_line(F) end]).
> {ok,<<104,195,169,108,108,111,10>>}
> 3> f(F),  io:format("~p~n",[begin {ok, F} = file:open("foo", [binary,
> {encoding,latin1}]), file:read_line(F) end]).
> {ok,<<"hÃ©llo\n">>}
> 4> f(F),  io:format("~w~n",[begin {ok, F} = file:open("foo", [binary,
> {encoding,unicode}]), file:read_line(F) end]).
> {ok,<<104,233,108,108,111,10>>}
> 5> f(F),  io:format("~w~n",[begin {ok, F} = file:open("foo", [binary,
> {encoding,utf16}]), file:read_line(F) end]).
> {error,collect_line}
>
> You can see the latin1 reading does not modify the é (195,169) value, it
> interprets it as-is.
>
> On the other hand, the UTF-8 version converts it. 'é' is U+00E9 (233),
> both as a codepoint and as a UTF-16 code unit. The appropriate UTF-8
> representation for it is indeed <<195,169>> (0xC3 0xA9), or alternatively
> <<101,204,129>> (if you used the e + combining acute accent form).
>
> So what you got I think isn't the latin1 result (because latin1
> interprets things as they are) -- what you got is the decoded codepoint
> that would usually be stuck in a list, turned directly into a binary
> without the proper UTF-8 encoding:
>
> 6> unicode:characters_to_binary([104,233,108,108,111,10]).
> <<"héllo\n"/utf8>>
>
> Here's an interesting one:
>
> $ echo "héllo<0001f603>" > foo
> (which is $ echo "héllo😃" > foo with emoji, if the email client supports
> it)
>
> 1> f(F),  io:format("~w~n",[begin {ok, F} = file:open("foo", [binary,
> {encoding,latin1}]), file:read_line(F) end]).
> {ok,<<104,195,169,108,108,111,240,159,152,131,10>>}
> ok
> 2> f(F),  io:format("~w~n",[begin {ok, F} = file:open("foo", [binary,
> {encoding,unicode}]), file:read_line(F) end]).
> {error,collect_line}
>
> I'd be ready to bet that the collect_line error comes from something a
> bit like a list_to_binary(Str) call on a list, assuming what we had were
> still a byte stream rather than a unicode string.
>
>
> Regards,
> Fred.
>