[erlang-bugs] Misleading docs or implementation of file:read/2 and friends

Fri Jul 4 20:56:38 CEST 2014

On 07/04, José Valim wrote:
> I had a fruitful discussion with Fred on IRC. Fred pointed out that this
> works:
> 
> $ echo "héllo<0001f603>" > foo
> (which is $ echo "héllo😃" > foo with emoji, if the email client supports
> it)
> 
> $ erl
> 1> io:format("~w~n",[begin {ok, F} = file:open("foo",
> [{encoding,unicode}]), file:read_line(F) end]).
> {ok,[104,233,108,108,111,128515,10]}
> 
> According to the docs of file:read_line/1, this is not supposed to happen:
> 
> "If encoding is set to something else than latin1, the read_line/1 call
> will fail if the data contains characters larger than 255, why the io(3)
> module is to be preferred when reading such a file."
> 
> If the IO device is meant to return all unicode codepoints as above, it
> means {get_line, Prompt} should translate to {get_line, IODeviceEncoding,
> Prompt} and we need to amend the I/O protocol to say so.
> 

This example shows not only the bug José mentions here, but also a second
one: with the 'binary' option added, the same read fails outright.

1> f(F),  io:format("~w~n",[begin {ok, F} = file:open("foo", [{encoding,unicode}]), file:read_line(F) end]).
{ok,[104,233,108,108,111,128515,10]}
ok
2> f(F),  io:format("~w~n",[begin {ok, F} = file:open("foo", [binary, {encoding,unicode}]), file:read_line(F) end]).
{error,collect_line}
ok

In this case, what I suspect is going on (without looking at the source) is
that the conversion to a binary is done only *after* everything has been
read as unicode, and a plain 'list_to_binary/1' is applied where
'unicode:characters_to_binary/1' would have been appropriate, if it is
decided that returning characters > 255 is to be supported.
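
To illustrate the difference between the two conversions, here is a minimal
sketch. The module name is made up for the example, and it only demonstrates
how the two BIFs treat the list returned in the first call above; it does not
claim to reproduce what the file server actually does internally.

```erlang
%% Hypothetical module, for illustration only.
-module(collect_line_sketch).
-export([demo/0]).

demo() ->
    %% Code points from the list-mode read_line result shown above.
    Line = [104,233,108,108,111,128515,10],
    %% list_to_binary/1 accepts only integers in 0..255, so the
    %% code point 128515 (U+1F603) makes it raise badarg.
    badarg = try list_to_binary(Line)
             catch error:badarg -> badarg
             end,
    %% unicode:characters_to_binary/1 encodes the code points as UTF-8:
    %% 233 becomes 195,169 and 128515 becomes 240,159,152,131.
    Utf8 = unicode:characters_to_binary(Line),
    <<104,195,169,108,108,111,240,159,152,131,10>> = Utf8,
    Utf8.
```

Calling collect_line_sketch:demo() returns the UTF-8 binary, which is what
one would presumably want back from the [binary, {encoding,unicode}] case.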

Regards,
Fred.