[erlang-bugs] Misleading docs or implementation of file:read/2 and friends

Fri Jul 4 19:33:10 CEST 2014

On 07/04, José Valim wrote:
> Hello everyone,
> 
> I have found the documentation or implementation of file:read/2 to be
> misleading when working with unicode devices in binary mode. I will use
> file:read_line/1 in the examples below but the issue applies to
> file:read/2, file:pread/3, etc.
> 
> $ echo "héllo" > foo
> 
> $ erl
> 1> {ok, F} = file:open("foo", [binary, unicode]).
> {ok,<0.34.0>}
> 2> {ok, Bin} = file:read_line(F).
> {ok,<<"héllo\n">>}
> 3> <<Bin/binary, 0>>.
> <<104,233,108,108,111,10, 0>>
> 
> 
> Note the result is not the one desired because I expected a binary
> containing <<"héllo\n"/utf8>>, or more explicitly, I expected it to contain
> the bytes <<195, 169>> instead of <<233>>. In other words, I got latin1 as
> result. With char lists, we would get "héllo\n" but the function will fail
> for any codepoint > 255.
> 

What you got isn't latin1:

1> f(F),  io:format("~w~n",[begin {ok, F} = file:open("foo", [binary]), file:read_line(F) end]).
{ok,<<104,195,169,108,108,111,10>>}
2> f(F),  io:format("~w~n",[begin {ok, F} = file:open("foo", [binary, {encoding,latin1}]), file:read_line(F) end]).
{ok,<<104,195,169,108,108,111,10>>}
3> f(F),  io:format("~p~n",[begin {ok, F} = file:open("foo", [binary, {encoding,latin1}]), file:read_line(F) end]).
{ok,<<"hÃ©llo\n">>}
4> f(F),  io:format("~w~n",[begin {ok, F} = file:open("foo", [binary, {encoding,unicode}]), file:read_line(F) end]).
{ok,<<104,233,108,108,111,10>>}
5> f(F),  io:format("~w~n",[begin {ok, F} = file:open("foo", [binary, {encoding,utf16}]), file:read_line(F) end]).
{error,collect_line}

You can see the latin1 reading does not modify the é (195,169) value; it
interprets the bytes as-is.

On the other hand, the UTF-8 version converts it. 'é' is U+00E9 (233),
both as a codepoint and as a UTF-16 value. The appropriate UTF-8
representation for that one is indeed <<195,169>> (0xC3 0xA9), or
alternatively <<101,204,129>> (if you used the e + combining acute
accent form).
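
To make the byte values above concrete, here is a minimal sketch using
only the stdlib unicode module. The module name e_acute and the function
forms/0 are made up for illustration; nothing here is taken from the
file implementation itself.

```erlang
-module(e_acute).
-export([forms/0]).

%% Hypothetical helper: shows how the two bytes <<195,169>> decode under
%% latin1 and utf8, and how both spellings of 'é' encode to UTF-8.
forms() ->
    %% latin1 keeps each byte as its own character
    [195,169] = unicode:characters_to_list(<<195,169>>, latin1),
    %% utf8 folds the two bytes into the single codepoint U+00E9
    [233] = unicode:characters_to_list(<<195,169>>, utf8),
    %% precomposed form: U+00E9
    Precomposed = unicode:characters_to_binary([16#E9]),
    %% decomposed form: 'e' followed by U+0301 (combining acute accent)
    Decomposed = unicode:characters_to_binary([$e, 16#301]),
    <<195,169>> = Precomposed,
    <<101,204,129>> = Decomposed,
    {Precomposed, Decomposed}.
```

Calling e_acute:forms() should return both binaries; any mismatch in the
values quoted above would show up as a badmatch.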

So I think what you got isn't the latin1 result (because latin1
interprets things as they are) -- what you got is the decoded codepoint
that would usually be stuck in a list, which was then turned directly
into a binary without the proper UTF-8 encoding:

6> unicode:characters_to_binary([104,233,108,108,111,10]).
<<"héllo\n"/utf8>>

Here's an interesting one:

$ echo "héllo<0001f603>" > foo
(which is $ echo "héllo😃" > foo with emoji, if the email client supports it)

1> f(F),  io:format("~w~n",[begin {ok, F} = file:open("foo", [binary, {encoding,latin1}]), file:read_line(F) end]).
{ok,<<104,195,169,108,108,111,240,159,152,131,10>>}
ok
2> f(F),  io:format("~w~n",[begin {ok, F} = file:open("foo", [binary, {encoding,unicode}]), file:read_line(F) end]).
{error,collect_line}

I'd be ready to bet that the collect_line error comes from something a
bit like a list_to_binary(Str) call on a list, made on the assumption
that what we had was still a byte stream rather than a unicode string.
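
As a rough illustration of that guess (I have not checked what the file
server actually does), the sketch below shows how list_to_binary/1 and
unicode:characters_to_binary/1 differ on the two decoded lines. The
module name collect_line_guess and the function demo/0 are made up for
the example.

```erlang
-module(collect_line_guess).
-export([demo/0]).

%% Hypothetical demo: contrasts byte-wise and Unicode-aware conversion
%% of decoded codepoint lists. It is not the actual file server code.
demo() ->
    Small = [104,233,108,108,111,10],           % "héllo\n" as codepoints
    Big   = [104,233,108,108,111,16#1F603,10],  % same line plus U+1F603
    %% list_to_binary/1 treats every integer as a byte: 233 goes in raw,
    %% which matches the binary seen with {encoding,unicode}
    <<104,233,108,108,111,10>> = list_to_binary(Small),
    %% any integer above 255 is not a byte, so the call raises badarg
    badarg = try list_to_binary(Big) catch error:badarg -> badarg end,
    %% the Unicode-aware conversion produces proper UTF-8 for both
    <<104,195,169,108,108,111,10>> = unicode:characters_to_binary(Small),
    <<104,195,169,108,108,111,240,159,152,131,10>> =
        unicode:characters_to_binary(Big),
    ok.
```

If the guess is right, that badarg would be the kind of failure that
ends up reported as {error,collect_line}.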

Regards,
Fred.