[erlang-bugs] Misleading docs or implementation of file:read/2 and friends
Fred Hebert
mononcqc@REDACTED
Fri Jul 4 19:33:10 CEST 2014
On 07/04, José Valim wrote:
> Hello everyone,
>
> I have found the documentation or implementation of file:read/2 to be
> misleading when working with unicode devices in binary mode. I will use
> file:read_line/1 in the examples below, but the issue applies to
> file:read/2, file:pread/1, etc.
>
> $ echo "héllo" > foo
>
> $ erl
> 1> {ok, F} = file:open("foo", [binary, unicode]).
> {ok,<0.34.0>}
> 2> {ok, Bin} = file:read_line(F).
> {ok,<<"héllo\n">>}
> 3> <<Bin/binary, 0>>.
> <<104,233,108,108,111,10, 0>>
>
>
> Note that the result is not the one desired: I expected a binary
> containing <<"héllo\n"/utf8>>, or more explicitly, I expected it to contain
> the bytes <<195, 169>> instead of <<233>>. In other words, I got latin1 as
> the result. With char lists, we would get "héllo\n" but the function will fail
> for any codepoint > 255.
>
What you got isn't latin1:
1> f(F), io:format("~w~n",[begin {ok, F} = file:open("foo", [binary]), file:read_line(F) end]).
{ok,<<104,195,169,108,108,111,10>>}
2> f(F), io:format("~w~n",[begin {ok, F} = file:open("foo", [binary, {encoding,latin1}]), file:read_line(F) end]).
{ok,<<104,195,169,108,108,111,10>>}
3> f(F), io:format("~p~n",[begin {ok, F} = file:open("foo", [binary, {encoding,latin1}]), file:read_line(F) end]).
{ok,<<"héllo\n">>}
4> f(F), io:format("~w~n",[begin {ok, F} = file:open("foo", [binary, {encoding,unicode}]), file:read_line(F) end]).
{ok,<<104,233,108,108,111,10>>}
5> f(F), io:format("~w~n",[begin {ok, F} = file:open("foo", [binary, {encoding,utf16}]), file:read_line(F) end]).
{error,collect_line}
You can see that the latin1 read does not modify the é (195,169) bytes; it
returns them as-is.
On the other hand, the UTF-8 version decodes them. 'é' is the codepoint
U+00E9 (233), which is also its UTF-16 code unit. Its UTF-8 encoding is
indeed <<195,169>> (0xC3 0xA9), or alternatively <<101,204,129>> if you
used the decomposed form (e + combining acute accent).
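To make those byte sequences concrete, here is a small sketch that encodes both spellings of 'é' with unicode:characters_to_binary. The module and function names are made up for illustration; only the unicode calls are real OTP functions.

```erlang
-module(e_acute_forms).
-export([forms/0]).

%% Hypothetical helper: returns the encodings of 'é' discussed above.
forms() ->
    %% Precomposed form, U+00E9, encoded as UTF-8.
    Precomposed = unicode:characters_to_binary([16#E9]),
    %% Decomposed form, 'e' followed by U+0301 (combining acute accent).
    Decomposed = unicode:characters_to_binary([$e, 16#301]),
    %% The same codepoint as a big-endian UTF-16 code unit.
    Utf16 = unicode:characters_to_binary([16#E9], unicode, utf16),
    {Precomposed, Decomposed, Utf16}.
```

Calling e_acute_forms:forms() should give back {<<195,169>>, <<101,204,129>>, <<0,233>>}.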
So I think what you got isn't the latin1 result (because latin1
interprets things as they are). What you got are the decoded codepoints
that would usually end up in a list, turned directly into a binary
without the proper UTF-8 encoding. The proper encoding would be:
6> unicode:characters_to_binary([104,233,108,108,111,10]).
<<"héllo\n"/utf8>>
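As a side-by-side sketch of the difference (the module name is hypothetical, and this is only my reading of what happens, not the actual file server code): stuffing the codepoints into a binary one byte each reproduces the binary from the original report, while encoding them gives the bytes that are actually in the file.

```erlang
-module(codepoints_vs_utf8).
-export([compare/0]).

%% Hypothetical helper comparing the two ways to turn codepoints into a binary.
compare() ->
    %% "héllo\n" as a list of Unicode codepoints.
    Codepoints = [104,233,108,108,111,10],
    %% One byte per codepoint: looks like latin1, matches the reported result.
    Stuffed = list_to_binary(Codepoints),
    %% Proper UTF-8 encoding: 233 becomes the two bytes 195,169.
    Encoded = unicode:characters_to_binary(Codepoints),
    {Stuffed, Encoded}.
```

codepoints_vs_utf8:compare() returns {<<104,233,108,108,111,10>>, <<104,195,169,108,108,111,10>>}.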
Here's an interesting one:
$ echo "héllo😃" > foo
(that is, "héllo" followed by the emoji U+1F603, if the email client supports it)
1> f(F), io:format("~w~n",[begin {ok, F} = file:open("foo", [binary, {encoding,latin1}]), file:read_line(F) end]).
{ok,<<104,195,169,108,108,111,240,159,152,131,10>>}
ok
2> f(F), io:format("~w~n",[begin {ok, F} = file:open("foo", [binary, {encoding,unicode}]), file:read_line(F) end]).
{error,collect_line}
I'd be ready to bet that the collect_line error comes from something
like a list_to_binary(Str) call on the decoded list, written on the
assumption that we still had a byte stream rather than a Unicode string.
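A quick sketch of why that guess is plausible (hypothetical module name, and it says nothing about where collect_line is really raised): list_to_binary only accepts integers in 0..255, so it is fine with 233 but blows up on the emoji codepoint, whereas unicode:characters_to_binary handles both.

```erlang
-module(collect_line_guess).
-export([try_both/0]).

%% Hypothetical helper: the decoded line from the emoji example, as codepoints.
try_both() ->
    Codepoints = [104,233,108,108,111,16#1F603,10],
    %% Treating the codepoints as bytes fails because 16#1F603 > 255.
    Naive = try list_to_binary(Codepoints)
            catch error:badarg -> badarg
            end,
    %% Encoding as UTF-8 works and gives back the bytes read in latin1 mode.
    Utf8 = unicode:characters_to_binary(Codepoints),
    {Naive, Utf8}.
```

collect_line_guess:try_both() returns {badarg, <<104,195,169,108,108,111,240,159,152,131,10>>}.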
Regards,
Fred.