<div dir="ltr">I had a fruitful discussion with Fred on IRC. Fred pointed out that this works:<div><br></div><div><span style="font-family:arial,sans-serif;font-size:13px">$ echo "héllo<0001f603>" > foo</span><br style="font-family:arial,sans-serif;font-size:13px">
<span style="font-family:arial,sans-serif;font-size:13px">(which is $ echo "héllo😃" > foo with emoji, if the email client supports it)</span><br></div><div><br></div><div>$ erl</div><div><div>1> io:format("~w~n",[begin {ok, F} = file:open("foo", [{encoding,unicode}]), file:read_line(F) end]).</div>
<div>{ok,[104,233,108,108,111,128515,10]}</div></div><div><br></div><div>According to the docs of file:read_line/1, this is not supposed to happen:</div><div><br></div><div>"If encoding is set to something else than latin1, the read_line/1 call will fail if the data contains characters larger than 255, why the io(3) module is to be preferred when reading such a file."</div>
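<div><br></div><div>For comparison, here is a minimal sketch of the route the documentation recommends, reading the line through the io module instead of file:read_line/1. The module name, function names and the file name foo_sketch.txt are made up for illustration. The expectation that the result is a UTF-8 binary is based on my reading of the io:get_line/2 documentation, not on the file module's source.</div>
<pre>
```erlang
-module(read_unicode_sketch).
-export([read_first_line/1, demo/0]).

%% Read the first line through io:get_line/2 rather than file:read_line/1.
%% The device is opened in binary mode with a unicode encoding.
read_first_line(Path) ->
    {ok, F} = file:open(Path, [read, binary, {encoding, unicode}]),
    Line = io:get_line(F, ''),
    ok = file:close(F),
    Line.

%% Writes the UTF-8 bytes of "héllo😃\n" to a scratch file (hypothetical
%% name), reads them back and reports whether the bytes are unchanged.
demo() ->
    Path = "foo_sketch.txt",
    Bytes = <<104,195,169,108,108,111,240,159,152,131,10>>,
    ok = file:write_file(Path, Bytes),
    Line = read_first_line(Path),
    ok = file:delete(Path),
    Line =:= Bytes.
```
</pre>
<div>If the io module behaves as documented, read_unicode_sketch:demo() should return true, because binaries delivered by an I/O server are meant to be UTF-8 encoded.</div>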
<div><br></div><div>If the IO device is meant to return all unicode codepoints as above, it means {get_line, Prompt} should translate to {get_line, IODeviceEncoding, Prompt} and we need to amend the I/O protocol to say so.</div>
<div><br></div><div>However, if the result is meant to be invalid, it means file:read_line/1 does an implicit conversion to latin1, since {get_line, Prompt} translates to {get_line, latin1, Prompt}. We could document it but I would rather disallow it by making the requests fail if the IO device encoding is not latin1.</div>
<div><br></div><div>I tried to sum up the possible solutions, in no particular order, to the best of my analysis:</div><div><br></div><div>1. Make it explicit that file:read_line/1 does a latin1 conversion. This means we need to fix the code to raise for codepoints > 255 when returning char lists (but the translation is a confusing behaviour imo)</div>
<div><br></div><div>2. Make it explicit that file:read_line/1 only works if the IO device is encoded in latin1. This means we need to change the code to fail for non latin1 encodings. The old I/O protocol messages would have to be amended to say (addition in bold):</div>
<div><br></div><div>These should behave as {get_until, latin1, Prompt, Module, Function, ExtraArgs}, {get_chars, latin1, Prompt, N} and {get_line, latin1, Prompt} respectively <b>if the device is encoded in latin1, otherwise it should return an error </b>(error to be specified).<br>
</div><div><br></div><div>3. Make file:read_line/1 return in the encoding of the I/O device. This means we need to change the code to translate {get_line, Prompt} to {get_line, IODeviceEncoding, Prompt}. However, this change implies the function is no longer byte-oriented. In any case, the old I/O protocol messages would have to be amended to say (addition in bold):</div>
<div><br></div><div>These should behave as {get_until, DeviceEncoding, Prompt, Module, Function, ExtraArgs}, {get_chars, DeviceEncoding, Prompt, N} and {get_line, DeviceEncoding, Prompt} respectively<b>, where DeviceEncoding is the encoding of the device.</b></div>
<div><br></div><div>4. Make file:read_line/1 always read bytes, regardless of the encoding. This is arguably the behaviour of those messages before unicode was added. This means however the devices would need to implement specific logic for those messages as they cannot simply translate {get_line, Prompt} to {get_line, latin1, Prompt}.</div>
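<div><br></div><div>To make the difference between the options concrete, here is a rough sketch of how an I/O server could rewrite the old requests under the current rule, under option 2 and under option 3. The module name, the function names and the legacy_request error atom are placeholders of mine (the actual error is still to be specified), and this is not how file_io_server is written today.</div>
<pre>
```erlang
-module(io_request_sketch).
-export([as_latin1/1, strict/2, as_device/2]).

%% Current documented rule: old requests behave as their latin1 forms.
%% Requests that already carry an encoding pass through untouched.
as_latin1({get_until, Prompt, M, F, A}) -> {get_until, latin1, Prompt, M, F, A};
as_latin1({get_chars, Prompt, N})       -> {get_chars, latin1, Prompt, N};
as_latin1({get_line, Prompt})           -> {get_line, latin1, Prompt};
as_latin1(Request)                      -> Request.

%% Option 2: old requests are only accepted on latin1 devices.
strict(Request, latin1) ->
    {ok, as_latin1(Request)};
strict(Request, _DeviceEncoding) ->
    case as_latin1(Request) of
        Request -> {ok, Request};            % not an old-style request
        _Legacy -> {error, legacy_request}   % placeholder error reason
    end.

%% Option 3: old requests take the encoding of the device.
as_device({get_until, Prompt, M, F, A}, Enc) -> {get_until, Enc, Prompt, M, F, A};
as_device({get_chars, Prompt, N}, Enc)       -> {get_chars, Enc, Prompt, N};
as_device({get_line, Prompt}, Enc)           -> {get_line, Enc, Prompt};
as_device(Request, _Enc)                     -> Request.
```
</pre>
<div>Option 4 has no counterpart in this sketch, since it cannot be expressed as a rewrite to one of the encoding-aware requests.</div>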
<div><br></div><div>To me, the best solutions are 2 and 4, because they preserve the byte-oriented aspect. Regardless, it seems the implementation of file:read_line/1 has a bug. :)</div></div><div class="gmail_extra"><br clear="all">
<div><div><br></div><div><br></div><div><span style="font-size:13px"><div><span style="font-family:arial,sans-serif;font-size:13px;border-collapse:collapse"><b>José Valim</b></span></div><div><span style="font-family:arial,sans-serif;font-size:13px;border-collapse:collapse"><div>
<span style="font-family:verdana,sans-serif;font-size:x-small"><a href="http://www.plataformatec.com.br/" style="color:rgb(42,93,176)" target="_blank">www.plataformatec.com.br</a></span></div><div><span style="font-family:verdana,sans-serif;font-size:x-small">Skype: jv.ptec</span></div>
<div><span style="font-family:verdana,sans-serif;font-size:x-small">Founder and Lead Developer</span></div></span></div></span></div></div>
<br><br><div class="gmail_quote">On Fri, Jul 4, 2014 at 7:33 PM, Fred Hebert <span dir="ltr"><<a href="mailto:mononcqc@ferd.ca" target="_blank">mononcqc@ferd.ca</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
<div class="">On 07/04, José Valim wrote:<br>
> Hello everyone,<br>
><br>
> I have found the documentation or implementation of file:read/2 to be<br>
> misleading when working with unicode devices in binary mode. I will use<br>
> file:read_line/1 in the examples below but the issue applies to<br>
> file:read/2, file:pread/3, etc.<br>
><br>
> $ echo "héllo" > foo<br>
><br>
> $ erl<br>
> 1> {ok, F} = file:open("foo", [binary, unicode]).<br>
> {ok,<0.34.0>}<br>
> 2> {ok, Bin} = file:read_line(F).<br>
> {ok,<<"héllo\n">>}<br>
> 3> <<Bin/binary, 0>>.<br>
> <<104,233,108,108,111,10, 0>><br>
><br>
><br>
> Note the result is not the one desired: I expected a binary<br>
> containing <<"héllo\n"/utf8>>, or more explicitly, I expected it to contain<br>
> the bytes <<195, 169>> instead of <<233>>. In other words, I got latin1 as<br>
> the result. With char lists, we would get "héllo\n" but the function will fail<br>
> for any codepoint > 255.<br>
><br>
<br>
</div>What you got isn't latin1:<br>
<br>
1> f(F), io:format("~w~n",[begin {ok, F} = file:open("foo", [binary]), file:read_line(F) end]).<br>
{ok,<<104,195,169,108,108,111,10>>}<br>
2> f(F), io:format("~w~n",[begin {ok, F} = file:open("foo", [binary, {encoding,latin1}]), file:read_line(F) end]).<br>
{ok,<<104,195,169,108,108,111,10>>}<br>
3> f(F), io:format("~p~n",[begin {ok, F} = file:open("foo", [binary, {encoding,latin1}]), file:read_line(F) end]).<br>
{ok,<<"héllo\n">>}<br>
4> f(F), io:format("~w~n",[begin {ok, F} = file:open("foo", [binary, {encoding,unicode}]), file:read_line(F) end]).<br>
{ok,<<104,233,108,108,111,10>>}<br>
5> f(F), io:format("~w~n",[begin {ok, F} = file:open("foo", [binary, {encoding,utf16}]), file:read_line(F) end]).<br>
{error,collect_line}<br>
<br>
You can see that the latin1 reading does not modify the é (195,169) value;<br>
it interprets it as-is.<br>
<br>
On the other hand, the UTF-8 version converts it. 'é' can be represented<br>
as U+00E9 (233), either as a codepoint or as a UTF-16 value. The appropriate<br>
UTF-8 representation for that one is indeed <<195,169>> (0xC3 0xA9),<br>
or alternatively <<101,204,129>> (if you used the e+combining ' form).<br>
<br>
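The representations above can be checked with the unicode module. This is<br>
only a small sketch; the module name is made up for illustration, and 769<br>
is U+0301, the combining acute accent.<br>
<pre>
```erlang
-module(e_acute_sketch).
-export([demo/0]).

%% Each line is a pattern match, so demo/0 crashes with badmatch if any
%% of the claimed representations is wrong, and returns ok otherwise.
demo() ->
    %% U+00E9 as a single codepoint, encoded as UTF-8.
    <<195,169>> = unicode:characters_to_binary([233]),
    %% The same codepoint encoded as UTF-16 (big-endian by default).
    <<0,233>> = unicode:characters_to_binary([233], unicode, utf16),
    %% Decomposed form: "e" followed by the combining acute accent.
    <<101,204,129>> = unicode:characters_to_binary([101,769]),
    %% Decoding the UTF-8 bytes gives the codepoint back.
    [233] = unicode:characters_to_list(<<195,169>>),
    %% Read as latin1, the same two bytes are two separate characters.
    [195,169] = unicode:characters_to_list(<<195,169>>, latin1),
    ok.
```
</pre>
<br>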
So what you got I think isn't the latin1 result (because latin1<br>
interprets things as they are) -- what you got is the decoded codepoint<br>
that would usually be stuck in a list, turned directly into a<br>
binary without the proper UTF-8 encoding:<br>
<br>
6> unicode:characters_to_binary([104,233,108,108,111,10]).<br>
<<"héllo\n"/utf8>><br>
<br>
Here's an interesting one:<br>
<br>
$ echo "héllo<0001f603>" > foo<br>
(which is $ echo "héllo😃" > foo with emoji, if the email client supports it)<br>
<br>
1> f(F), io:format("~w~n",[begin {ok, F} = file:open("foo", [binary, {encoding,latin1}]), file:read_line(F) end]).<br>
{ok,<<104,195,169,108,108,111,240,159,152,131,10>>}<br>
ok<br>
2> f(F), io:format("~w~n",[begin {ok, F} = file:open("foo", [binary, {encoding,unicode}]), file:read_line(F) end]).<br>
{error,collect_line}<br>
<br>
I'd be ready to bet that the collect_line error comes from something a<br>
bit like a list_to_binary(Str) call on a list of codepoints, which assumes<br>
that what we have is still a byte stream rather than a unicode string.<br>
<br>
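To illustrate that guess (this is not the actual file module code, just a<br>
sketch with a made-up module name), list_to_binary/1 accepts codepoints<br>
only while they fit in a byte:<br>
<pre>
```erlang
-module(collect_line_sketch).
-export([demo/0]).

demo() ->
    %% Codepoints of "héllo😃\n", as returned in list mode.
    Line = [104,233,108,108,111,128515,10],
    %% Treating codepoints as bytes works while every value is =< 255,
    %% which is how 233 ends up as a single byte in the binary.
    <<104,233>> = list_to_binary([104,233]),
    %% It raises badarg as soon as one value does not fit in a byte.
    badarg = try list_to_binary(Line) catch error:Reason -> Reason end,
    %% Encoding the codepoints as UTF-8 handles both cases.
    <<104,195,169,108,108,111,240,159,152,131,10>> =
        unicode:characters_to_binary(Line),
    ok.
```
</pre>
<br>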
<br>
Regards,<br>
Fred.<br>
</blockquote></div><br></div>