[erlang-bugs] file:read with read_ahead and binaries broken

Björn-Egil Dahlberg
Wed Oct 28 13:50:48 CET 2009


Hi Matthew,

We are aware of this issue and a more aggressive gc-strategy is being 
developed. This will be in place in the next release unless something 
unforeseen happens.

The new strategy involves virtual heaps for binaries, which will also 
trigger GCs when binary heap boundaries are reached, instead of relying 
only on the procbin and binary overhead counting triggers.

The new strategy will also take care of past problems with binaries on 
the old heap.

Regards,
Björn-Egil
Erlang/OTP

Matthew Sackman wrote:
> dd if=/dev/urandom of=/tmp/file.rnd bs=1M count=20
> 
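> -module(test).
> -export([test/1]).
> 
> %% Collect every other byte of the file as a 1-byte binary.
> test(Hdl) ->
>     test(Hdl, []).
> 
> test(Hdl, Acc) ->
>     case file:read(Hdl, 1) of
>         {ok, <<Num:1/binary>>} -> {ok, _Pos} = file:position(Hdl, {cur, 1}),
>                                   test(Hdl, [Num|Acc]);
>         eof -> Acc
>     end.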
> 
> 1> f(), {ok, Hdl} = file:open("/tmp/file.rnd", [read, read_ahead, binary, raw]),
>   X = test:test(Hdl), ok = file:close(Hdl).
> 
> Erlang will die. Badly. erlang:memory() shows that of the 4GB erlang
> has claimed before I kill it, 3.9GB of that is binary data.
> 
> Ways to stop this going nuts:
> 1) Don't use read_ahead
> 2) Remove the position call - instead, read 2 bytes and skip the second
> 3) Add any random term, say 'foo' to the Acc, rather than Num.
> 4) Have Num as an int, not a binary.
> 5) Do the following:
>         {ok, <<Num:8>>} -> {ok, _Pos} = file:position(Hdl, {cur, 1}),
>                            <<Num2:1/binary>> = <<Num:8>>,
>                            test(Hdl, [Num2|Acc]);
> 
> My guess is that what's happening is that the read is reading in a whole
> disk page (as it should), Num is a pointer into the start of that page,
> but the rest of the page beyond the first byte, isn't reclaimed. Then the
> position seemingly invalidates the entire page. This is confirmed by the
> fact that strace -f -c -p $PID shows the same number of calls to read in
> both the read_ahead and non read_ahead versions. Interestingly though,
> there are twice as many calls to lseek in the read_ahead version.
> 
> From inspecting the size of the file itself, both the read_ahead and non
> versions are really issuing a read for every single byte read, and the
> read_ahead version also has the advantage of issuing twice as many
> seeks.
> 
> A quick test shows this happens at least as far back as R12B5, and still
> happens in R13B02.
> 
> Oh and if you follow suggestion (5), you'll find the read_ahead version
> is about 8 times slower than the non read_ahead version.
> 
> Matthew
> 
> ________________________________________________________________
> erlang-bugs mailing list. See http://www.erlang.org/faq.html
> erlang-bugs (at) erlang.org
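
Matthew's guess above is that each 1-byte Num keeps a much larger read
buffer alive. A minimal sketch of how one might check for that and
detach such a binary is given below. It assumes the binary module
(binary:referenced_byte_size/1 and binary:copy/1), which is only
available in OTP releases later than the ones discussed in this thread.
The module name subbin_demo and the function compact/1 are hypothetical
names chosen for illustration. This is not the GC change described in
the reply, only a user-level workaround in the spirit of suggestion (5).

```erlang
-module(subbin_demo).
-export([compact/1]).

%% Return a binary with the same contents as Bin. If Bin references a
%% much larger underlying binary (more than twice its own size), copy
%% it so the large binary is no longer kept alive through Bin.
%% Otherwise return Bin unchanged.
compact(Bin) when is_binary(Bin) ->
    case binary:referenced_byte_size(Bin) of
        Large when Large > 2 * byte_size(Bin) ->
            binary:copy(Bin);
        _ ->
            Bin
    end.
```

In the test/2 loop this would be used as
test(Hdl, [subbin_demo:compact(Num)|Acc]). Whether it actually avoids
the memory growth with read_ahead has not been verified here.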



