[erlang-bugs] file:read with read_ahead and binaries broken

Björn-Egil Dahlberg
Wed Oct 28 13:50:48 CET 2009

Hi Matthew,

We are aware of this issue, and a more aggressive GC strategy is being
developed. It should be in place in the next release unless something
unforeseen happens.

The new strategy introduces virtual heaps for binaries. A GC will be
triggered when a binary heap boundary is reached, rather than only by
the existing procbin and binary-overhead counting triggers.

The new strategy will also take care of the earlier problems with
binaries on the old heap.
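
To observe how much binary memory is reclaimed once a collection actually runs, something like the following minimal sketch could be used. It is not part of the original message: the module name binmem and the function report/1 are hypothetical, while erlang:memory(binary) and erlang:garbage_collect/0 are standard BIFs. Note that a forced collection only covers the calling process, and binaries still referenced from the result of Fun stay alive.

```erlang
-module(binmem).
-export([report/1]).

%% Hypothetical helper: runs Fun and samples the total memory allocated
%% for binaries before the call, after the call, and after forcing a
%% garbage collection of the calling process.
report(Fun) ->
    Before = erlang:memory(binary),
    Result = Fun(),
    AfterRun = erlang:memory(binary),
    true = erlang:garbage_collect(),
    AfterGc = erlang:memory(binary),
    {Result, [{before, Before}, {after_run, AfterRun}, {after_gc, AfterGc}]}.
```

In the shell this might be called as binmem:report(fun() -> test:test(Hdl), ok end), so that the accumulated list is dropped before the forced collection.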


Matthew Sackman wrote:
> dd if=/dev/urandom of=/tmp/file.rnd bs=1M count=20
>
> test(Hdl) ->
>     test(Hdl, []).
> test(Hdl, Acc) ->
>     case file:read(Hdl, 1) of
>         {ok, <<Num:1/binary>>} -> {ok, _Pos} = file:position(Hdl, {cur, 1}),
>                                   test(Hdl, [Num|Acc]);
>         eof -> Acc
>     end.
>
> 1> f(), {ok, Hdl} = file:open("/tmp/file.rnd", [read, read_ahead, binary, raw]),
>   X = test:test(Hdl), ok = file:close(Hdl).
>
> Erlang will die. Badly. erlang:memory() shows that of the 4GB erlang
> has claimed before I kill it, 3.9GB of that is binary data.
> Ways to stop this going nuts:
> 1) Don't use read_ahead
> 2) Remove the position call - instead, read 2 bytes and skip the second
> 3) Add any random term, say 'foo' to the Acc, rather than Num.
> 4) Have Num as an int, not a binary.
> 5) Do the following:
>         {ok, <<Num:8>>} -> {ok, _Pos} = file:position(Hdl, {cur, 1}),
>                            <<Num2:1/binary>> = <<Num:8>>,
>                            test(Hdl, [Num2|Acc]);
> My guess is that what's happening is that the read is reading in a whole
> disk page (as it should), Num is a pointer into the start of that page,
> but the rest of the page beyond the first byte, isn't reclaimed. Then the
> position seemingly invalidates the entire page. This is confirmed by the
> fact that strace -f -c -p $PID shows the same number of calls to read in
> both the read_ahead and non read_ahead versions. Interestingly though,
> there are twice as many calls to lseek in the read_ahead version.
> From inspecting the size of the file itself, both the read_ahead and non
> versions are really issuing a read for every single byte read, and the
> read_ahead version also has the advantage of issuing twice as many
> seeks.
> A quick test shows this happens at least as far back as R12B5, and still
> happens in R13B02.
> Oh and if you follow suggestion (5), you'll find the read_ahead version
> is about 8 times slower than the non read_ahead version.
> Matthew
> ________________________________________________________________
> erlang-bugs mailing list. See http://www.erlang.org/faq.html
> erlang-bugs (at) erlang.org
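
Workaround (5) in the quoted message rebuilds the one-byte binary so that the accumulator no longer holds a reference into the larger buffer. A minimal sketch of the same idea using binary:copy/1 follows. That function was added to OTP after this thread, so it is not available in the R12B5 and R13B02 releases discussed here. The module name test_copy is hypothetical, and the claim that the copy detaches the byte from the read buffer is an assumption based on the guess in the quoted message.

```erlang
-module(test_copy).
-export([test/1]).

%% Same loop as in the quoted message: read one byte, skip one byte,
%% and accumulate the bytes read (in reverse order).
test(Hdl) ->
    test(Hdl, []).

test(Hdl, Acc) ->
    case file:read(Hdl, 1) of
        {ok, <<Num:1/binary>>} ->
            {ok, _Pos} = file:position(Hdl, {cur, 1}),
            %% binary:copy/1 returns a fresh copy of the byte, so Acc
            %% does not keep a reference to whatever larger binary
            %% Num was a part of (assumed here to be the read buffer).
            test(Hdl, [binary:copy(Num) | Acc]);
        eof ->
            Acc
    end.
```

The call site stays the same as in the quoted shell session, with test_copy:test(Hdl) in place of test:test(Hdl).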
