[erlang-bugs] file:read with read_ahead and binaries broken

Björn-Egil Dahlberg egil@REDACTED
Thu Oct 29 14:25:24 CET 2009


Yes, I did hit send a bit prematurely.

The solution I was talking about does not solve this particular problem.

What's happening here is that the driver is keeping a read_ahead buffer 
which is a binary of size 64 kB (if I remember the default cache size 
correctly).

Each read will generate a subbinary of the read_ahead buffer which is 
kept reachable in the process by pushing the subbinary to a list in the 
read-loop.

Each file:position will flush the read_ahead cache and a new binary will 
be made to take its place. *repeat until eof*
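
The original test program is not shown in this thread, so the following 
is only a sketch of the access pattern described above. It assumes a raw 
binary file opened with a 64 kB read_ahead buffer, a one-byte read 
followed by a one-byte forward seek, and every result pushed onto a list. 
The module name, function names and the exact seek distance are 
hypothetical.

```erlang
-module(read_loop_sketch).
-export([run/1]).

%% Hypothetical reconstruction of the pattern discussed in this thread;
%% it is NOT the original test case.
run(Path) ->
    {ok, Fd} = file:open(Path, [read, raw, binary, {read_ahead, 65536}]),
    try
        loop(Fd, [])
    after
        file:close(Fd)
    end.

loop(Fd, Acc) ->
    case file:read(Fd, 1) of
        {ok, <<B:1/binary>>} ->
            %% Skip one byte (assumed seek distance).
            {ok, _NewPos} = file:position(Fd, {cur, 1}),
            %% Keeping B in the list keeps it reachable. If B is a
            %% subbinary, the buffer it points into stays alive too.
            loop(Fd, [B | Acc]);
        eof ->
            lists:reverse(Acc)
    end.
```

Called on a large file, a loop of this shape retains one small binary 
per read, which is the situation the rest of this message analyses.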

Each subbinary references the binary it was cut from and forces the gc 
to keep that binary, since it is still live data. In this example the total 
memory consumption would be roughly 20M x 64K bytes / 2 ~ 640 GB, which 
is probably not what the programmer intended. =)
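
The retention effect can be observed with binary:referenced_byte_size/1 
from the binary module, which was added to OTP after this message was 
written and is not mentioned in the thread. This is a minimal sketch, 
not the driver's code: a 64 kB binary stands in for the read_ahead 
buffer, and the slice is 100 bytes rather than 1 byte on the assumption 
that some runtime versions may copy very small slices instead of 
referencing the parent. The module name is hypothetical.

```erlang
-module(subbin_demo).
-export([run/0]).

%% Returns {size of the slice, size of the binary the slice keeps alive}.
run() ->
    %% Stand-in for the 64 kB read_ahead buffer.
    Cache = binary:copy(<<0>>, 65536),
    %% Matching with /binary yields a subbinary that points into Cache.
    <<Sub:100/binary, _/binary>> = Cache,
    {byte_size(Sub), binary:referenced_byte_size(Sub)}.
```

The slice itself is 100 bytes, but the second element of the tuple 
reports the size of the whole parent binary that it keeps alive.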

The main problem here is that each subbinary is kept. It is aggravated 
by producing a new binary cache for each read. This is of course easily 
remedied by matching numbers instead of binaries, in this case using 
<<N:8>> instead of <<N:1/binary>>. Also, instead of seeking, one could 
read 2 bytes instead of one. Or, as you said, skip read_ahead since it 
won't give any boost because of the seeks. I realize that this is not the 
intent of the test though.
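
The difference between the two match patterns can be sketched as 
follows. The module and function names are made up for illustration.

```erlang
-module(match_demo).
-export([as_integer/1, as_binary/1]).

%% Extracts the first byte as an integer. An integer carries no
%% reference to Bin, so it cannot keep a larger binary alive.
as_integer(<<N:8, _/binary>>) ->
    N.

%% Extracts the first byte as a binary. The result may be a subbinary
%% that references the binary it was matched out of.
as_binary(<<B:1/binary, _/binary>>) ->
    B.
```

For example, match_demo:as_integer(<<65,66>>) gives the integer 65, 
while match_demo:as_binary(<<65,66>>) gives the one-byte binary <<65>>.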

Is this a bug in the handling of binaries?
No, but perhaps a limitation and not the "least astonishing result". 
Users must be aware that a subbinary keeps alive the whole binary it 
references, and that keeping subbinaries reachable prevents those 
binaries from being gc:ed. In this case the user must also be aware that 
the reads return subbinaries. I think that this could be clearer in the 
documentation.

One could argue that seeks should not always flush the cache. I fully 
agree with you that this should be avoided. This is something we will 
review.

One could also argue that subbinaries should be compacted. This is not 
wise for the most common cases. It would kill performance and actually 
bloat memory. A user can do this manually by forcing a copy of the 
subbinary. This will generate a new, separate, smaller binary.
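
One way to force such a copy is binary:copy/1. Note that the binary 
module postdates this message, so this is an assumption about how the 
advice would be written today rather than what was available at the 
time. A minimal sketch, with a hypothetical module name and a 100-byte 
slice of a 64 kB stand-in buffer:

```erlang
-module(copy_demo).
-export([run/0]).

%% Returns the number of bytes kept alive by a subbinary and by a
%% forced copy of that same subbinary.
run() ->
    Cache = binary:copy(<<0>>, 65536),
    <<Sub:100/binary, _/binary>> = Cache,
    %% binary:copy/1 allocates a fresh binary holding only Sub's bytes.
    Copy = binary:copy(Sub),
    {binary:referenced_byte_size(Sub),
     binary:referenced_byte_size(Copy)}.
```

Storing Copy instead of Sub in a long-lived list lets the gc reclaim 
the 64 kB parent once nothing else refers to it.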

Some sort of smart automatic compacting of binaries could be done in the 
gc but it is not easily implemented for a number of reasons. Several 
strategies for compacting are on the table but none will be realized 
until R14 at the earliest.

I hope you find this information helpful.

*hitting send*

Regards,
Björn-Egil
Erlang/OTP


Matthew Sackman wrote:
> Hi Björn-Egil,
> 
> Thanks for the reply, and good to know a solution is in the pipeline.
> However, your solution is only addressing one issue. The other issue
> is why is a read issued when the position call does not move the file
> handle outside of the region currently cached by the read ahead buffer?
> In truth, both the seek and read libc calls can be avoided, or at the
> least, the position can be delayed until some other non-(position or
> read) call - eg truncate or write.
> 
> Matthew
> 
> ________________________________________________________________
> erlang-bugs mailing list. See http://www.erlang.org/faq.html
> erlang-bugs (at) erlang.org



