[erlang-bugs] file:read with read_ahead and binaries broken
Björn-Egil Dahlberg
egil@REDACTED
Thu Oct 29 14:25:24 CET 2009
Yes, I did hit send a bit prematurely.
The solution I was talking about does not solve this particular problem.
What's happening here is that the driver is keeping a read_ahead buffer
which is a binary of size 64 kB (if I remember the default cache size
correctly).
Each read will generate a subbinary of the read_ahead buffer which is
kept reachable in the process by pushing the subbinary to a list in the
read-loop.
Each file:position will flush the read_ahead cache and a new binary will
be made to take its place. *repeat until eof*
Each subbinary references its parent binary and forces the gc to keep
that binary, since it is all live data. In this example the total memory
consumption would be roughly 20M x 64K bytes / 2 ~ 640 GB, which is
presumably not what the programmer intended. =)
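
To make the retention concrete, here is a minimal sketch. It assumes
binary:referenced_byte_size/1, which belongs to the binary module of
later OTP releases and is not mentioned in this mail; the module and
function names are made up for illustration. A 128-byte slice is used
rather than a 1-byte one so the result does not depend on how a given
release represents very small binaries.

```erlang
-module(subbin_demo).
-export([sizes/0]).

%% Returns {size of the slice, size of the binary the slice keeps alive}.
sizes() ->
    %% Stand-in for the 64 kB read_ahead buffer held by the driver.
    Cache = binary:copy(<<0>>, 65536),
    %% Matching out a binary segment yields a subbinary pointing into Cache.
    <<Sub:128/binary, _/binary>> = Cache,
    {byte_size(Sub), binary:referenced_byte_size(Sub)}.
```

Calling subbin_demo:sizes() should report a small slice next to the full
size of the buffer it references.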
The main problem here is that each subbinary is kept. It is aggravated
by producing a new binary cache for each read. This is of course easily
remedied by matching numbers instead of binaries. In this case using
<<N:8>> instead of <<N:1/binary>>. Also instead of seeks one could read
2 bytes instead of one. Or, as you said, skip read_ahead, since it won't
give any boost because of the seeks. I realize that this is not the
intent of the test though.
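
The difference between the two patterns can be sketched as follows
(module and function names are hypothetical). Matching an integer
segment copies the value out of the binary, so the result holds no
reference to the buffer; matching a binary segment may hand back a
subbinary that does.

```erlang
-module(match_demo).
-export([byte_as_integer/1, byte_as_binary/1]).

%% <<N:8>> extracts the first byte as an integer: nothing points back at
%% the source binary.
byte_as_integer(<<N:8, _/binary>>) -> N.

%% <<B:1/binary>> extracts the first byte as a binary, which can be a
%% subbinary referencing the source.
byte_as_binary(<<B:1/binary, _/binary>>) -> B.
```

Accumulating the results of byte_as_integer/1 in a list keeps only
integers alive, not the buffers they were read from.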
Is this a bug in the handling of binaries?
No, but perhaps a limitation and not the "least astonishing result".
Users must be aware that a subbinary keeps alive the whole binary it
references, and that as long as the subbinaries are reachable the
referenced binaries cannot be gc:ed. In this case the user must also be
aware that the reads return subbinaries. I think that this could be
clearer in the documentation.
One could argue that seeks should not always flush the cache. I fully
agree with you that this should be avoided. This is something we will
review.
One could also argue that subbinaries should be compacted. This is not
wise for the most common cases. It would kill performance and actually
bloat memory. A user can do this manually by forcing a copy of the
subbinary, which generates a new, separate, smaller binary.
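
Forcing the copy could look like this. It is a sketch that assumes
binary:copy/1 and binary:referenced_byte_size/1 from the binary module
of later OTP releases (neither is named in this mail); the module name
is hypothetical, and the 128-byte slice is an arbitrary choice.

```erlang
-module(copy_demo).
-export([before_and_after/0]).

%% Returns the referenced size of a slice before and after copying it.
before_and_after() ->
    %% Stand-in for a 64 kB read_ahead buffer.
    Cache = binary:copy(<<0>>, 65536),
    <<Sub:128/binary, _/binary>> = Cache,
    %% The copy is a fresh binary that no longer refers to Cache.
    Compact = binary:copy(Sub),
    {binary:referenced_byte_size(Sub),
     binary:referenced_byte_size(Compact)}.
```

Once only Compact is kept, the large buffer is no longer reachable
through it and can be reclaimed.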
Some sort of smart automatic compacting of binaries could be done in the
gc but it is not easily implemented for a number of reasons. Several
strategies for compacting are on the table, but none will be realized
until R14 at the earliest.
I hope you find this information helpful.
*hitting send*
Regards,
Björn-Egil
Erlang/OTP
Matthew Sackman wrote:
> Hi Björn-Egil,
>
> Thanks for the reply, and good to know a solution is in the pipeline.
> However, your solution is only addressing one issue. The other issue
> is why is a read issued when the position call does not move the file
> handle outside of the region currently cached by the read ahead buffer?
> In truth, both the seek and read libc calls can be avoided, or at the
> least, the position can be delayed until some other non-(position or
> read) call - eg truncate or write.
>
> Matthew