[erlang-patches] Fix efile_drv crash when using async thread pool
Filipe David Manana
Wed Jan 16 11:38:16 CET 2013
Thank you very much for the better suggestion and all the important
details of the async threads and file driver implementation.
I agree with your suggestion.
However the patch you've sent causes a memory leak.
The new task of type FILE_CLOSE_ON_PORT_EXIT is indeed executed,
closing the file descriptor.
However, because it's executed after the port is closed (file_close
invocation), the port's ready_async callback (file_async_ready) is not
invoked, therefore the switch case for FILE_CLOSE_ON_PORT_EXIT in
never run (see , and confirmed via debugging), which causes the
port data (file_descriptor struct) to never be freed.
Again, thank you :)
See my attached patch on this mail, works on top of the patch you sent
previously. Also here: https://gist.github.com/4546241
 - https://github.com/erlang/otp/blob/master/erts/emulator/beam/erl_async.c#L394
On Tue, Jan 15, 2013 at 4:16 PM, Patrik Nyblom <> wrote:
> Hi Filipe!
> On 01/11/2013 09:12 PM, Filipe David Manana wrote:
>> On Fri, Jan 11, 2013 at 8:00 PM, Björn-Egil Dahlberg <>
>>> On 2013-01-08 11:46, Filipe David Manana wrote:
>>>> Any reason for no feedback on this submission? It's quite critical
>>>> imho, it affects all file:open/2 calls that specify the 'compressed'
>>>> option, even for files that are not compressed. For example,
>>>> file_sorter always opens file with the 'compressed' option, and
>>>> shutting down a process that uses it, leads to these crashes when
>>>> using the async thread pool.
>>> Though your patch probably solves this (crucial) issue something is off.
>> Have been trying it in loops lasting several hours, without no more
>> crashes nor memory/fd leaks, for a few weeks now.
> Yes, this certainly fixes the problem - many thanks for both the excellent
> test case and for pointing out the serious flaw in the code!
>>> We have mechanisms in place to guard against exactly this but apparently
>>> guards are unsuccessful in this case, so therein lies a bug.
>> Which guards? In the allocator for the double gzFile close() call?
>> For the reads for example, there doesn't seem to be any guard at all.
> I mislead Egil into writing that :).
> What I was trying to say, was that the async file driver implementation
> relies on the fact that operations are carried out in order. This is
> maintained by simply giving all operations belonging to one single file to
> the same async thread, as you may have noticed in the code. However, when a
> port is closing, we need to guard the *write que* against deletion as it's
> coupled to the port, not the data sent to the async thread. If the port
> exits, an async operation can access the write que afterwards and - if no
> guard was there - *bang*. So, the port_data_lock guards against that, why
> write operations have to take the port data lock.
> As long as all operations on the file are queued to the async thread, the
> port data lock is all that's needed, as any close of the fd (regardless of
> if it's a gz-close or an ordinary close) will happen after the previously
> scheduled async operations on the file. Two async close operations cannot be
> scheduled as only one scheduler operates on a port at a time, so the simple
> setting of the FILE_FD_INVALID flag when scheduling a close should stop two
> close calls being scheduled. The only thing to be afraid of is data
> belonging to the port, not the file as such, why the port data lock should
> suffice to guard the writev que associated with the actual port.
> That was the supposed guards against simultaneous access to files etc
> So what's the problem then? The problem is that the port_exit callback,
> which is invoked when you kill the port, does *not* schedule the close! It
> executes it directly in the scheduler thread. Your approach to make sure
> that all operations on the file are run, fixes that. However, this means a
> scheduler thread potentially waiting for an NFS operation or something to
> finish in an async thread, so that the port can be closed. This is exactly
> the thing we do not want to happen unless absolutely necessary. That's after
> all one of the purposes of async thread - do not let the scheduler thread
> wait for lengthy file operations.
> My suggestion is that in order to solve this, we should rather schedule the
> close onto the async thread when the port get's an exit signal. I suppose
> the direct call is there to make the close more synchronous, but as we have
> to wait for the other operations anyway, doing the close in the async thread
> would give roughly the same result but without making the scheduler wait on
> a condition variable. Even though file operations on ordinary files actually
> is harmless after a close (which is probably why we havent spotted this) I
> think we should also schedule close for *all* files in this situation, there
> seems to be no real benefit from treating regular files differently in this
> case - a future port of the VM may also be to a platform where operations on
> closed regular files also gives problems. Besides that, avoiding a 'close'
> call in the scheduler thread would be a good thing.
> I've tested this and your test cases pass, but I would like your thoughts on
> this! You're obviously using files a lot and also have a firm grasp on the
> implementation and has recently worked on the problem, so your input would
> be extremely valuable!
> I also supplied a test patch, that replaces your solution with the above
> mentioned and should fix the same problem with this other approach of mine.
> It's based on the current master (with your other patch for leaks included)
> and has your testcase included. Note that this new approach will not fix the
> fd leak you fixed in the other patch, that patch is still needed.
> Furthermore the approach of scheduling the close upon an async thread does
> not work in that situation as the port is really dead there, we cannot
> schedule async jobs. To make that close (the one you added to avoid the fd
> leaks) happen in another thread than the scheduler thread would be nice, but
> I cannot find a viable solution there, so the close (which is safe as there
> can be no other async jobs for that file) has to happen in the scheduler
> thread if we do not add a special thread to do such things - leading to
> (even more) code bloat in efile_drv.
> Please note that the test patch is only smoke tested (I've run the complete
> kernel suite, but more will be run this noght as I'll put it to internal
> testing). Think of it more as a sketch than a real patch for now :)
> And once again, thanks for the extreamly valuable bug report and fix!
>>> I, or someone else from the VM team, needs to look closer at this.
>>> // Björn-Egil
>>>> On Thu, Dec 27, 2012 at 1:07 PM, Filipe David Manana
>>>>> git fetch git://github.com/fdmanana/otp.git fix_efile_drv_crash
>>>>> Filipe David Manana,
>>>>> "Reasonable men adapt themselves to the world.
>>>>> Unreasonable men adapt the world to themselves.
>>>>> That's why all progress depends on unreasonable men."
>>> erlang-patches mailing list
Filipe David Manana,
"Reasonable men adapt themselves to the world.
Unreasonable men adapt the world to themselves.
That's why all progress depends on unreasonable men."
-------------- next part --------------
A non-text attachment was scrubbed...
Size: 3337 bytes
Desc: not available
More information about the erlang-patches