Asynchronous thread pool, scheduling

Sun Aug 4 07:44:45 CEST 2002

I've been bitten by a problem with how threads in the asynchronous
thread pool are scheduled.  I've got a suggestion for how they ought
to be scheduled, how not to use them, and some corrections for the
"driver_entry" page in the ERTS Reference Manual.

The root of the scheduling problem, IMO, is that each thread has its
own work queue, instead of a single shared queue.  A work item is
scheduled along with a scheduling "key".  The value of that key modulo
the number of work threads determines which thread will execute that
item, and the item is added to that thread's work queue.  If there is
no key specified, the thread is chosen "round robin", and the item is
added to that thread's work queue.

The net effect of this round robin thread scheduling is that the
chosen thread may be busy doing something else, whereas there may be
other threads sitting idle.  If I cared about which thread executed my
work item, I would specify a key.  But if I don't care, I'd like to be
able to use "the first available thread".(*)

I'm playing with a driver with async work threads that block for 30
seconds or more.  It's disconcerting to have such a thread block, then
try running "debugger:start()" and have it fail with a timeout: an
"efile_drv" work item is stuck because it was assigned to the same
thread that is working on the long-blocking item.(**) (***)

Perhaps a better answer is to have two thread pools: one for use by
items with a key specified, and another for "first available thread"
scheduling?  Or perhaps the answer is that driver writers will simply
need to manage their own private work threads in more situations?

The description of driver_async() at the end of the "erl_driver"
document in the ERTS Reference Manual (version 5.1.1) contains some
incorrect information.  "A specific instance of the driver always uses
the same thread" is incorrect.  It correctly describes how "efile_drv"
is currently implemented, but it is not true in general.  The value of
driver_async()'s "key" argument determines which thread executes a work
item: only items submitted with the same key are guaranteed to run on
the same thread.

Any better ideas?

-Scott

(*) If I _really_ cared about which thread executed my work item, my
driver would maintain its own private thread(s).  However, it would
be nice to see the existing shared thread pool be as useful as
possible so that driver writers are forced to use their own private
thread(s) only when they have really weird threading constraints.

(**) An answer to this problem is the same as the answer to the
question "Doctor, it hurts when I do _this_": don't block for "long"
periods of time.  Unfortunately, it's difficult to get consensus on
what "long" means for all situations.

There is an incentive to avoid long-blocking threads: since you can't
simply kill a thread (on most platforms), you have to wait for it to
unblock before it can be told to shut itself down.  {sigh}
And managing the resource problems this can cause (whether in other
threads or in other OS processes) can lead to headaches and loss of
sleep.  :-)

(***) FreeBSD 4.5 behaves even worse than Linux does in this
situation.  The main thread somehow gets blocked: the process that
deals with the tty freezes.  Typed characters, including Control-g,
don't get echoed until the work thread finishes.  This doesn't make
much sense, given how the async work thread stuff is supposed to work.
I haven't found the root cause yet....