Interesting benchmark performance (was RE: Pitiful benchmark perf ormance)

Mon Jun 18 14:50:37 CEST 2001

> Maybe the time has come to adaptively adjust the number of reductions
> before yielding (ie, rescheduling)? 

Perhaps :)

> Here is a portable, straightforward approach: instead of compiling in
> a constant number of reductions, the number of remaining reductions
> should be loaded from the process structure when checking for
> yielding.

As far as I can make out the current behaviour based on number of reductions
generally works very well in the situation where each time the system is
called upon to do some work there are at least 2000 to be done. One possible
improvement (which I mailed before but it got lost) could be:

<suggestion>

The main emulator loop appears to be the code at the very end of sys.c. This
calls schedule() and waits for it to return before any i/o is checked.

schedule() in erl_process.c will *only* return when INPUT_REDUCTIONS have
all been used up.

As far as I can make out the only reason the echo/server benchmark doesn't
deadlock is that there is enough going on in the background in an otherwise
quiescent system to eventually use up the 2000 function calls required to
check i/o.

Maybe schedule() might be modified so that in addition to returning after
INPUT_REDUCTIONS it keeps track of activity in each pass through all
processes and returns if there has been no activity detected for a whole
pass. This should ensure responsiveness is not compromised while not adding
significantly to the context switching load.

<suggestion/>

Since then I have tried to make the echo server benchmark run faster by
adjusting the INPUT_REDUCTIONS and CONTEXT_REDS values and recompiling the
emulator with absolutely no success. The only thing I have found to work is
running a busy loop in the background!!!

> The interesting part, then, is deciding how many reductions you get
> when you're scheduled. A simple approach is to permit the system to
> set the reductions-per-yield at runtime (per process or for the entire
> node), by using a BIF. But this must be supplemented by some way to
> measure activity, so that the decision can be made systematically.
> (Alternatively, one could take the approach that reductions-per-yield
> is set _only_ inside the runtime, to avoid messing around with BIFs.)

The simplest method could be to make it a beam startup parameter rather than
a constant. Then for weird applications the value could easily be tuned for
max performance.

> A second, orthogonal, topic to consider is how well a "reduction"
> corresponds to a time tick. A reduction can vary quite a bit in the
> amount of time it requires, because of BIFs: today, there are ways to
> bump reductions when a long-running BIF begins. 
> 
> Another approach to yielding might be to measure the available time
> slice in hardware cycles, rather than procedure calls. All desktop
> processors have cycle counters, for example, so it is viable for a
> wide range of systems. Unfortunately, the counters are often somewhat
> messy to work with.

It could be an interesting excercise. There appears to have been a large
amout of effort put into all the BIFs to try to make them as fair as
possible. Possibly this is heavier than making them all interruptible based
on a time slice??

Still perplexed

- Sean

NOTICE AND DISCLAIMER:
This email (including attachments) is confidential.  If you have received
this email in error please notify the sender immediately and delete this
email from your system without copying or disseminating it or placing any
reliance upon its contents.  We cannot accept liability for any breaches of
confidence arising through use of email.  Any opinions expressed in this
email (including attachments) are those of the author and do not necessarily
reflect our opinions.  We will not accept responsibility for any commitments
made by our employees outside the scope of our business.  We do not warrant
the accuracy or completeness of such information.