[erlang-questions] benchmarks game harsh criticism

Brent Fulgham <>
Thu Nov 29 09:14:33 CET 2007


On Nov 28, 2007, at 7:42 PM, David Hopwood wrote:

> The times that take longer than a few seconds don't affect my point
> that there is systematic bias against language implementations with
> significant startup/shutdown times.

You say this as though significant startup/shutdown times should be considered acceptable.  I disagree -- take for example SBCL or SML/NJ, both of which have sizeable runtimes and yet manage to produce very good times.

> Apart from the language implementations
> for which performance is not really a serious goal, many of the  
> 'outlier'
> times are due to little or no attention having been paid to  
> optimizing that
> benchmark submission, so that the result is pretty meaningless anyway.

This may be true, but I do not know which contributions have had large amounts of optimization and which have not.  I can speak for the relatively high level of effort put into the GHC, Objective Caml, SML, C, C++, Clean, and some of the Erlang entries, because I either was involved in the optimizations or saw the mailing list discussions of the various iterations.

Various language implementors (or proponents) often provide implementations.  I assume they are the experts in those domains, since they are seeking to put the best face on their language.

If you are aware of particularly egregious examples of bad implementations, I would suggest you make a note of them (perhaps on the Shootout bug tracker) so they can be addressed.

> There are also many results at the low end of the CPU times that  
> strain
> credibility, if they are supposed to be interpreted as useful  
> comparisons
> of language implementation performance (even on hypothetical code  
> for which
> that benchmark would be representative). For example, look at
>
> <http://shootout.alioth.debian.org/gp4/benchmark.php?test=nsieve&lang=all 
> >,

You make a valid point: clearly the nsieve and perhaps the mandelbrot tests need higher values of N.  But in the other entries I see larger variations in the results -- at least enough that the differences seem more likely to be caused by real differences in language implementations.

> Another basic mistake is that there is no indication of the variation
> in timing between benchmark runs. At least, not without digging a bit
> deeper:  the excluded "Haskell GHC #4" result for N=9 on nsieve is  
> 1.12 s,
> but in the full results for nsieve-bits, the result for N=9 on exactly
> the same program run by the same language implementation is 0.80 s.
> So, we have some reason to believe that the timings for benchmark runs
> this short can vary by as much as 40% (greater than many of the
> differences between languages), and the site doesn't give us any
> evidence by which we could trust the timings to be more accurate than
> that.

This looks like some kind of local problem on the GP4 run.  If I look at the two runs on my build system:

http://shootout.alioth.debian.org/debian/benchmark.php?test=nsieve&lang=ghc&id=4
http://shootout.alioth.debian.org/debian/benchmark.php?test=nsievebits&lang=ghc&id=4

they both show the same result (0.74 seconds), with some minor variation in memory use.  The data for all runs are shown on the site, though I will admit you have to click an extra link to see it.  Not everyone cares to see the entire grid of results.
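To make the run-to-run variation question concrete, here is a minimal sketch (my own illustration, not the shootout's actual harness; the function name and run count are assumptions) of how one would measure the spread across repeated timings of a benchmark command -- the kind of check that would confirm or refute a 40% discrepancy like the nsieve one above:

```python
import statistics
import subprocess
import sys
import time

def time_command(argv, runs=5):
    """Time a command several times; return (mean, stdev, spread) in seconds.

    'spread' is (max - min) / min -- run-to-run variation of the kind
    that a 1.12 s vs 0.80 s discrepancy on the same program would show.
    """
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        subprocess.run(argv, check=True,
                       stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
        samples.append(time.perf_counter() - start)
    mean = statistics.mean(samples)
    stdev = statistics.stdev(samples)
    spread = (max(samples) - min(samples)) / min(samples)
    return mean, stdev, spread

if __name__ == "__main__":
    # A trivial stand-in workload; substitute the actual benchmark command.
    mean, stdev, spread = time_command([sys.executable, "-c", "pass"])
    print(f"mean={mean:.3f}s stdev={stdev:.3f}s spread={spread:.1%}")
```

Publishing the spread alongside the mean would answer the objection directly, since readers could then see whether a given difference between languages exceeds the measurement noise.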

>> I'm sorry we don't have tests than run for days
>
> You would have people draw conclusions from benchmarks that run for
> 2 seconds (even 0.55 s in some cases). That's ridiculous. It's a  
> lesson
> in how not to do benchmarking.

I would have people draw conclusions from benchmarks where the fastest programs take 0.55 s and the slowest take 350 seconds.  Is nearly three orders of magnitude of variation not sufficient to draw some conclusions?

If you are concerned with the variation in the sub-second results, it would be reasonable to cull the worst implementations and perform tests for higher values of N.  But that would exclude a wide range of entries, so to some extent it is a matter of limiting ourselves to the lowest common denominator.

>> At one time we did subtract the "startup benchmark" time to try to  
>> avoid
>> this problem, but this also resulted in various cries of foul play.
>
> As it should -- one better way to handle this problem is to make the
> run time long enough that the startup/shutdown/overhead time becomes
> insignificant (because that's what happens with real programs).

That's what happens with *some* programs.  There are many cases where programs run for very short times.  Is it not useful to know which language implementations provide good results under these conditions?

One valid conclusion you could draw from the current benchmark design is that Java is a very poor choice for short-running applications (perhaps something run every few minutes by a cron job).

I agree that the benchmarks don't currently tell you much about a database front-end for an e-commerce website, but that wasn't really what we were attempting to measure at the time the shootout was started.

> Also, if the design of the benchmark suite and the submission rules
> are such that we might expect much of the input to be "garbage",
> then it's reasonable to criticise that design, not the individual
> submissions.

I am not aware of anything in the submission rules that has a goal of
encouraging 'garbage' submissions.

> If a Haskell (or any other) compiler is able to optimize away a
> substantial part of the computation intended to be performed by a
> benchmark, that indicates it wasn't a very good benchmark, even for
> the language implementations that don't do this optimization.

I threw that out as an example of why the benchmarks are sometimes
specified by algorithm rather than just indicating the desired output.

Earlier we began throwing out historical benchmark programs (e.g., the old "loop" test) because they suffered from this problem.  Most of the newer benchmarks are designed to avoid it.
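The usual way to design a benchmark that an optimizer cannot legally gut is to make the printed output depend on the entire computation.  As a minimal sketch of that pattern (my own illustration, in the spirit of the nsieve benchmark rather than any specific shootout submission):

```python
def nsieve(n):
    """Count primes below n with a classic sieve of Eratosthenes."""
    flags = bytearray([1]) * n
    count = 0
    for i in range(2, n):
        if flags[i]:
            count += 1
            # Mark multiples of i as composite.
            for j in range(i * i, n, i):
                flags[j] = 0
    return count

if __name__ == "__main__":
    # Printing a value derived from the whole computation means a
    # compiler cannot skip the sieve work without changing the output.
    print(f"Primes up to 10000: {nsieve(10000)}")
```

Specifying the algorithm *and* requiring such a derived output is what keeps the workload comparable across implementations.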

> As for accepting revisions, that's fine as far as it goes, but it
> doesn't go far enough. To trust the results of a comparison, I don't
> need to know that anyone *could* improve each submission, I need to
> know that the benchmark submissions that are used in that comparison
> *have* all been reasonably well optimized. Unless I review the
> submissions myself (assuming that I'm a competent programmer in
> the languages concerned), how am I supposed to know that?

Then I'm afraid we will have to sadly live with the knowledge that we have failed to satisfy your criteria.  Even if Isaac and I spent the rest of the year analyzing every program submission in the benchmark, I doubt that would do much to ensure that all entries have been "reasonably well optimized."  How could it?  I cannot claim to be an expert in all of these languages.

I suppose that a dedicated team of experts (perhaps funded by a swanky university) could provide this level of assurance, but I'm afraid the unfunded shootout has to get by on the kindness of strangers.

-Brent



