[erlang-questions] benchmarks game harsh criticism

Brent Fulgham bfulg@REDACTED
Wed Nov 28 19:53:02 CET 2007

David Hopwood wrote:
> Then let me be more specific. From the FAQ at
> <http://shootout.alioth.debian.org/gp4/faq.php>:
> # CPU Time means program usr+sys time (in seconds) which includes the
> # time taken to startup and shutdown the program. For language
> # implementations that use a Virtual Machine the CPU Time includes
> # the time taken to startup and shutdown the VM.
> This is an elementary error, sufficiently serious that it's not enough
> just for the FAQ to mention it in passing. It systematically biases the
> results against language implementations with a significant startup
> time, or other fixed overheads. Combined with the fact that most of the
> benchmarks only run for a few seconds, the resulting bias is quite
> significant.

I disagree.  *Some* languages manage to complete the tasks in a few seconds, but
the range of results varies widely.  Take as an example the Fannkuch benchmark
(http://shootout.alioth.debian.org/debian/benchmark.php?test=fannkuch&lang=all),
where the fastest result on my system is about 6 seconds for N=11, and the slowest,
Ruby, takes around 30 minutes.

Your assertion that "most" run for a few seconds is incorrect.

I'm sorry we don't have tests that run for days (many do run for hours on certain
language implementations), but there are some limits to what we can do as a
matter of practicality.  A full run of the benchmark suite already takes over 24
hours on my system.

> Note that just subtracting the time taken by the "startup benchmark" would
> not be sufficient to remove this bias, because of variations in
> startup/shutdown/overhead time between programs (even if the comparison
> graphs were set up to do that).

At one time we did subtract the "startup benchmark" time to try to avoid this problem,
but this also resulted in various cries of foul play.
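For what it's worth, the FAQ's CPU Time rule amounts to measuring the child process's usr+sys time over its whole lifetime, VM startup and shutdown included. Here is a minimal sketch of that measurement (this is an illustration, not the shootout's actual harness code):

```python
import resource
import subprocess

def child_cpu_seconds(cmd):
    """usr+sys CPU time consumed by one child process, including any
    VM startup/shutdown it performs -- the FAQ's definition of CPU Time."""
    before = resource.getrusage(resource.RUSAGE_CHILDREN)
    subprocess.run(cmd, check=True)
    after = resource.getrusage(resource.RUSAGE_CHILDREN)
    return ((after.ru_utime - before.ru_utime)
            + (after.ru_stime - before.ru_stime))

# Subtracting a separately measured "startup benchmark" time from this
# only removes the *average* fixed overhead; run-to-run variation in
# startup cost stays in the numbers, which is the residual bias David
# points out above.
```

(Unix-only, since it relies on getrusage with RUSAGE_CHILDREN.)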

>  The other main factor that makes the shootout almost useless for
> language comparison, is the widely differing amount of optimization
> effort put into the code submissions.

It's a case of Garbage In/Garbage Out.  If people see opportunities to optimize the
programs, and those optimizations do not cause the entry to violate the guidelines
(e.g., the Haskell compilers are good at optimizing away meaningless computations,
which makes it hard to compare apples-to-apples), we use the revised program.

Of course, it is easier for people to complain than it is to suggest optimizations or
provide better solutions.
