[erlang-questions] benchmarks game harsh criticism

Thu Nov 29 04:42:24 CET 2007

Brent Fulgham wrote:
> David Hopwood wrote:
>> Then let me be more specific. From the FAQ at
>> <http://shootout.alioth.debian.org/gp4/faq.php>:
>>
>> # CPU Time means program usr+sys time (in seconds) which includes the
>> # time taken to startup and shutdown the program. For language
>> # implementations that use a Virtual Machine the CPU Time includes
>> # the time taken to startup and shutdown the VM.
>>
>> This is an elementary error, sufficiently serious that it's not enough
>> just for the FAQ to mention it in passing. It systematically biases the
>> results against language implementations with a significant
>> startup/shutdown time, or other fixed overheads. Combined with the fact
>> that most of the benchmarks only run for a few seconds, the resulting
>> bias is quite large.
> 
> I disagree.  *Some* languages manage to complete the tasks in a few seconds,
> but the range of results varies widely.

*Most* benchmark runs complete in less than 30 seconds (see below). I didn't
say anything about the range of results.

> Your assertion that "most" run for a few seconds is incorrect.

No, it is clearly correct.

Proportion of language implementations for which each benchmark included in
the default weighting takes less than 10, 30, and 60 seconds CPU time:

                       <10s        <30s        <60s       out of
  binary trees         19 (54%)    23 (66%)    25 (71%)     35
  fannkuch             10 (26%)    22 (58%)    22 (58%)     38
  fasta                0  (0%)     15 (37%)    20 (49%)     41
  k-nucleotide         4  (12%)    16 (47%)    24 (71%)     34
  mandelbrot           23 (61%)    24 (63%)    25 (66%)     38
  n-body               0  (0%)     19 (51%)    23 (62%)     37
  nsieve               27 (68%)    33 (83%)    37 (93%)     40
  nsieve-bits          26 (65%)    32 (80%)    32 (80%)     40
  partial-sums         22 (58%)    34 (89%)    38 (100%)    38
  pidigits             27 (69%)    37 (95%)    38 (97%)     39
  recursive            21 (55%)    24 (63%)    27 (71%)     38
  regex-dna            19 (59%)    29 (91%)    30 (94%)     32
  reverse-complement   35 (83%)    37 (88%)    40 (95%)     42
  spectral-norm        0  (0%)     13 (38%)    19 (56%)     34
  sum-file             23 (47%)    34 (69%)    40 (82%)     49
  thread-ring          3  (23%)    3  (23%)    6  (46%)     13

  total                259 (47%)   395 (72%)   446 (81%)    548

72% under 30 seconds is "most" in letter and spirit. (And yes, I have too
much time on my hands :-)

The times that take longer than a few seconds don't affect my point
that there is systematic bias against language implementations with
significant startup/shutdown times. Apart from the language implementations
for which performance is not really a serious goal, many of the 'outlier'
times are due to little or no attention having been paid to optimizing that
benchmark submission, so that the result is pretty meaningless anyway.

There are also many results at the low end of the CPU times that strain
credibility, if they are supposed to be interpreted as useful comparisons
of language implementation performance (even on hypothetical code for which
that benchmark would be representative). For example, look at

<http://shootout.alioth.debian.org/gp4/benchmark.php?test=nsieve&lang=all>,

and tell me you really believe that for 14 implementations of languages
as different as C, Eiffel, Fortran, BASIC, Forth and Haskell, a range
in CPU times from 2.07 to 2.12 seconds is accurately reflecting the
relative performance of a naïve Sieve of Eratosthenes on those
implementations. Isn't it more likely that there is some source of
constant overhead that is causing the variation between languages to
be compressed?

Another basic mistake is that there is no indication of the variation
in timing between benchmark runs. At least, not without digging a bit
deeper: the excluded "Haskell GHC #4" result for N=9 on nsieve is 1.12 s,
but in the full results for nsieve-bits, the result for N=9 on exactly
the same program run by the same language implementation is 0.80 s.
So, we have some reason to believe that the timings for benchmark runs
this short can vary by as much as 40% (greater than many of the
differences between languages), and the site doesn't give us any
evidence by which we could trust the timings to be more accurate than
that.

> I'm sorry we don't have tests than run for days

You would have people draw conclusions from benchmarks that run for
2 seconds (even 0.55 s in some cases). That's ridiculous. It's a lesson
in how not to do benchmarking.

> (many do run for hours on certain language implementations),

which, for these problem sizes, probably indicates only that the submitted
code is unoptimized, and not suitable for use in a benchmark.

> but there are some limits to what we can do as a matter of practicality.
> A full run of the benchmark already takes over 24 hours on my system.

That's because you're running lots of poorly optimized code that is not
suitable for benchmarking.

If, for the sake of argument, it were not possible to run this benchmark
suite for long enough to get meaningful results, then that is not a
justification for running it anyway, and publishing meaningless and
misleading results.

>> Note that just subtracting the time taken by the "startup benchmark"
>> would not be sufficient to remove this bias, because of variations in
>> startup/shutdown/overhead time between programs (even if the comparison
>> graphs were set up to do that).
> 
> At one time we did subtract the "startup benchmark" time to try to avoid
> this problem, but this also resulted in various cries of foul play.

As it should -- one better way to handle this problem is to make the
run time long enough that the startup/shutdown/overhead time becomes
insignificant (because that's what happens with real programs).

>> The other main factor that makes the shootout almost useless for
>> language comparison, is the widely differing amount of optimization
>> effort put into the code submissions.
> 
> It's a case of Garbage In/Garbage Out.

It's a case of Garbage Out even for non-garbage input.

Also, if the design of the benchmark suite and the submission rules
are such that we might expect much of the input to be "garbage",
then it's reasonable to criticise that design, not the individual
submissions.

> If people feel see opportunities to optimize the programs, and those
> optimizations do not cause the entry to violate the guidelines
> (e.g., the Haskell compilers are good at optimizing away meaningless
> computations, which makes it hard to compare apples-to-apples)
> we use the revised program.

If a Haskell (or any other) compiler is able to optimize away a
substantial part of the computation intended to be performed by a
benchmark, that indicates it wasn't a very good benchmark, even for
the language implementations that don't do this optimization.

Put it this way: why would we want to know the performance of a
meaningless computation?

The output of a good benchmark is specified in such a way that all of
the intended computation is required in order to generate it. But if
one language implementation does manage to "optimize out" more of the
computation for a given submission than another, then a policy of
excluding these submissions is inequitable: how do we know that the
optimization isn't improving the performance of the real programs
that the benchmark is supposed to be representative of?
(Presumably it is in at least some cases, otherwise the compiler
developers wouldn't have put effort into implementing it.)

As for accepting revisions, that's fine as far as it goes, but it
doesn't go far enough. To trust the results of a comparison, I don't
need to know that anyone *could* improve each submission, I need to
know that the benchmark submissions that are used in that comparison
*have* all been reasonably well optimized. Unless I review the
submissions myself (assuming that I'm a competent programmer in
the languages concerned), how am I supposed to know that? And
having each potential user of the site do that review, obviously
isn't an efficient use of their time. There isn't even any way to
express comments about a submission that can be seen by other users.

> Of course, it is easier for people to complain than it is to suggest
> optimizations or provide better solutions.

It is not my responsibility to fix elementary errors in the design
of the 'shootout'. To achieve my immediate goal of having fewer people
be misled by treating it as a useful way of comparing performance
of language implementations, I only need to point out these errors.

-- 
David Hopwood