[erlang-questions] benchmarks game harsh criticism

Thu Nov 29 06:09:56 CET 2007

Isaac Gouy wrote:
> --- David Hopwood <david.hopwood@REDACTED> wrote:
> -snip-
>> Then let me be more specific.
> 
> Thank you for being more specific.
> 
>> From the FAQ at <http://shootout.alioth.debian.org/gp4/faq.php>:
>>
>> # CPU Time means program usr+sys time (in seconds) which includes the
>> # time taken to startup and shutdown the program. For language
>> # implementations that use a Virtual Machine the CPU Time includes
>> # the time taken to startup and shutdown the VM.
>>
>> This is an elementary error, sufficiently serious that it's not
>> enough just for the FAQ to mention it in passing. It systematically
>> biases the results against language implementations with a
>> significant startup/shutdown time, or other fixed overheads. Combined
>> with the fact that most of the benchmarks only run for a few seconds,
>> the resulting bias is quite large.
> 
> Specifically how large is the resulting bias?

Probably about 10% in some cases (for JVM-based implementations and
Smalltalk).

> Is it large enough that we should reassess the 97.6 seconds that the
> HiPE program takes for fannkuch down to the 5.99 seconds taken by the C
> program, or only large enough that we should reassess it to 97.0
> seconds?

My comments were not in any way specific to Erlang. (For Erlang,
you should be looking at the effect on the pidigits benchmark, for
example, which takes 4.36 seconds in HiPE.)

The largest bias is likely to be against the JVM-based language
implementations (Java, Nice, CAL and Scala) and Smalltalk.
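
To illustrate why a fixed overhead weighs far more on short runs than on
long ones, here is a minimal Python sketch. The 0.4 second overhead is a
made-up figure chosen only for illustration, not a measurement of any
implementation, and the function name is hypothetical; the two run times
are the HiPE figures quoted above.

```python
def startup_share(total_cpu_seconds, fixed_overhead_seconds):
    """Return the fraction of a measured CPU time that is fixed overhead."""
    if not 0 <= fixed_overhead_seconds <= total_cpu_seconds:
        raise ValueError("overhead must lie between 0 and the total time")
    return fixed_overhead_seconds / total_cpu_seconds


# ASSUMPTION: hypothetical startup+shutdown cost in seconds, not a measurement.
ASSUMED_OVERHEAD = 0.4

# Run times as quoted in this thread (HiPE, in seconds).
quoted_times = [("fannkuch", 97.6), ("pidigits", 4.36)]

for name, total in quoted_times:
    share = startup_share(total, ASSUMED_OVERHEAD)
    print(f"{name}: {share:.1%} of {total} s would be fixed overhead")
```

With the same assumed overhead, the share is well under 1% of the
fannkuch time but roughly 9% of the pidigits time, which is why the bias
matters mainly for benchmarks that run for only a few seconds.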

> -snip-
>> The other main factor that makes the shootout almost useless for
>> language comparison, is the widely differing amount of optimization
>> effort put into the code submissions.
> 
> Firstly, I think that may be a criticism of benchmarks in general, I
> don't recall seeing published benchmarks with a statement of how much
> optimization effort was put into each program. 

If you write a benchmark yourself, for example, you know how much
effort has been put into optimizing it. So this isn't a criticism that
necessarily applies to all benchmarks (but even if it were, that wouldn't
stop it from being a valid criticism of the shootout).

> Secondly, I don't think you know that there was a widely differing
> optimization effort - it's just an assumption.

It is based on having seen discussions of the shootout on several
language mailing lists and newsgroups, and observing the variation in
effort that was put into improving the submissions in each case.

-- 
David Hopwood