<html><body style="word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space; "><br><div><div>On Nov 28, 2007, at 7:42 PM, David Hopwood wrote:</div><br class="Apple-interchange-newline"><blockquote type="cite">The times that take longer than a few seconds don't affect my point<br>that there is systematic bias against language implementations with<br>significant startup/shutdown times.</blockquote><div><br class="webkit-block-placeholder"></div><div>You say this as though significant startup/shutdown times should be considered</div><div>acceptable.  I disagree -- take for example SBCL or SML/NJ, both of which have</div><div>sizeable runtimes and yet manage to produce very good times.</div><br><blockquote type="cite"> Apart from the language implementations<br>for which performance is not really a serious goal, many of the 'outlier'<br>times are due to little or no attention having been paid to optimizing that<br>benchmark submission, so that the result is pretty meaningless anyway.<br></blockquote><div><br class="webkit-block-placeholder"></div>This may be true, but I do not know which contributions have had large amounts</div><div>of optimization and which have not.  
I can speak for the relatively high level of</div><div>effort put into the GHC, Objective Caml, SML, C, C++, Clean, and some of the Erlang</div><div>entries, because I either was involved in the optimizations, or saw the mailing list</div><div>discussions of the various iterations.</div><div><br class="webkit-block-placeholder"></div><div>Various language implementors (or proponents) often provide implementations.</div><div>I assume they are the experts in those domains since they are seeking to put</div><div>the best face on their language.<br><div><br class="webkit-block-placeholder"></div><div>If you are aware of particularly egregious examples of bad implementations, I</div><div>would suggest you make a note of it (perhaps on the Shootout bug tracker)</div><div>so they can be addressed.</div><br><blockquote type="cite">There are also many results at the low end of the CPU times that strain<br>credibility, if they are supposed to be interpreted as useful comparisons<br>of language implementation performance (even on hypothetical code for which<br>that benchmark would be representative). For example, look at<br><br><<a href="http://shootout.alioth.debian.org/gp4/benchmark.php?test=nsieve&lang=all">http://shootout.alioth.debian.org/gp4/benchmark.php?test=nsieve&lang=all</a>>,<br></blockquote><div><br class="webkit-block-placeholder"></div>You make a valid point:  Clearly the nsieve and perhaps the mandelbrot tests</div><div>need to have higher N values.  But in the other entries I see larger variations</div><div>in the results; at least enough that the differences seem more likely to be</div><div>caused by real differences in language implementations.</div><div><div><br></div><blockquote type="cite">Another basic mistake is that there is no indication of the variation<br>in timing between benchmark runs. 
At least, not without digging a bit<br>deeper:  <span class="Apple-style-span" style="-webkit-text-stroke-width: -1; ">the excluded "Haskell GHC #4" result for N=9 on nsieve is 1.12 s,</span></blockquote><blockquote type="cite">but in the full results for nsieve-bits, the result for N=9 on exactly<br>the same program run by the same language implementation is 0.80 s.<br>So, we have some reason to believe that the timings for benchmark runs<br>this short can vary by as much as 40% (greater than many of the<br>differences between languages), and the site doesn't give us any<br>evidence by which we could trust the timings to be more accurate than<br>that.</blockquote><div><br class="webkit-block-placeholder"></div><div>This looks like some kind of local problem on the GP4 run.  If I look at the two</div><div>runs on my build system:</div><div><br class="webkit-block-placeholder"></div><div><a href="http://shootout.alioth.debian.org/debian/benchmark.php?test=nsieve&lang=ghc&id=4">http://shootout.alioth.debian.org/debian/benchmark.php?test=nsieve&lang=ghc&id=4</a></div><div><a href="http://shootout.alioth.debian.org/debian/benchmark.php?test=nsievebits&lang=ghc&id=4">http://shootout.alioth.debian.org/debian/benchmark.php?test=nsievebits&lang=ghc&id=4</a></div><div><br class="webkit-block-placeholder"></div><div>They both show the same result (0.74 seconds) with some minor variation in memory use.</div><div>The data for all runs are shown on the site, though I will admit you have to click an extra</div><div>link to see it.  Not everyone cares to see the entire grid of results.</div><div><br></div><blockquote type="cite"><blockquote type="cite">I'm sorry we don't have tests than run for days<br></blockquote><br>You would have people draw conclusions from benchmarks that run for<br>2 seconds (even 0.55 s in some cases). That's ridiculous. 
It's a lesson<br>in how not to do benchmarking.<br></blockquote><div><br class="webkit-block-placeholder"></div><div>I would have people draw conclusions from benchmarks where the fastest programs</div><div>are 0.55 s, and the slowest are 350 seconds.  Are three orders of magnitude in</div><div>variation not sufficient to draw some conclusions?</div><div><br class="webkit-block-placeholder"></div><div>If you are concerned with the variation in the sub-second results, it would be reasonable</div><div>to cull out the worst implementations and perform tests for higher values of N.  But that</div><div>would prevent a wide range of entries, so to some extent it is a matter of limiting</div><div>ourselves to the lowest common denominator.</div><blockquote type="cite"></blockquote><br><blockquote type="cite"><blockquote type="cite">At one time we did subtract the "startup benchmark" time to try to avoid<br></blockquote><blockquote type="cite">this problem, but this also resulted in various cries of foul play.<br></blockquote><br>As it should -- one better way to handle this problem is to make the<br>run time long enough that the startup/shutdown/overhead time becomes<br>insignificant (because that's what happens with real programs).<br></blockquote><div><br class="webkit-block-placeholder"></div>That's what happens with *some* programs.  There are many cases where</div><div>programs run for very short times.  
Is it not useful to know which language</div><div>implementations provide good results under these conditions?</div><div><br class="webkit-block-placeholder"></div><div>One valid conclusion you could draw from the current benchmark design</div><div>is that Java is a very poor choice for short running applications (perhaps</div><div>something run every few minutes by a cron job).</div><div><br class="webkit-block-placeholder"></div><div>I agree that the benchmarks don't currently tell you much about a database</div><div>front-end for an e-commerce website, but that wasn't really what we were</div><div>attempting to measure at the time the shootout was started.</div><div><br class="webkit-block-placeholder"></div><div><blockquote type="cite">Also, if the design of the benchmark suite and the submission rules<br>are such that we might expect much of the input to be "garbage",<br>then it's reasonable to criticise that design, not the individual<br>submissions.<br></blockquote><div><br class="webkit-block-placeholder"></div><div>I am not aware of anything in the submission rules that has a goal of</div><div>encouraging 'garbage' submissions.</div><div><br class="webkit-block-placeholder"></div><blockquote type="cite">If a Haskell (or any other) compiler is able to optimize away a<br>substantial part of the computation intended to be performed by a<br>benchmark, that indicates it wasn't a very good benchmark, even for<br>the language implementations that don't do this optimization.<br></blockquote><div><br class="webkit-block-placeholder"></div>I threw that out as an example of why the benchmarks are sometimes</div><div>specified by algorithm rather than just indicating the desired output.</div><div><br class="webkit-block-placeholder"></div><div>Earlier we began throwing out historical benchmark programs (e.g., the old</div><div>"loop" test) because they suffered from this problem.  
Most of the newer</div><div>benchmarks are designed to avoid this.<br><div><br></div></div><div><blockquote type="cite">As for accepting revisions, that's fine as far as it goes, but it<br>doesn't go far enough. To trust the results of a comparison, I don't<br>need to know that anyone *could* improve each submission, I need to<br>know that the benchmark submissions that are used in that comparison<br>*have* all been reasonably well optimized. Unless I review the<br>submissions myself (assuming that I'm a competent programmer in<br>the languages concerned), how am I supposed to know that?</blockquote><div><br class="webkit-block-placeholder"></div><div>Then I'm afraid we will have to sadly live with the knowledge that we have failed</div><div>to satisfy your criteria.  Even if Isaac and I spent the rest of the year analyzing</div><div>every program submission in the benchmark, I doubt that would do much to ensure</div><div>that all entries have been "reasonably well optimized."  How could it?  I cannot</div><div>claim to be an expert in all of these languages.</div><div><br class="webkit-block-placeholder"></div><div>I suppose that a dedicated team of experts (perhaps funded by a swanky</div><div>University) could provide this level of assurance, but I'm afraid that the</div><div>unfunded shootout has to get by on the kindness of strangers.</div><div><br class="webkit-block-placeholder"></div><div>-Brent</div></div></body></html>