[erlang-questions] Fwd: About a multicore benchmark

ok@REDACTED ok@REDACTED
Sun Nov 12 13:44:27 CET 2017


> Hi, I am trying to do a simple benchmark against Python, and we decided to
> use the factorial function for the test
>
> fact(0) -> 1;
> fact(N) -> N * fact(N-1).
>
> But I wonder if there is a way to implement this using a multicore
> approach.

Fairly obviously, yes.
Suppose you have k cores.
Core 1 computes 1 * (k+1) * (2k+1) ...
Core 2 computes 2 * (k+2) * (2k+2) ...
Core 3 computes 3 * (k+3) * (2k+3) ...
...
Core k computes k * (k+k) * (2k+k) ...
and finally you take the product of their answers.
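In Erlang that scheme might be sketched roughly as follows (a sketch
only; the names pfact/1 and strided_product/3 are mine, and I use one
process per online scheduler rather than literally one per core):

-module(pfact).
-export([pfact/1]).

%% Parallel factorial: worker I multiplies I, I+K, I+2K, ...
%% and the parent multiplies the K partial products together.
pfact(N) when is_integer(N), N >= 0 ->
    K = erlang:system_info(schedulers_online),
    Parent = self(),
    Pids = [spawn(fun() -> Parent ! {self(), strided_product(I, N, K)} end)
            || I <- lists:seq(1, K)],
    lists:foldl(fun(Pid, Acc) -> receive {Pid, P} -> Acc * P end end,
                1, Pids).

%% Product of the arithmetic progression I, I+K, I+2K, ... not exceeding N.
strided_product(I, N, _K) when I > N -> 1;
strided_product(I, N, K) -> I * strided_product(I + K, N, K).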

The problem is that it is difficult to see this benchmark as telling
you anything worth knowing.  Factorials get very big very fast.  So
for any N big enough to be worth measuring, what you are really
measuring is the speed of the underlying bignum library.  I expect
that Erlang and Python use the *same* bignum library (gmp), so what
does that tell you about anything *else* in Erlang and Python?
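If you do want to time the sequential version in Erlang anyway,
timer:tc/1 reports microseconds; something like this runs directly
in the shell (a sketch, using a named fun so nothing needs compiling):

Fact = fun F(0) -> 1; F(N) -> N * F(N - 1) end,
{Micros, _Result} = timer:tc(fun() -> Fact(20000) end),
io:format("~p ms~n", [Micros div 1000]).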

For what it's worth, here's your benchmark in Smalltalk:
Time millisecondsToRun: [20000 factorial]
System A: 355 msec.
System B: 347 msec (noise level difference).
System C: 189 msec.
What does this actually tell us about these systems?
Very little.  It just so happens that System C (which is mine)
has a carefully written #factorial function which runs, in
System A or System B, about 40% faster than the naive code they
have.  The underlying bignum library in System C is actually
significantly *worse* than the ones in Systems A and B for
large enough products.
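I won't claim this is what System C does, but one common trick is to
multiply numbers of comparable size, e.g. by splitting the range in
half ("binary splitting"), so the bignum multiplications stay
balanced.  A sketch in Erlang (fact2/1 and range_product/2 are my
names):

%% Factorial via binary splitting of the range 1..N.
fact2(N) when is_integer(N), N >= 0 -> range_product(1, max(N, 1)).

%% Product of the integers Lo..Hi, multiplying halves of similar size.
range_product(Lo, Lo) -> Lo;
range_product(Lo, Hi) when Lo < Hi ->
    Mid = (Lo + Hi) div 2,
    range_product(Lo, Mid) * range_product(Mid + 1, Hi).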

If you want some marginally less unrealistic benchmarks,
you could look at "The Computer Languages Benchmarks Game"
http://benchmarksgame.alioth.debian.org/
You should certainly read
http://benchmarksgame.alioth.debian.org/why-measure-toy-benchmark-programs.html

The Chameneos-Redux benchmark has
Erlang at 62.3 seconds and Python3 at 236.84 seconds.

However, when I tried my own implementation in Smalltalk,
it turned out that almost all the time was going into
shared queue operations and a small tweak to shared queues
*doubled* the speed of the program.

If for example you turn to the NBody benchmark, what you
mostly learn is that programming languages that use "boxed"
floating point numbers do much worse than languages with
unboxed ones, unless they have exceedingly clever compilers.
Erlang's performance suffers for that reason, but Python
does worse.  Here again, a relatively simple compiler hack
(turning  x := t * vx + x  into "if t, vx, and x are all FloatD
then x := BOXD(UNBOXD(t)*UNBOXD(vx)+UNBOXD(x)) else ...")
can halve the amount of boxing, thus doubling the speed.
This is something I believe HiPE can do, given suitable
type information.
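In Erlang source you can hand the compiler that kind of type
information with guards; as I understand it, in a clause like the
sketch below (my example, not taken from the NBody program) the
compiler can keep the intermediate product unboxed and box only the
final result:

%% With these guards the compiler knows the arguments are floats, so
%% the intermediate T * Vx need not be boxed; only the final sum is.
advance(T, Vx, X) when is_float(T), is_float(Vx), is_float(X) ->
    X + T * Vx.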

The lesson is that small benchmarks can be extremely
sensitive to implementation details that are much less
influential in large-scale systems.

Benchmarking is a wonderful way to ask "where is the time
going? how can I do better?" but a lousy way to compare
languages.





