multicore performance fine grained concurrency

Fri May 7 10:24:44 CEST 2010

Over the last few months I did a lot of SMP benchmarking, not to learn
how Erlang SMP performs but rather to learn how to write code that
scales well with Erlang SMP. I had several machines available, including
one with 8 cores and hyperthreading, giving 16 logical cores.

What I found is that it is not *that* easy to write code that scales
(almost) linearly, but it is possible. Bad scaling behaviour almost
always uncovers serial bottlenecks in the code (like the ring, I
guess).

On the other hand, if you have code with a lot of serial paths you
benefit from turning off SMP because the SMP scheduling has its own
overhead. This is natural.
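
To make the serial bottleneck concrete, here is a minimal sketch of a
single ring like the one discussed in the quoted thread below. It is not
the code anyone in the thread actually ran; the module name ring_sketch
and the functions run/2, ring/2, laps/2 and forward/1 are made up for
illustration. One token travels around N processes, so at any instant
only one process is runnable, no matter how many schedulers exist.

```erlang
-module(ring_sketch).
-export([run/2]).

%% Hypothetical helper: build a ring of N processes, pass one token
%% around it Laps times, and return the elapsed time in microseconds.
run(N, Laps) when N >= 1, Laps >= 1 ->
    {Micros, ok} = timer:tc(fun() -> ring(N, Laps) end),
    Micros.

%% The calling process closes the ring: the first process spawned
%% forwards to self(), each later one forwards to the one before it.
ring(N, Laps) ->
    First = lists:foldl(
              fun(_, Next) -> spawn_link(fun() -> forward(Next) end) end,
              self(), lists:seq(1, N)),
    laps(First, Laps).

%% Send the token around once per lap, then shut the ring down.
laps(First, 0) ->
    First ! stop,
    receive stop -> ok end;
laps(First, Laps) ->
    First ! token,
    receive token -> laps(First, Laps - 1) end.

%% Each ring member just passes on whatever it receives.
forward(Next) ->
    receive
        token -> Next ! token, forward(Next);
        stop  -> Next ! stop
    end.
```

Calling ring_sketch:run(1000, 1000) under "erl -smp disable" and again
under "erl -smp enable +S 4:4" should give a comparison of the kind
quoted below, assuming an emulator that still supports both modes. The
absolute numbers depend on machine and OS, so none are given here.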

On May 7, 9:27 am, Raimo Niskanen <raimo+erlang-questi...@REDACTED> wrote:
> On Thu, May 06, 2010 at 01:50:23PM -0400, David N Murray wrote:
> > On May 6, Johan Montelius scribed:
>
> > > smp 4:4 -> 126 ms
> > > smp 2:2 -> 143 ms
> > > smp disabled -> 65 ms
>
> > > :-(
>
> > I saw something similar using the Ring benchmark on both AMD (OpenBSD) and
> > Intel (Vista) dual cores.  Both cores get utilized in the 40-50% range
> > with smp enabled.  The benchmark takes 1/4 of the time with smp disabled
> > compared to smp enabled.  It takes advantage of two cores just fine if you
> > run two OS processes with SMP disabled.  It doesn't do so well with SMP
> > enabled.  The ring benchmark just spawns and sends messages.
>
> There are many different reasons why SMP, and especially SMP benchmarks
> (as Kenneth explained in another mail in this thread), perform poorly.
>
> OpenBSD still does not have native threads, so one OS process only runs on
> one CPU at a time. Threads are implemented as old-style (green)
> threads within that process. The SMP emulator starts, probably with
> as many schedulers as there are CPUs, runs both schedulers
> within one CPU thread, and the OS distributes that load over
> both CPUs. So the maximum possible utilization will be 50% per CPU.
>
> Vista seems to be very eager to distribute the load over the CPUs,
> so execution jumps between them like crazy, which destroys the
> CPU memory cache for every jump, slowing down execution.
>
> Intel CPUs before the i7/i5 have much less memory bandwidth, especially
> between cores, so the SMP emulator performs worse on them than
> on the i7/i5.
>
> Our current best combo, I guess, is Linux (perhaps Solaris, maybe
> FreeBSD) on an Intel i7.
>
> And for the ring benchmark (single ring), is it not so that it sends
> a message around a ring, so that at every instant there is only one
> process that can execute?  So all the SMP emulator can contribute is
> overhead, making this benchmark only measure how much overhead the SMP
> emulator has for pure message passing and process scheduling. The SMP
> emulator can never beat the non-SMP emulator for a single ring, and it
> can load only one CPU, to no more than 100%.
>
>
>
> > smp 2:2 -> 18003 ms
> > smp disabled -> 4867 ms
> > 2 os processes -> ~5900 ms
>
> > hth,
> > Dave
>
> > ________________________________________________________________
> > erlang-questions (at) erlang.org mailing list.
> > See http://www.erlang.org/faq.html
> > To unsubscribe; mailto:erlang-questions-unsubscr...@REDACTED
>
> --
>
> / Raimo Niskanen, Erlang/OTP, Ericsson AB