[erlang-questions] Some facts about Erlang and SMP

Tue Sep 16 11:34:03 CEST 2008

Here are some short facts about how the Erlang SMP implementation
works and how it
relates to performance and scalability.

A more detailed description of how multi-core support works, and of
our future plans, will be available in a couple of weeks. I plan to
include some of it in my presentation at the ICFP 2008 Erlang Workshop
in Victoria, BC, on September 27.

The Erlang VM without SMP support has one scheduler, which runs in the
main process thread. The scheduler picks runnable Erlang processes and
IO jobs from the run-queue. There is no need to lock data structures,
since only one thread accesses them.

The Erlang VM with SMP support can have one or more schedulers, each
running in its own thread. The schedulers pick runnable Erlang
processes and IO jobs from one common run-queue. In the SMP VM all
shared data structures are protected with locks; the run-queue is one
example of such a data structure.

From OTP R12B, the SMP version of the VM is started by default if the
OS reports more than one CPU (or core), with as many schedulers as
there are CPUs or cores.

You can see what was chosen in the first line printed by the "erl"
command, e.g.
Erlang (BEAM) emulator version 5.6.4 [source] [smp:4] [asynch-threads:0] .....

The "[smp:4]" above says that the SMP VM is running, with 4 schedulers.

The default behaviour can be overridden with
 "-smp [enable|disable|auto]"   (auto is the default)
and, if smp is set to enable or auto, the number of schedulers can be
set with
 "+S Number"   where Number is the number of schedulers (1..1024)
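
As a small sketch, the same information can also be read from inside a
running node with erlang:system_info/1. The module name smp_info is
hypothetical and only used for illustration; on OTP releases much later
than the ones discussed here, smp_support may always report true.

```erlang
-module(smp_info).
-export([summary/0]).

%% Report what the running VM says about SMP support and the number
%% of scheduler threads it was started with.
summary() ->
    [{smp_support, erlang:system_info(smp_support)},
     {schedulers, erlang:system_info(schedulers)}].
```

Starting the node with different "-smp" and "+S" settings and calling
smp_info:summary() should reflect the values that were chosen.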

Note! There is normally nothing to gain from running with more
schedulers than the number of CPUs or cores.
Note 2! On some operating systems the number of CPUs or cores that a
process may use can be restricted with commands; on Linux, for
example, "taskset" can be used for this. The Erlang VM currently
detects only the number of available CPUs or cores and does not take
the mask set by "taskset" into account. Because of this it can happen,
and has happened, that e.g. only 2 cores are used even though the
Erlang VM runs with 4 schedulers. It is the OS that imposes this
limit, because it does take the mask from "taskset" into account.

The schedulers in the Erlang VM each run on their own OS thread, and it
is the OS that decides whether the threads are executed on different
cores. Normally the OS does this just fine and also keeps a thread on
the same core throughout its execution.

The Erlang processes will be run by different schedulers because they
are picked from a common run-queue by
the first scheduler that becomes available.
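
One way to observe this is to let many short-lived processes report
which scheduler ran them. The following is a minimal sketch; the module
name sched_spread is hypothetical, and the set of ids you get back
depends on the machine, the load and the number of schedulers.

```erlang
-module(sched_spread).
-export([run/1]).

%% Spawn N short-lived processes. Each one sends back the id of the
%% scheduler that executed it. Returns the sorted list of distinct
%% scheduler ids that were seen.
run(N) when is_integer(N), N > 0 ->
    Parent = self(),
    Ref = make_ref(),
    [spawn(fun() ->
               Parent ! {Ref, erlang:system_info(scheduler_id)}
           end) || _ <- lists:seq(1, N)],
    Ids = [receive {Ref, Id} -> Id end || _ <- lists:seq(1, N)],
    lists:usort(Ids).
```

On an SMP VM with several schedulers, sched_spread:run(1000) would
typically return more than one id; on a VM with a single scheduler it
can only return one.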

Performance and scalability
------------------------------------

- The SMP VM with only one scheduler is slightly slower than the
non-SMP VM. The SMP VM needs to use all the locks internally, but as
long as there are no lock conflicts the overhead caused by locking is
not significant (it is the lock conflicts that take time). This
explains why it can in some cases be more efficient to run several SMP
VMs with one scheduler each instead of one SMP VM with several
schedulers. Of course, running several VMs requires that the
application can be split into many parallel tasks which have little or
no communication with each other.

- Whether a program scales well with the SMP VM over many cores
depends very much on the characteristics of the program. Some programs
scale linearly up to 8 and even 16 cores, while other programs barely
scale at all even on 2 cores.
This might sound bad, but in practice many real programs scale well on
the number of cores that are common on the market today, see below.

- Real telecoms products supporting a massive number of simultaneously
ongoing "calls", represented as one or several Erlang processes per
core, have shown very good scalability on dual and quad core
processors. Note that these products were written in the normal Erlang
style long before the SMP VM and multi-core processors were available,
and they could benefit from the Erlang SMP VM without changes and even
without a need to recompile the code.
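
To get a rough feeling for how a workload of independent tasks behaves
on your own machine, something like the following sketch can be timed
under different "+S" settings. The module name par_bench and the toy
workload (summing a list) are assumptions made for illustration; it is
not a benchmark used by the OTP team, and the numbers say nothing about
programs whose processes communicate heavily.

```erlang
-module(par_bench).
-export([run/2, work/2]).

%% Time Workers independent CPU-bound processes, each summing 1..Size.
%% Returns {ElapsedMicroseconds, ListOfSums}.
run(Workers, Size) ->
    timer:tc(?MODULE, work, [Workers, Size]).

%% Spawn the workers, then collect one result per worker in spawn order.
work(Workers, Size) ->
    Parent = self(),
    Refs = [begin
                Ref = make_ref(),
                spawn(fun() ->
                          Parent ! {Ref, lists:sum(lists:seq(1, Size))}
                      end),
                Ref
            end || _ <- lists:seq(1, Workers)],
    [receive {Ref, Sum} -> Sum end || Ref <- Refs].
```

For example, par_bench:run(8, 1000000) can be run once with "+S 1" and
once with the default number of schedulers, and the elapsed times
compared.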

SMP performance is continually improved
------------------------------------------------------

The SMP implementation is continually improved in order to get better
performance and scalability. In each service release
R12B-1, 2, 3, 4, 5 , ..., R13B etc. you will find new optimizations.

Some known bottlenecks
---------------------------------

- The single common run-queue will become a dominant bottleneck as the
number of CPUs or cores increases.
This will be visible from 4 cores and upwards, but 4 cores will
probably still give OK performance for many applications.
We are working on a solution with one run-queue per scheduler as the
most important improvement right now.

- Ets tables involve locking. Before R12B-4 there were 2 locks
involved in every access to an ets table, but in R12B-4 the locking of
the meta-table is optimized to reduce the conflicts significantly (as
mentioned earlier, it is the conflicts that are expensive).
If many Erlang processes access the same table there will be a lot of
lock conflicts, causing bad performance, especially if these processes
spend the majority of their time accessing ets tables.
The locking is on table level, not on record level.
Note! This will have an impact on Mnesia as well, since Mnesia is a
heavy user of ets tables.
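
Since the locking is per table, one possible way to reduce conflicts
for simple data such as counters is to spread the data over several
tables, so that concurrent processes mostly touch different tables.
This is only a sketch of that idea and not something taken from the
VM or from Mnesia; the module name ets_shards and the choice of
hashing on the caller's pid are assumptions made for illustration.

```erlang
-module(ets_shards).
-export([new/1, bump/1, total/1]).

%% Create N public tables, each holding one counter, and return them
%% as a tuple so that a shard can be picked by index.
new(N) when is_integer(N), N > 0 ->
    list_to_tuple([begin
                       T = ets:new(shard, [set, public]),
                       ets:insert(T, {count, 0}),
                       T
                   end || _ <- lists:seq(1, N)]).

%% Increment the counter in the shard selected by hashing the calling
%% process, so different processes tend to hit different tables.
bump(Shards) ->
    I = erlang:phash2(self(), tuple_size(Shards)) + 1,
    ets:update_counter(element(I, Shards), count, 1).

%% Sum the counters of all shards.
total(Shards) ->
    lists:sum([ets:lookup_element(T, count, 2)
               || T <- tuple_to_list(Shards)]).
```

Whether this helps in practice has to be measured for the application
in question; reading the total becomes more expensive as the number of
shards grows.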

...

Our strategy with SMP
-----------------------------

From the very beginning, when we started implementing the SMP VM, we
decided on the strategy:
"First make it work, then measure, then optimize".
We have followed this strategy consistently since the first stable
working SMP VM, which we released in May 2006 (R11B).

There are more known things to improve and we address them one by one,
taking first the one we think gives the most performance per
implementation effort, and so on.

We are putting most of our focus on getting consistently better
scaling on many cores (more than 4).

Best in class
-----------------

Even if there are a number of known bottlenecks, the SMP system
already has good overall performance and scalability, and I believe we
are best in class when it comes to letting the programmer utilize
multi-core machines in an easy, productive way.

/Kenneth Erlang/OTP team, Ericsson