multicore performance, fine-grained concurrency
Johan Montelius
johanmon@REDACTED
Thu May 6 11:03:19 CEST 2010
Hi,
I'm running some benchmarks on an AMD Phenom X4 / Kubuntu 9.10 / R13B04 and
have performance problems when using SMP.
The benchmark is a replicated store where each replicated process has an
ets table. The states are kept synchronized by an underlying layer that
creates an additional group process for each replicated process (this will
change, so don't worry too much about how to reimplement it).
Having only one replicated process (not much of a replication, but I want
to see the overhead), an application layer uses a function interface to
send it a message. The replicated process then forwards the message to the
group layer process. The group layer process, which does not know it is the
leader (again inefficient, but that's not the point), will send itself a
message. When this is received, it forwards the message back to the
replicated process, which does the update of the ets table.
Note that these are all asynchronous messages, i.e. neither the application
process nor the replicated process is suspended waiting for a reply. Each
application layer function call will thus generate four messages:
Appl -> Repl -> Group -> Group -> Repl
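To make the chain concrete, here is a minimal sketch of the four hops; the module, message, and table names are made up for illustration and do not come from the actual benchmark code:

```erlang
%% Sketch of the Appl -> Repl -> Group -> Group -> Repl chain.
%% All names here are hypothetical stand-ins for the real layers.
-module(chain).
-export([start/0, update/2]).

%% Start a replicated process with its own ets table (public and
%% named only so it can be inspected from outside) plus a companion
%% group process.
start() ->
    spawn(fun() ->
        ets:new(store, [set, public, named_table]),
        Repl = self(),
        Group = spawn(fun() -> group(Repl) end),
        repl(Group)
    end).

%% Application layer: an asynchronous function interface.
update(Repl, Entry) ->
    Repl ! {update, Entry},              %% 1: Appl -> Repl
    ok.

repl(Group) ->
    receive
        {update, Entry} ->
            Group ! {multicast, Entry},  %% 2: Repl -> Group
            repl(Group);
        {deliver, Entry} ->
            ets:insert(store, Entry),    %% final step: update the table
            repl(Group)
    end.

group(Repl) ->
    receive
        {multicast, Entry} ->
            self() ! {deliver, Entry},   %% 3: Group -> Group (leader
            group(Repl);                 %%    sends itself a message)
        {deliver, Entry} ->
            Repl ! {deliver, Entry},     %% 4: Group -> Repl
            group(Repl)
    end.
```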
In a benchmark with 24000 of these operations (plus one asynchronous
message to make sure that all previous messages have been handled) I get
the following figures:
smp 4:4 -> 126 ms
smp 2:2 -> 143 ms
smp disabled -> 65 ms
:-(
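For what it's worth, the timing harness I have in mind looks roughly like this (a hypothetical sketch, not the actual benchmark: it assumes the replicated process eventually answers a {sync, Pid} message with 'done' once all earlier updates have been processed):

```erlang
%% Hypothetical timing harness: fire N asynchronous updates, then one
%% round trip so we know the earlier messages have been handled.
-module(bench).
-export([run/2]).

run(Repl, N) ->
    T0 = os:timestamp(),
    fire(Repl, N),
    Repl ! {sync, self()},    %% barrier: per-pair message ordering means
    receive done -> ok end,   %% Repl sees this after the N updates
    timer:now_diff(os:timestamp(), T0) div 1000.  %% milliseconds

fire(_Repl, 0) ->
    ok;
fire(Repl, N) ->
    Repl ! {update, {N, N}},  %% asynchronous: no reply awaited
    fire(Repl, N - 1).
```

In the real benchmark the barrier message would presumably travel the full four-hop chain before the reply, so that the measurement covers every update's complete round trip.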
Is this the normal behavior for programs with fine-grained concurrency?
On the good side, the whole setup is meant to take advantage of the
replicated state and do read operations locally without involving the
group layer. With a ratio of 1/32 (one write in 32 operations) the
overhead is not that big. My initial figures with an increasing number of
replicas gave the following nice numbers (R is the number of replicas; the
24000 operations are divided equally between R application processes):
smp 4:4 / R = 1 -> 44 ms
smp 4:4 / R = 2 -> 32 ms
smp 4:4 / R = 3 -> 29 ms
smp 4:4 / R = 4 -> 28 ms
smp 4:4 / R = 6 -> 32 ms
smp 4:4 / R = 8 -> 36 ms
This looks great until you compare with SMP disabled:
smp dis / R = 1 -> 25 ms
smp dis / R = 2 -> 26 ms
smp dis / R = 3 -> 27 ms
smp dis / R = 4 -> 27 ms
So my four cores can almost beat one of the cores :-)
I have not tried on an Intel platform; is there a difference? Should SMP
default to 1:1?
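For reference, these are the emulator invocations I'm comparing (flag syntax as in the R13 erl man page; adjust to your build):

```shell
erl -smp disable          # non-SMP emulator: one scheduler, no locking overhead
erl -smp enable +S 4:4    # SMP, 4 schedulers with 4 online (Schedulers:SchedulersOnline)
erl -smp enable +S 1:1    # SMP build, but only a single scheduler
```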
Johan
--
Associate Professor Johan Montelius
Royal Institute of Technology - KTH
School of Information and Communication Technology - ICT