multicore performance fine grained concurrency

Johan Montelius johanmon@REDACTED
Thu May 6 11:03:19 CEST 2010



Hi,

I'm running some benchmarks on an AMD Phenom X4/Kubuntu 9.10/R13B04 and
have problems with the performance when using smp.

The benchmark is a replicated store where each replicated process has an
ets table. The states are kept synchronized by an underlying layer that
creates an additional group process for each replicated process (this will
change, so don't worry too much about how to reimplement it).

Having only one replicated process (not much of a replication, but I want
to see the overhead), an application layer uses a function interface to
send it a message. The replicated process then forwards the message to the
group layer process. The group layer process does not know it is the
leader (again, inefficient, but that's not the point) and will send itself
a message. When this is received it forwards it to the replicated
process, which performs the update of the ets table.

Note that these are all asynchronous messages, i.e. neither the
application process nor the replicated process is suspended waiting for a
reply. Each application layer function call will thus generate four
messages:

Appl -> Repl -> Group -> Group -> Repl
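In case it helps to see the shape of the chain, here is a minimal sketch of that four-message path; the module, function and message names are my assumptions for illustration, not the actual code:

```erlang
-module(repl_sketch).
-export([start/0, write/2]).

%% Spawn one group process and one replicated process,
%% wire them together, and return the replica's pid.
start() ->
    Group = spawn(fun() -> group(undefined) end),
    Repl  = spawn(fun() -> repl(Group) end),
    Group ! {repl, Repl},
    Repl.

%% Application-layer interface: one asynchronous message (Appl -> Repl).
write(Repl, Op) ->
    Repl ! {write, Op},
    ok.

repl(Group) ->
    Tab = ets:new(store, [private]),
    repl(Group, Tab).

repl(Group, Tab) ->
    receive
        {write, Op} ->
            Group ! {bcast, Op},          % Repl -> Group
            repl(Group, Tab);
        {deliver, {Key, Val}} ->
            ets:insert(Tab, {Key, Val}),  % the actual ets update
            repl(Group, Tab)
    end.

group(Repl) ->
    receive
        {repl, R} ->
            group(R);
        {bcast, Op} ->
            self() ! {order, Op},         % Group -> Group (self-message)
            group(Repl);
        {order, Op} ->
            Repl ! {deliver, Op},         % Group -> Repl
            group(Repl)
    end.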

In a benchmark with 24000 of these operations (plus one final
asynchronous message to make sure that all previous messages have been
handled) I have the following figures:

smp 4:4 -> 126 ms
smp 2:2 -> 143 ms
smp disabled -> 65 ms

:-(
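The timing method (N asynchronous writes, then one trailing message to detect that everything before it has been handled) can be sketched roughly like this; the names, and the erlang:monotonic_time/1 call, are assumptions of mine (on R13 one would use timer:tc/3 or now/0 instead):

```erlang
-module(bench_sketch).
-export([run/1]).

%% Fire N asynchronous writes at a store process, then queue one final
%% sync probe behind them; when the reply arrives, every earlier
%% message has been processed, so the elapsed time covers all N ops.
run(N) ->
    Self = self(),
    P = spawn(fun() -> loop(ets:new(store, [private])) end),
    T0 = erlang:monotonic_time(millisecond),
    [P ! {write, K, K} || K <- lists:seq(1, N)],
    P ! {sync, Self},                  % travels behind all the writes
    receive done -> ok end,
    erlang:monotonic_time(millisecond) - T0.

loop(Tab) ->
    receive
        {write, K, V} -> ets:insert(Tab, {K, V}), loop(Tab);
        {sync, From}  -> From ! done, loop(Tab)
    end.
```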

Is this the normal behavior for programs with fine-grained concurrency?


On the good side, the whole setup is meant to take advantage of the
replicated state to do read operations locally without involving the
group layer. With a ratio of 1/32 (one write in 32 operations) the
overhead is not that big. My initial figures with an increasing number of
replicas gave the following nice results (R is the number of replicas;
the 24000 operations are divided equally between the R application
processes):


smp 4:4 / R = 1 -> 44 ms
smp 4:4 / R = 2 -> 32 ms
smp 4:4 / R = 3 -> 29 ms
smp 4:4 / R = 4 -> 28 ms
smp 4:4 / R = 6 -> 32 ms
smp 4:4 / R = 8 -> 36 ms

This looks great until you compare with smp disabled:

smp dis / R = 1 -> 25 ms
smp dis / R = 2 -> 26 ms
smp dis / R = 3 -> 27 ms
smp dis / R = 4 -> 27 ms

So my four cores can almost beat one of the cores :-)
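For completeness, the read-mostly workload (one write in 32 operations, with reads served straight from the local ets table and only writes going through the group layer) might look something like this sketch; all names are assumed:

```erlang
-module(mix_sketch).
-export([run/2]).

%% Run N operations against a local ets table: every 32nd operation is
%% a write (standing in for the Appl -> Repl -> Group -> Group -> Repl
%% chain), the rest are plain local reads that involve no messages.
run(Tab, N) ->
    lists:foreach(
      fun(K) ->
              case K rem 32 of
                  0 -> write(Tab, K);        % expensive replicated path
                  _ -> ets:lookup(Tab, K)    % local read, no messages
              end
      end,
      lists:seq(1, N)).

%% Placeholder for the replicated write path.
write(Tab, K) ->
    ets:insert(Tab, {K, K}).
```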

I have not tried this on an Intel platform; is there a difference? Should
smp default to 1:1?

   Johan


-- 
Associate Professor Johan Montelius
Royal Institute of Technology - KTH
School of Information and Communication Technology - ICT

