multicore performance - smp

Sat May 15 14:50:18 CEST 2010

dear list,

i'm performing some tests to optimize multicore usage. what i want to
understand is the best way to use the 4 cores that my machine has, in
a very simple erlang computation.

i've written this module which basically spawns a very small number of
processes which do a trivial but intensive computation, and report to
a registered 'counter' process when done.

here's the test module:

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

-module(mtest).
-compile(export_all).
-define(NUMCOMPUTE, 100000000).
-define(NUMTIMES, 10).
-define(TIMEOUT, 60000).

start(ProcNum) ->
	% start counter
	register(counter, spawn(fun() -> counter_loop(now(), ProcNum, ProcNum) end)),
	% start compute processes
	[spawn(?MODULE, increase, [0, 0]) || _K <- lists:seq(1, ProcNum)].	

% computational processes
increase(?NUMCOMPUTE, ?NUMTIMES) -> counter ! finished;
increase(?NUMCOMPUTE, Times) -> increase(0, Times + 1);
increase(N, Times) -> increase(N + 1, Times).

% counter loop
counter_loop(Start, 0, ProcNum) ->
	T = timer:now_diff(now(), Start),
	io:format("COMPUTED ADD TO ~p FOR ~p TIMES IN ~p PROCESS(ES) IN ~p SECONDS~n",
		[?NUMCOMPUTE, ?NUMTIMES, ProcNum, T/1000000]);
counter_loop(Start, ProcsLeft, ProcNum) ->
	receive
		finished ->
			counter_loop(Start, ProcsLeft - 1, ProcNum);
		_ ->
			counter_loop(Start, ProcsLeft, ProcNum)
	after ?TIMEOUT -> timeout
	end.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

when i run this module with kernel poll and smp enabled, i get:

roberto$ erl +K true
Erlang R13B04 (erts-5.7.5) [source] [smp:4:4] [rq:4] [async-threads:0] [hipe] [kernel-poll:true]

Eshell V5.7.5  (abort with ^G)
1> c(mtest).
{ok,mtest}
2> mtest:start(1).
[<0.40.0>]
COMPUTED ADD TO 100000000 FOR 10 TIMES IN 1 PROCESS(ES) IN 8.100901 SECONDS
3> mtest:start(2).
[<0.43.0>,<0.44.0>]
COMPUTED ADD TO 100000000 FOR 10 TIMES IN 2 PROCESS(ES) IN 10.367364 SECONDS
4> mtest:start(3).
[<0.47.0>,<0.48.0>,<0.49.0>]
COMPUTED ADD TO 100000000 FOR 10 TIMES IN 3 PROCESS(ES) IN 13.541443 SECONDS
5> mtest:start(4).
[<0.52.0>,<0.53.0>,<0.54.0>,<0.55.0>]
COMPUTED ADD TO 100000000 FOR 10 TIMES IN 4 PROCESS(ES) IN 17.512607 SECONDS
6>

i would have expected that, since 1 process running on 1 core takes
around 8 seconds, running the code in 4 processes [hence on 4 cores]
would take only marginally more time [for smp, startup, etc].
however, i see here that the 4-process run takes more than twice as
long as i would have expected.
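
to put a number on this, here is a rough sketch that computes the
effective speedup from the timings above. the module name `speedup`
and both of its functions are made up for illustration and are not
part of mtest. it assumes that N processes do N times the work of one
process, so the speedup over running them one after another is
N * T1 / TN:

```erlang
-module(speedup).
-export([speedup/3, report/0]).

%% hypothetical helper, not part of mtest.
%% N processes do N times the work of a single process, so the
%% effective speedup over running them back to back is N * T1 / TN,
%% where T1 is the 1-process time and TN the N-process time.
speedup(N, T1, TN) -> N * T1 / TN.

%% timings (in seconds) copied from the shell session above
report() ->
    T1 = 8.100901,
    Runs = [{2, 10.367364}, {3, 13.541443}, {4, 17.512607}],
    [io:format("~p processes: speedup ~.2f (ideal ~p)~n",
               [N, speedup(N, T1, TN), N]) || {N, TN} <- Runs],
    ok.
```

with these timings the 4-process run comes out at roughly 1.85 times
faster than doing the same work serially, against an ideal of 4.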

in this kind of situation, as also reported in other topics, it seems
that using 4 erlang instances with smp disabled would definitely allow
me to run the very same test in 8 seconds, not in 17. i'm prepared to
lose a little for smp, which is normal, but more than doubling the run
time is definitely a high cost.

i would have thought that this kind of parallelism would be handled
with no hassle in erlang: am i approaching this problem in the wrong
way? what should i do instead to achieve my expected results?

thank you,

r.