[erlang-questions] multicore performance - smp

Sat May 15 16:59:14 CEST 2010

Just to come back with some numbers:

If I run it once here:

Erlang R13B04 (erts-5.7.5) [smp:4:4] [rq:4] [async-threads:0]

2> mtest:start(1).
[<0.36.0>]
COMPUTED ADD TO 100000000 FOR 10 TIMES IN 1 PROCESS(ES) IN 54.647
SECONDS

3> mtest:start(4).
[<0.50.0>,<0.51.0>,<0.52.0>,<0.53.0>]
COMPUTED ADD TO 100000000 FOR 10 TIMES IN 4 PROCESS(ES) IN 49.546
SECONDS

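I already have a speedup here with only 4 processes: four times the work
finishes in less wall-clock time than the single-process run (49.5 vs.
54.6 seconds).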

Then with 20 processes:

2> mtest:start(20).
[<0.270.0>,<0.271.0>,<0.272.0>,<0.273.0>,<0.274.0>,
<0.275.0>,<0.276.0>,<0.277.0>,<0.278.0>,<0.279.0>,<0.280.0>,
<0.281.0>,<0.282.0>,<0.283.0>,<0.284.0>,<0.285.0>,<0.286.0>,
<0.287.0>,<0.288.0>,<0.289.0>]
COMPUTED ADD TO 100000000 FOR 10 TIMES IN 20 PROCESS(ES) IN 188.824
SECONDS

That averages to 9.44 seconds per process. Compared with the 54.6 seconds
of the single-process run, I get roughly a 5.8 times speedup per process
by running it that way.
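
For anyone who wants to redo that arithmetic, here is a minimal sketch. The
module name `speedup` and both function names are made up for illustration;
only the timings come from the runs above.

```erlang
-module(speedup).
-export([per_process/2, speedup/2]).

%% Average wall-clock seconds attributed to each process in a run.
per_process(TotalSeconds, ProcCount) when ProcCount > 0 ->
    TotalSeconds / ProcCount.

%% Ratio of the single-process time to the per-process average.
speedup(SingleSeconds, PerProcessSeconds) when PerProcessSeconds > 0 ->
    SingleSeconds / PerProcessSeconds.
```

Calling `speedup:per_process(188.824, 20)` gives about 9.44, and passing that
result to `speedup:speedup(54.647, P)` gives about 5.79, which is the 5.8
figure quoted above.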

On the other hand, a big surprise is that 1 process takes 8 seconds for
you and 55 for me. I guess it could be that HiPE isn't available in my
Windows build, given that the benchmark is numeric computation by nature...

8> c(mtest, [native]).
./mtest.erl:none: Warning: this system is not configured for native-code
compilation.
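
A quick way to check for HiPE from the shell, without compiling anything, might
look like the sketch below. The module name `hipe_check` is hypothetical, and I
am assuming that `erlang:system_info(hipe_architecture)` returns `undefined` on
a build without HiPE; releases that no longer ship HiPE may reject the argument
instead, so that case is treated as "not available" too.

```erlang
-module(hipe_check).
-export([native_available/0]).

%% Returns true when the running emulator reports a HiPE architecture.
%% Assumption: system_info(hipe_architecture) yields 'undefined' on builds
%% without HiPE; a badarg error is also treated as "not available".
native_available() ->
    try erlang:system_info(hipe_architecture) of
        undefined -> false;
        _Arch -> true
    catch
        error:badarg -> false
    end.
```

If `hipe_check:native_available()` returns `false`, the warning from
`c(mtest, [native])` above is expected and the module runs as ordinary BEAM
code.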

On Sat, May 15, 2010 at 10:14 AM, Fred Hebert <mononcqc@REDACTED> wrote:

> Having 4 processes doesn't mean they'll all be on different cores. The VM
> already runs maybe about 20-30 of them to begin with.
>
> You should see better concurrent behaviour if you were to do the test with
> something like 50 processes on 4 cores versus 50 processes on a single one.
>
> There are other variables coming into play here, but in general, more
> processes will make the behaviour better than fewer processes. The VM is
> meant to run thousands and thousands of processes concurrently in larger
> systems. Small benchmarks like this are likely to show weird results -- it's
> not exactly what Erlang would be optimized for.
>
> On Sat, May 15, 2010 at 8:50 AM, Roberto Ostinelli <roberto@REDACTED> wrote:
>
>> dear list,
>>
>> i'm performing some tests to optimize multicore usage. what i want to
>> understand is the best way to use the 4 cores that my machine has, in
>> a very simple erlang computation.
>>
>> i've written this module which basically spawns a very small number of
>> processes which do a trivial but intensive computation, and report to
>> a registered 'counter' process when done.
>>
>> here's the test module:
>>
>> %%%%%%%%%%%%%%%%%%%%%%%%%%%%%
>>
>> -module(mtest).
>> -compile(export_all).
>> -define(NUMCOMPUTE, 100000000).
>> -define(NUMTIMES, 10).
>> -define(TIMEOUT, 60000).
>>
>> start(ProcNum) ->
>>        % start counter
>>        register(counter, spawn(fun() -> counter_loop(now(), ProcNum,
>> ProcNum) end)),
>>        % start compute processes
>>        [spawn(?MODULE, increase, [0, 0]) || _K <- lists:seq(1, ProcNum)].
>>
>> % computational processes
>> increase(?NUMCOMPUTE, ?NUMTIMES) -> counter ! finished;
>> increase(?NUMCOMPUTE, Times) -> increase(0, Times + 1);
>> increase(N, Times) -> increase(N + 1, Times).
>>
>> % counter loop
>> counter_loop(Start, 0, ProcNum) ->
>>        T = timer:now_diff(now(), Start),
>>        io:format("COMPUTED ADD TO ~p FOR ~p TIMES IN ~p PROCESS(ES) IN ~p
>> SECONDS~n", [?NUMCOMPUTE, ?NUMTIMES, ProcNum, T/1000000]);
>> counter_loop(Start, ProcsLeft, ProcNum) ->
>>        receive
>>                finished ->
>>                        counter_loop(Start, ProcsLeft - 1, ProcNum);
>>                _ ->
>>                        counter_loop(Start, ProcsLeft, ProcNum)
>>        after ?TIMEOUT -> timeout
>>        end.
>>
>> %%%%%%%%%%%%%%%%%%%%%%%%%%%%%
>>
>>
>> when i run this module with kernel poll and smp enabled, i get:
>>
>>
>> roberto$ erl +K true
>> Erlang R13B04 (erts-5.7.5) [source] [smp:4:4] [rq:4] [async-threads:0]
>> [hipe] [kernel-poll:true]
>>
>> Eshell V5.7.5  (abort with ^G)
>> 1> c(mtest).
>> {ok,mtest}
>> 2> mtest:start(1).
>> [<0.40.0>]
>> COMPUTED ADD TO 100000000 FOR 10 TIMES IN 1 PROCESS(ES) IN 8.100901
>> SECONDS
>> 3> mtest:start(2).
>> [<0.43.0>,<0.44.0>]
>> COMPUTED ADD TO 100000000 FOR 10 TIMES IN 2 PROCESS(ES) IN 10.367364
>> SECONDS
>> 4> mtest:start(3).
>> [<0.47.0>,<0.48.0>,<0.49.0>]
>> COMPUTED ADD TO 100000000 FOR 10 TIMES IN 3 PROCESS(ES) IN 13.541443
>> SECONDS
>> 5> mtest:start(4).
>> [<0.52.0>,<0.53.0>,<0.54.0>,<0.55.0>]
>> COMPUTED ADD TO 100000000 FOR 10 TIMES IN 4 PROCESS(ES) IN 17.512607
>> SECONDS
>> 6>
>>
>>
>> i would have expected that, since 1 process running on 1 core takes
>> around 8 seconds, running the code on 4 processes [hence on 4 cores]
>> would have taken only marginally more time [for smp, startup, etc].
>> however, i see here that 4 cores are taking twice as much time as i
>> would have expected.
>>
>> in this kind of situation, as reported also in other topics, it seems
>> that using 4 erlang instances with smp disabled would definitely allow
>> me to run the very same test in 8 seconds, not in 17. i'm prepared to
>> lose a little for smp, which is normal, but doubling the run time is
>> definitely a high cost.
>>
>> i would have thought that this kind of parallelism would have been
>> handled with no hassle in erlang: am i approaching this problem in a
>> wrong manner? what should i do instead to achieve my expected results?
>>
>> thank you,
>>
>> r.