[erlang-questions] Why using pmap is not giving me the desired speedup ?
Richard O'Keefe
ok@REDACTED
Wed Nov 18 00:50:21 CET 2009
On Nov 17, 2009, at 7:45 PM, Tushar Deshpande wrote:
> I imagined that I would get some observable
> speedup with parallel quicksort.
>
> Contrary to my expectations, parallel version
> took more time to run.
>
> What could be the reason ?
Why is parallel quicksort effective on shared memory multiprocessors?
Because there is an O(n) in-place partitioning, following by sparking
two activities, which need to be told only "where is the first element
of my sublist" and "how many elements" (2 words), and which need no
synchronisation between them, the only synchronisation being for the
parent to wait for its two children. Indeed, we only need one child.
psort(a, n) =
if n is small then
use insertion sort
else
choose pivot x
partition (a,n) into (a,L), (a+L,E), (a+L+E,G)
where L is the number of elements < x, E the
number of elements = x, G the number > x.
child = fork psort(a+L+E,G)
psort(a,L)
wait for child
end if
Tushar Deshpande's parallel quicksort, thanks to its use of
pmap, creates two children, when it only needs to create one.
Not only that, the left and right sublists have to be *sent*
to each child, that's O(n) copying, the sorted versions have
to be *received* from each child, that's O(n) copying again,
and then they have to be concatenated. The sequential version
has only to concatenate.
It would be interesting to try
pqsortB(L) ->
pqsortB(L, length(L)).
pqsortB(L, N) when N < 10 -> % tunable threshold
lists:sort(L);
pqsortB([Pivot|L], _) ->
ppartitionB(L, Pivot, [], 0, [Pivot], [], 0).
ppartitionB([], _, Lss, Nlss, Eql, Gtr, Ngtr) ->
Self = self(),
Pid = spawn(fun () -> Self!{self(),pqsortB(Gtr, Ngtr)} end),
Left = pqsortB(Lss, Nlss),
receive {Pid,Right} -> Left++(Eql++Right) end;
ppartitionB([X|Xs], Pivot, Lss, Nlss, Eql, Gtr, Ngtr) ->
if X < Pivot -> ppartitionB(Xs, Pivot, [X|Lss], Nlss+1, Eql, Gtr,
Ngtr)
; X > Pivot -> ppartitionB(Xs, Pivot, Lss, Nlss, Eql, [X|Gtr],
Ngtr+1)
; true -> ppartitionB(Xs, Pivot, Lss, Nlss, [X|Eql], Gtr, Ngtr)
end.
In fact it seemed interesting enough that I *did* try it.
I measured the time to sort a list of 10,000 uniformly generated
random floats, on a machine with a 500MHz CPU and 84 MHz memory.
All times in this message are the smallest of 5 runs.
120 msec Tushar Deshpande's qsort
110 msec Sequential qsort with single partition pass as above
and similarly switching over to lists:qsort for small N.
110 msec Same as above but not bothering with counts, checking
for large -vs- small by [Pivot|L = [_,_,_|_]] or not
and sorting 0 to 3 elements inline.
80 msec Same as previous, but qsort(L, T) == qsort(L)++T, to
try to reduce the amount of concatenation.
930 msec Tushar Deshpande's pqsort (with a simplified version of
pmap/2 that doesn't bother with error handling)
230 msec The pqsortB/1 function above.
With Tushar Deshpande's pqsort/1 being 7.75 times slower than qsort/1,
it's going to take a lot of cores just to catch up. My improved
pqsortB is 4 times faster than his on a single core.
I tried again with a 2.2GHz CPU, 667 MHz memory, 2-core machine.
This time I needed 100,000 element lists to get useful times.
170 msec original qsort
120 msec
110 msec
100 msec best qsort from above
-smp disable -smp enable
830 msec 1720 msec original pqsort
240 msec 310 msec improved pqsortB
110 msec 110 msec partition once, sort sublists
using one child, combine.
As a double check, with -smp disable, the emulator reported
[rq:1] in the banner; with -smp enable, [smp:2:2] [rq:2].
Here's what I *suspect* is going on. The time ratios for
the sequential versions are very close to the speed ratios
of the main memories on the two machines. It looks as
though memory is the main bottleneck.
Now my laptop may have two cores, but they share one main
memory. (And I think they share an L2 cache.) If the time
of a computation is dominated by memory, dividing the work
across several CPUs isn't going to help: you need to divide
it across several memories.
Another key point is that message passing is *copying*.
It seems clear from the "-smp enable" column above that
"the fewer the processes the better", which in this case
actually means "the less COPYING of data between processes
the better".
-smp disable -smp enable #processes #cells copied
830 msec 1720 msec 32766 2800000
240 msec 310 msec 16383 1406112
110 msec 110 msec 1 100000
Here a "cell" means a list cell whose head is a float copied
in a message from one process to another. #processes means
the number of *additional* (temporary) processes.
We have one mechanism in Erlang for both modelling concurrent
systems and obtaining parallel speedups: logically separate
processes communicating by message passing. In the former
case, you create (at least) one Erlang process per logical
activity, NOT for speed but for CLARITY, so that each logical
activity can be modelled separately from the others. The
deciding factor in the number of processes you create is the
number of things to be modelled.
When you are trying to get parallelism, there is seldom any
point in having more than a small number of times as many
processes as you have available cores. If you have lots of
processes all wanting to be active at the same time, the
costs of scheduling and message passing are likely to be high
relative to the useful work. The best parallel quicksort
above just used two processes (one the original, the other
a temporary child). On a quad core machine, four processes
would make sense.
More information about the erlang-questions
mailing list