[erlang-questions] Why using pmap is not giving me the desired speedup ?

Richard O'Keefe ok@REDACTED
Wed Nov 18 00:50:21 CET 2009


On Nov 17, 2009, at 7:45 PM, Tushar Deshpande wrote:
> I imagined that I would get some observable
> speedup with parallel quicksort.
>
> Contrary to my expectations, parallel version
> took more time to run.

>
> What could be the reason ?

Why is parallel quicksort effective on shared memory multiprocessors?
Because there is an O(n) in-place partitioning, followed by sparking
two activities, which need to be told only "where is the first element
of my sublist" and "how many elements" (2 words), and which need no
synchronisation between them, the only synchronisation being for the
parent to wait for its two children.  Indeed, we need to fork only one
child: the parent can handle the other half itself, as the pseudocode
below shows.

	psort(a, n) =
	    if n is small then
		use insertion sort
	    else
		choose pivot x
		partition (a,n) into (a,L), (a+L,E), (a+L+E,G)
		where L is the number of elements < x, E the
		number of elements = x, G the number > x.
		child = fork psort(a+L+E,G)
		psort(a,L)
		wait for child
	    end if

Tushar Deshpande's parallel quicksort, thanks to its use of
pmap, creates two children, when it only needs to create one.

Not only that, the left and right sublists have to be *sent*
to each child, that's O(n) copying, the sorted versions have
to be *received* from each child, that's O(n) copying again,
and then they have to be concatenated.  The sequential version
has only to concatenate.
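For concreteness, a minimal pmap/2 in this style might look like the
sketch below (no error handling, and not necessarily the exact
function used in the measurements further down):

pmap(F, Xs) ->
    %% Spawn one worker per element; each X is copied to its worker
    %% and each F(X) is copied back, which is exactly the O(n)
    %% message-copying cost described above.
    Self = self(),
    Pids = [spawn(fun () -> Self ! {self(), F(X)} end) || X <- Xs],
    %% Collect the results in the original order by selective receive.
    [receive {Pid, Y} -> Y end || Pid <- Pids].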

It would be interesting to try

pqsortB(L) ->
     pqsortB(L, length(L)).

pqsortB(L, N) when N < 10 -> % tunable threshold
     lists:sort(L);
pqsortB([Pivot|L], _) ->
     ppartitionB(L, Pivot, [], 0, [Pivot], [], 0).

%% End of the partitioning pass: sort the "greater" part in one
%% spawned child, sort the "less" part in this process, then stitch
%% the three pieces back together.
ppartitionB([], _, Lss, Nlss, Eql, Gtr, Ngtr) ->
     Self = self(),
     Pid = spawn(fun () -> Self!{self(),pqsortB(Gtr, Ngtr)} end),
     Left = pqsortB(Lss, Nlss),
     receive {Pid,Right} -> Left++(Eql++Right) end;
ppartitionB([X|Xs], Pivot, Lss, Nlss, Eql, Gtr, Ngtr) ->
     if X < Pivot -> ppartitionB(Xs, Pivot, [X|Lss], Nlss+1, Eql, Gtr, Ngtr)
      ; X > Pivot -> ppartitionB(Xs, Pivot, Lss, Nlss, Eql, [X|Gtr], Ngtr+1)
      ; true -> ppartitionB(Xs, Pivot, Lss, Nlss, [X|Eql], Gtr, Ngtr)
     end.

In fact it seemed interesting enough that I *did* try it.
I measured the time to sort a list of 10,000 uniformly distributed
random floats, on a machine with a 500MHz CPU and 84 MHz memory.
All times in this message are the smallest of 5 runs.
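For anyone who wants to repeat this kind of measurement, a harness
roughly like the one below will do; the use of random:uniform/0 and
timer:tc/2 here is my own choice for illustration, not a record of
the harness actually used.

time_sort(Sort, N) ->
    %% Build N uniformly distributed random floats in (0,1) and
    %% time a single call of Sort on them, in microseconds.
    L = [random:uniform() || _ <- lists:seq(1, N)],
    {Micros, _Sorted} = timer:tc(Sort, [L]),
    Micros.

e.g. time_sort(fun pqsortB/1, 10000) from within the same module.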

120 msec	Tushar Deshpande's qsort
110 msec	Sequential qsort with single partition pass as above
		and similarly switching over to lists:sort for small N.
110 msec	Same as above but not bothering with counts, checking
		for large -vs- small by [Pivot|L = [_,_,_|_]] or not
		and sorting 0 to 3 elements inline.
  80 msec	Same as previous, but qsort(L, T) == qsort(L)++T, to
		try to reduce the amount of concatenation.

930 msec	Tushar Deshpande's pqsort (with a simplified version of
		pmap/2 that doesn't bother with error handling)
230 msec	The pqsortB/1 function above.

With Tushar Deshpande's pqsort/1 being 7.75 times slower than qsort/1,
it's going to take a lot of cores just to catch up.  My improved
pqsortB is 4 times faster than his on a single core.

I tried again with a 2.2GHz CPU, 667 MHz memory, 2-core machine.
This time I needed 100,000 element lists to get useful times.

  170 msec	original qsort
  120 msec	single partition pass version (second above)
  110 msec	no-counts version (third above)
  100 msec	best qsort from above

-smp disable	-smp enable
  830 msec	1720 msec	original pqsort
  240 msec	 310 msec	improved pqsortB
  110 msec	 110 msec	partition once, sort sublists
				using one child, combine.

As a double check, with -smp disable, the emulator reported
[rq:1] in the banner; with -smp enable, [smp:2:2] [rq:2].
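The code for the last line of that table ("partition once, sort
sublists using one child, combine") is not shown above; a
reconstruction along the following lines captures the idea, though it
is not necessarily the exact code that was measured:

psort2([]) -> [];
psort2([Pivot|Xs]) ->
    %% One partitioning pass over the whole list.
    {Lss, Eql, Gtr} = part3(Xs, Pivot, [], [Pivot], []),
    %% One child sorts the "greater" part; this process sorts the
    %% "less" part; only Gtr and its sorted result are copied.
    Self = self(),
    Pid = spawn(fun () -> Self ! {self(), lists:sort(Gtr)} end),
    Left = lists:sort(Lss),
    receive {Pid, Right} -> Left ++ (Eql ++ Right) end.

part3([], _, Lss, Eql, Gtr) -> {Lss, Eql, Gtr};
part3([X|Xs], Pivot, Lss, Eql, Gtr) ->
    if X < Pivot -> part3(Xs, Pivot, [X|Lss], Eql, Gtr)
     ; X > Pivot -> part3(Xs, Pivot, Lss, Eql, [X|Gtr])
     ; true      -> part3(Xs, Pivot, Lss, [X|Eql], Gtr)
    end.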

Here's what I *suspect* is going on.  The time ratios for
the sequential versions are very close to the speed ratios
of the main memories on the two machines.  It looks as
though memory is the main bottleneck.

Now my laptop may have two cores, but they share one main
memory.  (And I think they share an L2 cache.)  If the time
of a computation is dominated by memory, dividing the work
across several CPUs isn't going to help: you need to divide
it across several memories.

Another key point is that message passing is *copying*.
It seems clear from the "-smp enable" column above that
"the fewer the processes the better", which in this case
actually means "the less COPYING of data between processes
the better".

-smp disable	-smp enable	#processes	#cells copied
  830 msec	1720 msec	32766		2800000
  240 msec	 310 msec	16383		1406112
  110 msec	 110 msec	1		 100000

Here a "cell" means a list cell whose head is a float copied
in a message from one process to another.  #processes means
the number of *additional* (temporary) processes.

We have one mechanism in Erlang for both modelling concurrent
systems and obtaining parallel speedups:  logically separate
processes communicating by message passing.  In the former
case, you create (at least) one Erlang process per logical
activity, NOT for speed but for CLARITY, so that each logical
activity can be modelled separately from the others.  The
deciding factor in the number of processes you create is the
number of things to be modelled.

When you are trying to get parallelism, there is seldom any
point in having more than a small number of times as many
processes as you have available cores.  If you have lots of
processes all wanting to be active at the same time, the
costs of scheduling and message passing are likely to be high
relative to the useful work.  The best parallel quicksort
above just used two processes (one the original, the other
a temporary child).  On a quad core machine, four processes
would make sense.
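If more cores were available, one way to keep the number of workers
tied to the machine is to ask the runtime how many schedulers it has
and split the work into about that many chunks.  The sketch below
(the names are mine, purely for illustration) sorts one chunk per
scheduler and merges the results:

psort_chunks(L) ->
    %% One chunk per online scheduler (i.e. roughly per core).
    N = erlang:system_info(schedulers_online),
    Chunks = split_chunks(L, max(1, length(L) div N)),
    Self = self(),
    Pids = [spawn(fun () -> Self ! {self(), lists:sort(C)} end)
            || C <- Chunks],
    %% Merge the sorted chunks as they come back.
    lists:foldl(fun (Pid, Acc) ->
                        receive {Pid, S} -> lists:merge(S, Acc) end
                end, [], Pids).

split_chunks([], _) -> [];
split_chunks(L, Len) when length(L) > Len ->
    {Chunk, Rest} = lists:split(Len, L),
    [Chunk | split_chunks(Rest, Len)];
split_chunks(L, _) -> [L].

Note that this still copies every element out to a worker and back,
so by the copying argument above it would not necessarily beat the
single-child version; it only shows how to size the worker count to
the core count.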


