[erlang-questions] Why using pmap is not giving me the desired speedup ?

Angel Alvarez clist@REDACTED
Thu Nov 19 19:54:48 CET 2009


Great Post!!!

Very educative, with the threshold and the "one child optimization".

Thanks 


El Miércoles, 18 de Noviembre de 2009 Richard O'Keefe escribió:
> 
> On Nov 17, 2009, at 7:45 PM, Tushar Deshpande wrote:
> > I imagined that I would get some observable
> > speedup with parallel quicksort.
> >
> > Contrary to my expectations, parallel version
> > took more time to run.
> 
> >
> > What could be the reason ?
> 
> Why is parallel quicksort effective on shared memory multiprocessors?
> Because there is an O(n) in-place partitioning, following by sparking
> two activities, which need to be told only "where is the first element
> of my sublist" and "how many elements" (2 words), and which need no
> synchronisation between them, the only synchronisation being for the
> parent to wait for its two children.  Indeed, we only need one child.
> 
> 	psort(a, n) =
> 	    if n is small then
> 		use insertion sort
> 	    else
> 		choose pivot x
> 		partition (a,n) into (a,L), (a+L,E), (a+L+E,G)
> 		where L is the number of elements < x, E the
> 		number of elements = x, G the number > x.
> 		child = fork psort(a+L+E,G)
> 		psort(a,L)
> 		wait for child
> 	    end if
> 
> Tushar Deshpande's parallel quicksort, thanks to its use of
> pmap, creates two children, when it only needs to create one.
> 
> Not only that, the left and right sublists have to be *sent*
> to each child, that's O(n) copying, the sorted versions have
> to be *received* from each child, that's O(n) copying again,
> and then they have to be concatenated.  The sequential version
> has only to concatenate.
> 
> It would be interesting to try
> 
> pqsortB(L) ->
>      pqsortB(L, length(L)).
> 
> pqsortB(L, N) when N < 10 -> % tunable threshold
>      lists:sort(L);
> pqsortB([Pivot|L], _) ->
>      ppartitionB(L, Pivot, [], 0, [Pivot], [], 0).
> 
> ppartitionB([], _, Lss, Nlss, Eql, Gtr, Ngtr) ->
>      Self = self(),
>      Pid = spawn(fun () -> Self!{self(),pqsortB(Gtr, Ngtr)} end),
>      Left = pqsortB(Lss, Nlss),
>      receive {Pid,Right} -> Left++(Eql++Right) end;
> ppartitionB([X|Xs], Pivot, Lss, Nlss, Eql, Gtr, Ngtr) ->
>      if X < Pivot -> ppartitionB(Xs, Pivot, [X|Lss], Nlss+1, Eql, Gtr,  
> Ngtr)
>       ; X > Pivot -> ppartitionB(Xs, Pivot, Lss, Nlss, Eql, [X|Gtr],  
> Ngtr+1)
>       ; true -> ppartitionB(Xs, Pivot, Lss, Nlss, [X|Eql], Gtr, Ngtr)
>      end.
> 
> In fact it seemed interesting enough that I *did* try it.
> I measured the time to sort a list of 10,000 uniformly generated
> random floats, on a machine with a 500MHz CPU and 84 MHz memory.
> All times in this message are the smallest of 5 runs.
> 
> 120 msec	Tushar Deshpande's qsort
> 110 msec	Sequential qsort with single partition pass as above
> 		and similarly switching over to lists:qsort for small N.
> 110 msec	Same as above but not bothering with counts, checking
> 		for large -vs- small by [Pivot|L = [_,_,_|_]] or not
> 		and sorting 0 to 3 elements inline.
>   80 msec	Same as previous, but qsort(L, T) == qsort(L)++T, to
> 		try to reduce the amount of concatenation.
> 
> 930 msec	Tushar Deshpande's pqsort (with a simplified version of
> 		pmap/2 that doesn't bother with error handling)
> 230 msec	The pqsortB/1 function above.
> 
> With Tushar Deshpande's pqsort/1 being 7.75 times slower than qsort/1,
> it's going to take a lot of cores just to catch up.  My improved
> pqsortB is 4 times faster than his on a single core.
> 
> I tried again with a 2.2GHz CPU, 667 MHz memory, 2-core machine.
> This time I needed 100,000 element lists to get useful times.
> 
>   170 msec	original qsort
>   120 msec
>   110 msec
>   100 msec	best qsort from above
> 
> -smp disable	-smp enable
>   830 msec       1720 msec	original pqsort
>   240 msec	 310 msec	improved pqsortB
>   110 msec        110 msec       partition once, sort sublists
> 				using one child, combine.
> 
> As a double check, with -smp disable, the emulator reported
> [rq:1] in the banner; with -smp enable, [smp:2:2] [rq:2].
> 
> Here's what I *suspect* is going on.  The time ratios for
> the sequential versions are very close to the speed ratios
> of the main memories on the two machines.  It looks as
> though memory is the main bottleneck.
> 
> Now my laptop may have two cores, but they share one main
> memory.  (And I think they share an L2 cache.)  If the time
> of a computation is dominated by memory, dividing the work
> across several CPUs isn't going to help: you need to divide
> it across several memories.
> 
> Another key point is that message passing is *copying*.
> It seems clear from the "-smp enable" column above that
> "the fewer the processes the better", which in this case
> actually means "the less COPYING of data between processes
> the better".
> 
> -smp disable	-smp enable	#processes	#cells copied
>   830 msec	1720 msec	32766		2800000
>   240 msec	 310 msec	16383		1406112
>   110 msec	 110 msec	1		 100000
> 
> Here a "cell" means a list cell whose head is a float copied
> in a message from one process to another.  #processes means
> the number of *additional* (temporary) processes.
> 
> We have one mechanism in Erlang for both modelling concurrent
> systems and obtaining parallel speedups:  logically separate
> processes communicating by message passing.  In the former
> case, you create (at least) one Erlang process per logical
> activity, NOT for speed but for CLARITY, so that each logical
> activity can be modelled separately from the others.  The
> deciding factor in the number of processes you create is the
> number of things to be modelled.
> 
> When you are trying to get parallelism, there is seldom any
> point in having more than a small number of times as many
> processes as you have available cores.  If you have lots of
> processes all wanting to be active at the same time, the
> costs of scheduling and message passing are likely to be high
> relative to the useful work.  The best parallel quicksort
> above just used two processes (one the original, the other
> a temporary child).  On a quad core machine, four processes
> would make sense.
> 
> 
> ________________________________________________________________
> erlang-questions mailing list. See http://www.erlang.org/faq.html
> erlang-questions (at) erlang.org
> 
> 



-- 
No imprima este correo si no es necesario. El medio ambiente está en nuestras manos.
->>-----------------------------------------------
    Clist UAH a.k.a Angel
---------------------------------[www.uah.es]-<<--

*J2EE es una arquitectura enorme, pesada y complicada diseñada especialmente para que los departamentos de sistemas ¿? justifiquen sus gastos y los gerentes puedan gastarse sus millones en IBM y ORACLE


More information about the erlang-questions mailing list