[erlang-questions] Why using pmap is not giving me the desired speedup ?
Tushar Deshpande
tushar.erlang@REDACTED
Wed Nov 18 06:18:31 CET 2009
Hi Richard and Angel,
I appreciate your help a lot.
I must admit that I naively assumed concurrency
and parallelism to be the same thing. I was under the
impression that concurrent code would give me a speedup.
Clearly, this is not the case.
I think that precisely for this reason, using MapReduce
makes a lot of sense.
I wrote a distributed sort using Joe Armstrong's
mapreduce implementation. I limited the number
of spawned tasks to the number of cores.
And, guess what, I got the desired speedup. The distributed
sort performed almost 20 times faster for an array
of 100,000 integers.
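
The core of the idea, stripped of the mapreduce machinery,
looks roughly like this (just a sketch, assuming one worker
per scheduler is a fair stand-in for "one per core"):

psort_chunked(L) ->
    N = erlang:system_info(schedulers_online),
    Self = self(),
    %% one sorting process per chunk, at most N chunks
    Pids = [spawn(fun () -> Self ! {self(), lists:sort(C)} end)
            || C <- chunks(N, L)],
    %% collect in spawn order, then merge the sorted chunks
    lists:merge([receive {Pid, S} -> S end || Pid <- Pids]).

%% split L into at most N chunks of roughly equal length
chunks(N, L) ->
    Size = max(1, (length(L) + N - 1) div N),
    chunks_(Size, L).

chunks_(_, []) -> [];
chunks_(Size, L) when length(L) =< Size -> [L];
chunks_(Size, L) ->
    {C, Rest} = lists:split(Size, L),
    [C | chunks_(Size, Rest)].
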
I'm a grad student at the State University of New York
at Stony Brook, studying Computer Architecture.
I'm quite sure that I would learn a lot about multicore
machines by writing some actual programs :)
Warm Regards,
Tushar Deshpande
On Tue, Nov 17, 2009 at 6:50 PM, Richard O'Keefe <ok@REDACTED> wrote:
>
> On Nov 17, 2009, at 7:45 PM, Tushar Deshpande wrote:
>
>> I imagined that I would get some observable
>> speedup with parallel quicksort.
>>
>> Contrary to my expectations, parallel version
>> took more time to run.
>>
>
>
>> What could be the reason ?
>>
>
> Why is parallel quicksort effective on shared memory multiprocessors?
> Because there is an O(n) in-place partitioning, followed by sparking
> two activities, which need to be told only "where is the first element
> of my sublist" and "how many elements" (2 words), and which need no
> synchronisation between them, the only synchronisation being for the
> parent to wait for its two children. Indeed, we only need one child.
>
> psort(a, n) =
>     if n is small then
>         use insertion sort
>     else
>         choose pivot x
>         partition (a,n) into (a,L), (a+L,E), (a+L+E,G)
>           where L is the number of elements < x, E the
>           number of elements = x, G the number > x.
>         child = fork psort(a+L+E, G)
>         psort(a, L)
>         wait for child
>     end if
>
> Tushar Deshpande's parallel quicksort, thanks to its use of
> pmap, creates two children, when it only needs to create one.
>
> Not only that, the left and right sublists have to be *sent*
> to each child, that's O(n) copying, the sorted versions have
> to be *received* from each child, that's O(n) copying again,
> and then they have to be concatenated. The sequential version
> has only to concatenate.
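>
> (For concreteness, a pmap/2 of the kind meant here, stripped
> of error handling, is roughly
>
> pmap(F, Xs) ->
>     Self = self(),
>     Pids = [spawn(fun () -> Self ! {self(), F(X)} end) || X <- Xs],
>     [receive {Pid, R} -> R end || Pid <- Pids].
>
> i.e. one child per element of Xs, with each X copied out to
> its child and each F(X) copied back.)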
>
> It would be interesting to try
>
> pqsortB(L) ->
>     pqsortB(L, length(L)).
>
> pqsortB(L, N) when N < 10 ->    % tunable threshold
>     lists:sort(L);
> pqsortB([Pivot|L], _) ->
>     ppartitionB(L, Pivot, [], 0, [Pivot], [], 0).
>
> ppartitionB([], _, Lss, Nlss, Eql, Gtr, Ngtr) ->
>     Self = self(),
>     Pid = spawn(fun () -> Self ! {self(), pqsortB(Gtr, Ngtr)} end),
>     Left = pqsortB(Lss, Nlss),
>     receive {Pid, Right} -> Left ++ (Eql ++ Right) end;
> ppartitionB([X|Xs], Pivot, Lss, Nlss, Eql, Gtr, Ngtr) ->
>     if X < Pivot -> ppartitionB(Xs, Pivot, [X|Lss], Nlss+1, Eql, Gtr, Ngtr)
>      ; X > Pivot -> ppartitionB(Xs, Pivot, Lss, Nlss, Eql, [X|Gtr], Ngtr+1)
>      ; true      -> ppartitionB(Xs, Pivot, Lss, Nlss, [X|Eql], Gtr, Ngtr)
>     end.
>
> In fact it seemed interesting enough that I *did* try it.
> I measured the time to sort a list of 10,000 uniformly generated
> random floats, on a machine with a 500MHz CPU and 84 MHz memory.
> All times in this message are the smallest of 5 runs.
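>
> A harness along these lines reproduces that kind of
> measurement (rand and timer:tc here are just one way to
> write it, not necessarily what was actually used):
>
> time_sort(Sort, N, Runs) ->
>     %% sort the same N random floats Runs times and keep the
>     %% fastest time, in microseconds
>     L = [rand:uniform() || _ <- lists:seq(1, N)],
>     lists:min([element(1, timer:tc(fun () -> Sort(L) end))
>                || _ <- lists:seq(1, Runs)]).
>
> e.g. time_sort(fun pqsortB/1, 10000, 5).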
>
> 120 msec  Tushar Deshpande's qsort
> 110 msec  Sequential qsort with a single partition pass as above,
>           similarly switching over to lists:sort for small N.
> 110 msec  Same as above, but not bothering with counts, checking
>           for large -vs- small by [Pivot|L = [_,_,_|_]] or not
>           and sorting 0 to 3 elements inline.
>  80 msec  Same as previous, but qsort(L, T) == qsort(L)++T, to
>           try to reduce the amount of concatenation.
>
> 930 msec  Tushar Deshpande's pqsort (with a simplified version of
>           pmap/2 that doesn't bother with error handling)
> 230 msec  The pqsortB/1 function above.
>
> With Tushar Deshpande's pqsort/1 being 7.75 times slower than qsort/1,
> it's going to take a lot of cores just to catch up. My improved
> pqsortB is 4 times faster than his on a single core.
>
> I tried again with a 2.2GHz CPU, 667 MHz memory, 2-core machine.
> This time I needed 100,000 element lists to get useful times.
>
> 170 msec original qsort
> 120 msec
> 110 msec
> 100 msec best qsort from above
>
>   -smp disable   -smp enable
>       830 msec     1720 msec   original pqsort
>       240 msec      310 msec   improved pqsortB
>       110 msec      110 msec   partition once, sort sublists
>                                using one child, combine.
>
> As a double check, with -smp disable, the emulator reported
> [rq:1] in the banner; with -smp enable, [smp:2:2] [rq:2].
>
> Here's what I *suspect* is going on. The time ratios for
> the sequential versions are very close to the speed ratios
> of the main memories on the two machines. It looks as
> though memory is the main bottleneck.
>
> Now my laptop may have two cores, but they share one main
> memory. (And I think they share an L2 cache.) If the time
> of a computation is dominated by memory, dividing the work
> across several CPUs isn't going to help: you need to divide
> it across several memories.
>
> Another key point is that message passing is *copying*.
> It seems clear from the "-smp enable" column above that
> "the fewer the processes the better", which in this case
> actually means "the less COPYING of data between processes
> the better".
>
>   -smp disable   -smp enable   #processes   #cells copied
>       830 msec     1720 msec        32766         2800000
>       240 msec      310 msec        16383         1406112
>       110 msec      110 msec            1          100000
>
> Here a "cell" means a list cell whose head is a float copied
> in a message from one process to another. #processes means
> the number of *additional* (temporary) processes.
>
> We have one mechanism in Erlang for both modelling concurrent
> systems and obtaining parallel speedups: logically separate
> processes communicating by message passing. In the former
> case, you create (at least) one Erlang process per logical
> activity, NOT for speed but for CLARITY, so that each logical
> activity can be modelled separately from the others. The
> deciding factor in the number of processes you create is the
> number of things to be modelled.
>
> When you are trying to get parallelism, there is seldom any
> point in having more than a small number of times as many
> processes as you have available cores. If you have lots of
> processes all wanting to be active at the same time, the
> costs of scheduling and message passing are likely to be high
> relative to the useful work. The best parallel quicksort
> above just used two processes (one the original, the other
> a temporary child). On a quad core machine, four processes
> would make sense.
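>
> A sketch of that (not code from this thread; Procs taken
> from schedulers_online is only an approximation of the
> number of available cores) is to stop forking once the
> process budget is used up:
>
> pqsortD(L) ->
>     pqsortD(L, length(L), erlang:system_info(schedulers_online)).
>
> pqsortD(L, N, _) when N < 10 ->
>     lists:sort(L);
> pqsortD([Pivot|L], _, Procs) ->
>     {Lss, Nlss, Eql, Gtr, Ngtr} =
>         partD(L, Pivot, [], 0, [Pivot], [], 0),
>     if Procs > 1 ->
>            %% half the budget goes to the child, half stays here
>            Self = self(),
>            Pid = spawn(fun () ->
>                            Self ! {self(), pqsortD(Gtr, Ngtr, Procs div 2)}
>                        end),
>            Left = pqsortD(Lss, Nlss, Procs - Procs div 2),
>            receive {Pid, Right} -> Left ++ (Eql ++ Right) end;
>        true ->
>            %% budget exhausted: finish sequentially
>            pqsortD(Lss, Nlss, 1) ++ (Eql ++ pqsortD(Gtr, Ngtr, 1))
>     end.
>
> partD([], _, Lss, Nlss, Eql, Gtr, Ngtr) ->
>     {Lss, Nlss, Eql, Gtr, Ngtr};
> partD([X|Xs], Pivot, Lss, Nlss, Eql, Gtr, Ngtr) ->
>     if X < Pivot -> partD(Xs, Pivot, [X|Lss], Nlss+1, Eql, Gtr, Ngtr)
>      ; X > Pivot -> partD(Xs, Pivot, Lss, Nlss, Eql, [X|Gtr], Ngtr+1)
>      ; true      -> partD(Xs, Pivot, Lss, Nlss, [X|Eql], Gtr, Ngtr)
>     end.
>
> Starting it with a budget of four gives at most four
> simultaneously busy processes, whatever the input size.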
>
>
>
> ________________________________________________________________
> erlang-questions mailing list. See http://www.erlang.org/faq.html
> erlang-questions (at) erlang.org
>
>