[erlang-questions] Why using pmap is not giving me the desired speedup ?
Tushar Deshpande
tushar.erlang@REDACTED
Wed Nov 18 06:18:31 CET 2009
Hi Richard and Angel,
I appreciate your help a lot.
I must admit that I naively assumed concurrency
and parallelism to be the same thing. I was under the
impression that concurrent code would give me a speedup.
Clearly, this is not the case.
I think that precisely for this reason, using MapReduce
makes a lot of sense.
I wrote a distributed sort using Joe Armstrong's
mapreduce implementation. I limited the number
of spawned tasks to the number of cores.
And, guess what, I got the desired speedup. The distributed
sort performed almost 20 times faster for an array
of 100,000 integers.
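
The core of the idea, stripped of the mapreduce machinery,
looks roughly like this (just a sketch, assuming one worker
per scheduler is a fair stand-in for "one per core"):

psort_chunked(L) ->
    N = erlang:system_info(schedulers_online),
    Self = self(),
    %% one sorting process per chunk, at most N chunks
    Pids = [spawn(fun () -> Self ! {self(), lists:sort(C)} end)
            || C <- chunks(N, L)],
    %% collect in spawn order, then merge the sorted chunks
    lists:merge([receive {Pid, S} -> S end || Pid <- Pids]).

%% split L into at most N chunks of roughly equal length
chunks(N, L) ->
    Size = max(1, (length(L) + N - 1) div N),
    chunks_(Size, L).

chunks_(_, []) -> [];
chunks_(Size, L) when length(L) =< Size -> [L];
chunks_(Size, L) ->
    {C, Rest} = lists:split(Size, L),
    [C | chunks_(Size, Rest)].
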
I'm a grad student at the State University of New York
at Stony Brook, studying Computer Architecture.
I'm quite sure that I would learn a lot about multicore
machines by writing some actual programs :)
Warm Regards,
Tushar Deshpande
On Tue, Nov 17, 2009 at 6:50 PM, Richard O'Keefe <ok@REDACTED> wrote:
>
> On Nov 17, 2009, at 7:45 PM, Tushar Deshpande wrote:
>
>> I imagined that I would get some observable
>> speedup with parallel quicksort.
>>
>> Contrary to my expectations, parallel version
>> took more time to run.
>>
>
>
>> What could be the reason ?
>>
>
> Why is parallel quicksort effective on shared memory multiprocessors?
> Because there is an O(n) in-place partitioning, followed by sparking
> two activities, which need to be told only "where is the first element
> of my sublist" and "how many elements" (2 words), and which need no
> synchronisation between them, the only synchronisation being for the
> parent to wait for its two children. Indeed, we only need one child.
>
> psort(a, n) =
>     if n is small then
>         use insertion sort
>     else
>         choose pivot x
>         partition (a,n) into (a,L), (a+L,E), (a+L+E,G)
>           where L is the number of elements < x, E the
>           number of elements = x, G the number > x.
>         child = fork psort(a+L+E, G)
>         psort(a, L)
>         wait for child
>     end if
>
> Tushar Deshpande's parallel quicksort, thanks to its use of
> pmap, creates two children, when it only needs to create one.
>
> Not only that, the left and right sublists have to be *sent*
> to each child, that's O(n) copying, the sorted versions have
> to be *received* from each child, that's O(n) copying again,
> and then they have to be concatenated. The sequential version
> has only to concatenate.
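>
> (For concreteness, a pmap/2 of the kind meant here, stripped
> of error handling, is roughly
>
> pmap(F, Xs) ->
>     Self = self(),
>     Pids = [spawn(fun () -> Self ! {self(), F(X)} end) || X <- Xs],
>     [receive {Pid, R} -> R end || Pid <- Pids].
>
> i.e. one child per element of Xs, with each X copied out to
> its child and each F(X) copied back.)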
>
> It would be interesting to try
>
> pqsortB(L) ->
>     pqsortB(L, length(L)).
>
> pqsortB(L, N) when N < 10 ->    % tunable threshold
>     lists:sort(L);
> pqsortB([Pivot|L], _) ->
>     ppartitionB(L, Pivot, [], 0, [Pivot], [], 0).
>
> ppartitionB([], _, Lss, Nlss, Eql, Gtr, Ngtr) ->
>     Self = self(),
>     Pid = spawn(fun () -> Self ! {self(), pqsortB(Gtr, Ngtr)} end),
>     Left = pqsortB(Lss, Nlss),
>     receive {Pid, Right} -> Left ++ (Eql ++ Right) end;
> ppartitionB([X|Xs], Pivot, Lss, Nlss, Eql, Gtr, Ngtr) ->
>     if X < Pivot -> ppartitionB(Xs, Pivot, [X|Lss], Nlss+1, Eql, Gtr, Ngtr)
>      ; X > Pivot -> ppartitionB(Xs, Pivot, Lss, Nlss, Eql, [X|Gtr], Ngtr+1)
>      ; true      -> ppartitionB(Xs, Pivot, Lss, Nlss, [X|Eql], Gtr, Ngtr)
>     end.
>
> In fact it seemed interesting enough that I *did* try it.
> I measured the time to sort a list of 10,000 uniformly generated
> random floats, on a machine with a 500MHz CPU and 84 MHz memory.
> All times in this message are the smallest of 5 runs.
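>
> A harness along these lines reproduces that kind of
> measurement (rand and timer:tc here are just one way to
> write it, not necessarily what was actually used):
>
> time_sort(Sort, N, Runs) ->
>     %% sort the same N random floats Runs times and keep the
>     %% fastest time, in microseconds
>     L = [rand:uniform() || _ <- lists:seq(1, N)],
>     lists:min([element(1, timer:tc(fun () -> Sort(L) end))
>                || _ <- lists:seq(1, Runs)]).
>
> e.g. time_sort(fun pqsortB/1, 10000, 5).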
>
> 120 msec  Tushar Deshpande's qsort
> 110 msec  Sequential qsort with a single partition pass as above,
>           similarly switching over to lists:sort for small N.
> 110 msec  Same as above, but not bothering with counts, checking
>           for large -vs- small by [Pivot|L = [_,_,_|_]] or not
>           and sorting 0 to 3 elements inline.
>  80 msec  Same as previous, but qsort(L, T) == qsort(L)++T, to
>           try to reduce the amount of concatenation.
>
> 930 msec  Tushar Deshpande's pqsort (with a simplified version of
>           pmap/2 that doesn't bother with error handling)
> 230 msec  The pqsortB/1 function above.
>
> With Tushar Deshpande's pqsort/1 being 7.75 times slower than qsort/1,
> it's going to take a lot of cores just to catch up. My improved
> pqsortB is 4 times faster than his on a single core.
>
> I tried again with a 2.2GHz CPU, 667 MHz memory, 2-core machine.
> This time I needed 100,000 element lists to get useful times.
>
> 170 msec original qsort
> 120 msec
> 110 msec
> 100 msec best qsort from above
>
>   -smp disable   -smp enable
>       830 msec     1720 msec   original pqsort
>       240 msec      310 msec   improved pqsortB
>       110 msec      110 msec   partition once, sort sublists
>                                using one child, combine.
>
> As a double check, with -smp disable, the emulator reported
> [rq:1] in the banner; with -smp enable, [smp:2:2] [rq:2].
>
> Here's what I *suspect* is going on. The time ratios for
> the sequential versions are very close to the speed ratios
> of the main memories on the two machines. It looks as
> though memory is the main bottleneck.
>
> Now my laptop may have two cores, but they share one main
> memory. (And I think they share an L2 cache.) If the time
> of a computation is dominated by memory, dividing the work
> across several CPUs isn't going to help: you need to divide
> it across several memories.
>
> Another key point is that message passing is *copying*.
> It seems clear from the "-smp enable" column above that
> "the fewer the processes the better", which in this case
> actually means "the less COPYING of data between processes
> the better".
>
>   -smp disable   -smp enable   #processes   #cells copied
>       830 msec     1720 msec        32766         2800000
>       240 msec      310 msec        16383         1406112
>       110 msec      110 msec            1          100000
>
> Here a "cell" means a list cell whose head is a float copied
> in a message from one process to another. #processes means
> the number of *additional* (temporary) processes.
>
> We have one mechanism in Erlang for both modelling concurrent
> systems and obtaining parallel speedups: logically separate
> processes communicating by message passing. In the former
> case, you create (at least) one Erlang process per logical
> activity, NOT for speed but for CLARITY, so that each logical
> activity can be modelled separately from the others. The
> deciding factor in the number of processes you create is the
> number of things to be modelled.
>
> When you are trying to get parallelism, there is seldom any
> point in having more than a small number of times as many
> processes as you have available cores. If you have lots of
> processes all wanting to be active at the same time, the
> costs of scheduling and message passing are likely to be high
> relative to the useful work. The best parallel quicksort
> above just used two processes (one the original, the other
> a temporary child). On a quad core machine, four processes
> would make sense.
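>
> A sketch of that (not code from this thread; Procs taken
> from schedulers_online is only an approximation of the
> number of available cores) is to stop forking once the
> process budget is used up:
>
> pqsortD(L) ->
>     pqsortD(L, length(L), erlang:system_info(schedulers_online)).
>
> pqsortD(L, N, _) when N < 10 ->
>     lists:sort(L);
> pqsortD([Pivot|L], _, Procs) ->
>     {Lss, Nlss, Eql, Gtr, Ngtr} =
>         partD(L, Pivot, [], 0, [Pivot], [], 0),
>     if Procs > 1 ->
>            %% half the budget goes to the child, half stays here
>            Self = self(),
>            Pid = spawn(fun () ->
>                            Self ! {self(), pqsortD(Gtr, Ngtr, Procs div 2)}
>                        end),
>            Left = pqsortD(Lss, Nlss, Procs - Procs div 2),
>            receive {Pid, Right} -> Left ++ (Eql ++ Right) end;
>        true ->
>            %% budget exhausted: finish sequentially
>            pqsortD(Lss, Nlss, 1) ++ (Eql ++ pqsortD(Gtr, Ngtr, 1))
>     end.
>
> partD([], _, Lss, Nlss, Eql, Gtr, Ngtr) ->
>     {Lss, Nlss, Eql, Gtr, Ngtr};
> partD([X|Xs], Pivot, Lss, Nlss, Eql, Gtr, Ngtr) ->
>     if X < Pivot -> partD(Xs, Pivot, [X|Lss], Nlss+1, Eql, Gtr, Ngtr)
>      ; X > Pivot -> partD(Xs, Pivot, Lss, Nlss, Eql, [X|Gtr], Ngtr+1)
>      ; true      -> partD(Xs, Pivot, Lss, Nlss, [X|Eql], Gtr, Ngtr)
>     end.
>
> Starting it with a budget of four gives at most four
> simultaneously busy processes, whatever the input size.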
>
>
>
> ________________________________________________________________
> erlang-questions mailing list. See http://www.erlang.org/faq.html
> erlang-questions (at) erlang.org
>
>