Michael Turner <>
Wed Nov 18 11:10:50 CET 2009

Some years ago, I attended a talk by Doug Clark about upcall
implementations of IP.  He stressed that the most important numbers you
should know, when going into optimization of protocols at that level,
were context-switch times and memory-copy times.  Brutally basic
operations, but both so important and frequent that they enjoy some
degree of instruction-set-level support in any decent architecture.  In
some ways, context-switch costs are dominated by memory-copy times  --
saving current state in memory, restoring interrupted state from memory
-- so in a way, it almost comes down to the cost of copying.  Quicksort
is great, if you can do it in-place.  But if you have to do O(n log n)
copies of the data, well ....

The discussion here reminds me that issues like these are still important
in optimization.  Which means you should always be revisiting the
question: how important is optimization?

The last time I had any serious issue involving quicksort, it was because
a junior software engineer, not realizing there was a UNIX library
function qsort(3), had translated some Pascal textbook quicksort to C. 
Not being very expert in C, he'd written unportable code, which
segfaulted on a new platform.  Upon investigating the issue, I
discovered two things:

 (1) I could replace his two-dozen-lines-or-so lines of quicksort
implementation/binary-search-lookup with just two lines, using library
calls, BUT --

 (2) the array he was sorting wasn't worth sorting anyway, because in
real-world situations, nobody expected the list to get more than four or
five items deep.  Linear search of an unsorted array would have been
faster than binary search on any sorted one, even without an unrolled
loop or sentinel.

He'd actually slowed down the system (slightly), he had taken time to
implement this "pessimization" (expensive), and he had made systems
crash (more expensive still), which required that somebody (me) debug
his code (which was even more expensive than his original
reimplementation of quicksort, since he was a junior employee and I was
a contractor working at a higher hourly rate.)

I really wish I was making this story up, but it did happen.  Premature
optimization truly is the root of all evil.  Don't get me wrong, I love
that Erlang two-liner version of quicksort.  But it would be a service
to the world if, in future editions of books and documents mentioning
it, there was also a footnote pointing out that people should use
quicksort as an optimization, only if necessary, and that Erlang will
tend to execute this particular quicksort rather suboptimally.  The cool
thing about the two-liner is that it's a surprisingly clear expression
of sorting, not that it's a good implementation in any other respect.

-michael turner

On 11/17/2009, "Richard O'Keefe" <> wrote:

>On Nov 17, 2009, at 7:45 PM, Tushar Deshpande wrote:
>> I imagined that I would get some observable
>> speedup with parallel quicksort.
>> Contrary to my expectations, parallel version
>> took more time to run.
>> What could be the reason ?
>Why is parallel quicksort effective on shared memory multiprocessors?
>Because there is an O(n) in-place partitioning, following by sparking
>two activities, which need to be told only "where is the first element
>of my sublist" and "how many elements" (2 words), and which need no
>synchronisation between them, the only synchronisation being for the
>parent to wait for its two children.  Indeed, we only need one child.
>	psort(a, n) =
>	    if n is small then
>		use insertion sort
>	    else
>		choose pivot x
>		partition (a,n) into (a,L), (a+L,E), (a+L+E,G)
>		where L is the number of elements < x, E the
>		number of elements = x, G the number > x.
>		child = fork psort(a+L+E,G)
>		psort(a,L)
>		wait for child
>	    end if
>Tushar Deshpande's parallel quicksort, thanks to its use of
>pmap, creates two children, when it only needs to create one.
>Not only that, the left and right sublists have to be *sent*
>to each child, that's O(n) copying, the sorted versions have
>to be *received* from each child, that's O(n) copying again,
>and then they have to be concatenated.  The sequential version
>has only to concatenate.
>It would be interesting to try
>pqsortB(L) ->
>     pqsortB(L, length(L)).
>pqsortB(L, N) when N < 10 -> % tunable threshold
>     lists:sort(L);
>pqsortB([Pivot|L], _) ->
>     ppartitionB(L, Pivot, [], 0, [Pivot], [], 0).
>ppartitionB([], _, Lss, Nlss, Eql, Gtr, Ngtr) ->
>     Self = self(),
>     Pid = spawn(fun () -> Self!{self(),pqsortB(Gtr, Ngtr)} end),
>     Left = pqsortB(Lss, Nlss),
>     receive {Pid,Right} -> Left++(Eql++Right) end;
>ppartitionB([X|Xs], Pivot, Lss, Nlss, Eql, Gtr, Ngtr) ->
>     if X < Pivot -> ppartitionB(Xs, Pivot, [X|Lss], Nlss+1, Eql, Gtr,
>      ; X > Pivot -> ppartitionB(Xs, Pivot, Lss, Nlss, Eql, [X|Gtr],
>      ; true -> ppartitionB(Xs, Pivot, Lss, Nlss, [X|Eql], Gtr, Ngtr)
>     end.
>In fact it seemed interesting enough that I *did* try it.
>I measured the time to sort a list of 10,000 uniformly generated
>random floats, on a machine with a 500MHz CPU and 84 MHz memory.
>All times in this message are the smallest of 5 runs.
>120 msec	Tushar Deshpande's qsort
>110 msec	Sequential qsort with single partition pass as above
>		and similarly switching over to lists:qsort for small N.
>110 msec	Same as above but not bothering with counts, checking
>		for large -vs- small by [Pivot|L = [_,_,_|_]] or not
>		and sorting 0 to 3 elements inline.
>  80 msec	Same as previous, but qsort(L, T) == qsort(L)++T, to
>		try to reduce the amount of concatenation.
>930 msec	Tushar Deshpande's pqsort (with a simplified version of
>		pmap/2 that doesn't bother with error handling)
>230 msec	The pqsortB/1 function above.
>With Tushar Deshpande's pqsort/1 being 7.75 times slower than qsort/1,
>it's going to take a lot of cores just to catch up.  My improved
>pqsortB is 4 times faster than his on a single core.
>I tried again with a 2.2GHz CPU, 667 MHz memory, 2-core machine.
>This time I needed 100,000 element lists to get useful times.
>  170 msec	original qsort
>  120 msec
>  110 msec
>  100 msec	best qsort from above
>-smp disable	-smp enable
>  830 msec       1720 msec	original pqsort
>  240 msec	 310 msec	improved pqsortB
>  110 msec        110 msec       partition once, sort sublists
>				using one child, combine.
>As a double check, with -smp disable, the emulator reported
>[rq:1] in the banner; with -smp enable, [smp:2:2] [rq:2].
>Here's what I *suspect* is going on.  The time ratios for
>the sequential versions are very close to the speed ratios
>of the main memories on the two machines.  It looks as
>though memory is the main bottleneck.
>Now my laptop may have two cores, but they share one main
>memory.  (And I think they share an L2 cache.)  If the time
>of a computation is dominated by memory, dividing the work
>across several CPUs isn't going to help: you need to divide
>it across several memories.
>Another key point is that message passing is *copying*.
>It seems clear from the "-smp enable" column above that
>"the fewer the processes the better", which in this case
>actually means "the less COPYING of data between processes
>the better".
>-smp disable	-smp enable	#processes	#cells copied
>  830 msec	1720 msec	32766		2800000
>  240 msec	 310 msec	16383		1406112
>  110 msec	 110 msec	1		 100000
>Here a "cell" means a list cell whose head is a float copied
>in a message from one process to another.  #processes means
>the number of *additional* (temporary) processes.
>We have one mechanism in Erlang for both modelling concurrent
>systems and obtaining parallel speedups:  logically separate
>processes communicating by message passing.  In the former
>case, you create (at least) one Erlang process per logical
>activity, NOT for speed but for CLARITY, so that each logical
>activity can be modelled separately from the others.  The
>deciding factor in the number of processes you create is the
>number of things to be modelled.
>When you are trying to get parallelism, there is seldom any
>point in having more than a small number of times as many
>processes as you have available cores.  If you have lots of
>processes all wanting to be active at the same time, the
>costs of scheduling and message passing are likely to be high
>relative to the useful work.  The best parallel quicksort
>above just used two processes (one the original, the other
>a temporary child).  On a quad core machine, four processes
>would make sense.
>erlang-questions mailing list. See http://www.erlang.org/faq.html
>erlang-questions (at) erlang.org

More information about the erlang-questions mailing list