[erlang-questions] Application granularity (was: Parallel Shootout & a style question)

Fri Sep 5 08:33:14 CEST 2008

... in response to a flurry of messages about automatically  
parallelizing list comprehensions ...

If I go through all the code I have written and count the number of  
list comprehensions relative to the total amount of code, there are  
not that many occurrences to worry about.  The length of each  
comprehension's data set is not that great in typical code, and  
unless it is a very data parallel algorithm such as matrix  
manipulation, there is little to be gained overall.  Mats' toss-off
figure of 10% would overestimate the likely benefit in the code I
typically have to implement.

... also references to tuning the number of processes to the number  
of cores ...

Tuning to a specific hardware configuration is folly unless you have  
only one implementation site and never plan on modifying the setup.   
I really would not recommend this approach to programming, unless you
have a specific problem that can only be solved today by a
carefully tuned solution.  I think the majority of cases are not
in this boat.

------

In general, the Erlang approach is to isolate sequential code within
a collection of processes.  The great effort comes in architecting a
good organization and hierarchy of logic so that failures are
contained and work is spread to maximum effect.  What is desired is an
efficient and responsive _application_ rather than an efficient
snippet of code sprinkled here and there.

In terms of performance, I look to scalability -- the same code
should run faster on a newer machine without any tweaks.  Tweaks may
improve things more, but they should be unnecessary to get the basic
speedup.

Quite a while back (a couple of decades) I remember hearing about
attempts to parallelize code.  No one could seem to get a linear
speedup with the number of processors.  One day it was announced
that a direct linear speedup had been achieved and it seemed the
number of processors could be increased without loss of linearity.
This alchemy was performed by turning the approach upside down.
Instead of trying to decompose an algorithm into components that were
independent and could efficiently parallelize, the implementors chose
to multiply the problem by a few orders of magnitude.  They
replicated the algorithm and scaled up the problem to produce more
work than the processors could keep up with.  Adding more processors
just made it run faster.

Over the last few years I have been contemplating the state of
applications, operating systems and the benefits that Erlang offers.
The biggest advantage is that processes are lightweight and can be  
treated as equivalent to data structures when designing an  
architecture.  Doing so affords an approach to constructing  
applications that is far different from the monolithic structures  
that we currently face, where one failure crashes your entire browser  
(at least until Google Chrome came out).

Instead of futzing with automating the handling of a single vector, I  
submit you should spend your time trying to figure out how to  
structure your application so that it can have at least 1000  
processes.  When you move from 4 cores to 8 to 32 or 64, you should
see linear speedup in your application without modifying anything.
And all the compiler tools that we currently use will work to your  
advantage without change.

If your application ends up with a bunch of large vectors and lots of  
computation, partition the data to make lots of processes.  If it  
doesn't have large data or computational requirements, partition the  
software components so that they are easier to test and debug and  
they can operate on separate processors or cores.
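
To make the "partition the data" case concrete, here is a minimal
sketch, not a tuned implementation.  The module name chunk_pmap, the
function pmap/3 and the ChunkSize parameter are my own illustrative
choices.  It splits a list into chunks, hands each chunk to its own
process, and reassembles the results in order.  Note that the chunk
size is chosen from the data, not from the number of cores.

```erlang
-module(chunk_pmap).
-export([pmap/3]).

%% Apply F to every element of List, using one process per chunk of
%% at most ChunkSize elements.  Results keep the original order.
pmap(F, List, ChunkSize)
  when is_function(F, 1), is_integer(ChunkSize), ChunkSize > 0 ->
    Parent = self(),
    Chunks = chunk(List, ChunkSize),
    %% spawn_monitor/1 lets us notice a crashed worker instead of
    %% waiting forever for its reply.
    Workers = [spawn_monitor(fun() ->
                                     Parent ! {self(), lists:map(F, C)}
                             end) || C <- Chunks],
    lists:append([collect(Pid, Ref) || {Pid, Ref} <- Workers]).

%% Wait for one worker's result, or fail if that worker died first.
collect(Pid, Ref) ->
    receive
        {Pid, Result} ->
            erlang:demonitor(Ref, [flush]),
            Result;
        {'DOWN', Ref, process, Pid, Reason} ->
            exit({worker_failed, Reason})
    end.

%% Split a list into sublists of at most N elements each.
chunk([], _N) -> [];
chunk(List, N) ->
    {Head, Tail} = take(N, List, []),
    [Head | chunk(Tail, N)].

take(0, Rest, Acc) -> {lists:reverse(Acc), Rest};
take(_N, [], Acc) -> {lists:reverse(Acc), []};
take(N, [H | T], Acc) -> take(N - 1, T, [H | Acc]).
```

For example, chunk_pmap:pmap(fun(X) -> X * X end, [1,2,3,4,5], 2)
runs three worker processes and returns [1,4,9,16,25].  With a small
chunk size and a large list you get many processes, and the scheduler
decides how to place them on whatever cores are present.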

With the future of hardware continuing towards many-core, the new
measure of the quality of application architecture will be the
granularity of the components.  They should be small, task-specific,
fault-isolating, transparently distributable and interfaced with
minimal messaging.  Whenever you are confronted with making an
algorithm more complicated versus keeping it simpler by introducing
more processes, go with more processes.  If your first implementation
is fast enough (even if it is 10% slower than it could be), it will
speed up automatically with future hardware upgrades.

I believe the compiler writers and tool builders should focus on  
making it easier to produce more numerous but smaller processes,
rather than trying to make the sequential code eke out an additional
10% of performance.  I want my code to run 1000x faster when I get a
1000-core machine.  I will likely need 100,000 processes to realize
that goal.
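
As a rough illustration of how cheap processes are, the sketch below
spawns N short-lived processes and counts their replies.  The module
name many_procs and the function run/1 are hypothetical, chosen only
for this example.  The maximum number of simultaneous processes
depends on the emulator's settings (the +P flag to erl), so treat
100,000 as something to try rather than a guarantee.

```erlang
-module(many_procs).
-export([run/1]).

%% Spawn N processes, have each send one message back to the caller,
%% and return the number of replies received.
run(N) when is_integer(N), N >= 0 ->
    Parent = self(),
    %% A unique reference keeps unrelated messages out of the count.
    Ref = make_ref(),
    Pids = [spawn(fun() -> Parent ! {Ref, self()} end)
            || _ <- lists:seq(1, N)],
    wait(Ref, length(Pids), 0).

%% Receive exactly Left replies tagged with Ref.
wait(_Ref, 0, Count) -> Count;
wait(Ref, Left, Count) ->
    receive
        {Ref, _Pid} -> wait(Ref, Left - 1, Count + 1)
    end.
```

Calling timer:tc(many_procs, run, [100000]) returns the elapsed
microseconds together with the reply count, which gives a quick feel
for the cost of process creation on a given machine.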

jay