[erlang-questions] Re: Classification of concurrency patterns
Jay Nelson
jay@REDACTED
Wed May 19 17:29:33 CEST 2010
On May 19, 2010, at 7:07 AM, Vlad Dumitrescu wrote:
> I see, I think there are two issues that got mixed up (at least in my
> mind they did). You started the discussion asking about identifying
> patterns which are abstract and general. This is different than the
> latter issue about pmap and number of cores, I think.
Indeed!
I thought the thread was about the theoretical and practical task of
identifying and describing concurrency patterns. I had wanted to
start a thread on that topic but wasn't ready to put the thought
into it, so it was a fortunate occurrence when Joe asked about it.
The issues related to the classification scheme are:
1) Identifying and organizing the types of concurrency
2) Expressing the patterns in a standard way so they can be referenced
3) Coming up with illustrative problems and code solutions
4) Identifying the factors which affect the performance of each pattern
A second topic cropped up as Joe thought about his problem in
relation to all the abstract patterns:
> There's no point in expressing the concurrency in an algorithm if the
> hardware can't do it anyway
> (or perhaps there is?)
This question has both theoretical and practical considerations. If
you are interested in defining and describing concurrency, then it
should be expressed as several models along with the performance
tradeoffs each implies (can it scale up or down, how does it deal
with failure, partial success and recovery, is it manageable as a
system...). If, however, you are interested in raw, practical
performance, concurrency is not the goal of the problem but rather
one possible solution.
There is another side to this, and it relates to how one develops
software. In the past, we expected compilers and serial performance
to improve over time: code was written and just automatically ran
faster on newer technology. Just because we are writing concurrent
software, we shouldn't abandon this strategy. Software written for 4
cores today should run fast enough, but I would hope for it to run
faster on 8, 16 or more cores if the problem can scale up. It may
run faster for the same size problem if the granularity is high, but
it should definitely run faster for larger problems if you can
easily increase the granularity with problem size, which is often
done by replicating tasks across nodes as map/reduce does.
[By granularity I mean the number of concurrent subtasks, or
equivalently the inverse of the size of computation for each
subtask. Think grains of sand passing through an hourglass rather
than pebbles.]
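To make that concrete with numbers of my own (just a shell-level
illustration, not anything from the thread): for a problem of size L
split into chunks of size S, the granularity is the ceiling of L/S.

    %% Granularity for a problem of size L in chunks of size S
    %% (ceiling of L / S). Small S = sand, large S = pebbles.
    Granularity = fun(L, S) -> (L + S - 1) div S end.
    Granularity(1000000, 64).      %% 15625 fine-grained subtasks
    Granularity(1000000, 250000).  %% 4 coarse-grained subtasks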
Finally, Joe's real motivation is a problem that he can't solve fast
enough right now on 4 cores:
> I have to map the problem onto a set of concurrent processes that
> run on the given hardware.
There are several issues tied up in this statement: the need for a
solution now, the problem of what to do in a future case of the same
dilemma, and whether theory helps at all.
Practical:
I would find as much concurrency as possible, implement that, and
measure. Coalesce concurrency where it introduces bottlenecks (too
many messages in a queue: send them as a block, or merge two
processes to eliminate the message queue; parallel map => chunked
map => serial map). Iterate, measuring and trying alternatives.
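As a minimal sketch of that coalescing progression (the module and
function names are mine, not from any library):

    -module(coalesce).
    -export([pmap/2, chunked_pmap/3]).

    %% One process per element: maximal concurrency, maximal
    %% message traffic.
    pmap(F, List) ->
        Parent = self(),
        Pids = [spawn(fun() -> Parent ! {self(), F(X)} end)
                || X <- List],
        [receive {Pid, Result} -> Result end || Pid <- Pids].

    %% One process per chunk: coarser granularity, fewer messages.
    %% ChunkSize = length(List) degenerates to a serial map.
    chunked_pmap(F, List, ChunkSize) ->
        Chunked = pmap(fun(Chunk) -> lists:map(F, Chunk) end,
                       chunks(List, ChunkSize)),
        lists:append(Chunked).

    chunks([], _N) -> [];
    chunks(List, N) when length(List) =< N -> [List];
    chunks(List, N) ->
        {Chunk, Rest} = lists:split(N, List),
        [Chunk | chunks(Rest, N)].

For example, coalesce:chunked_pmap(fun(X) -> X*X end,
lists:seq(1, 1000), 100) runs ten worker processes instead of a
thousand.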
If it all coalesces to a serial algorithm, get a faster serial
computer or redefine the problem so that results may emerge from
several places in the system rather than only after all computation
completes. Another alternative is to pay very careful attention to
CPU cache usage and reformulate the problem to eliminate cache-line
misses in the low-level computation. This can have a dramatic effect
on performance. It may require using NIFs to drop into C for
critical sections, where proper cache usage can improve things
100-fold.
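The Erlang side of such a NIF might look something like this (a
hedged sketch; critical_nif.c and all the names here are
hypothetical, and the actual cache work would live in the C code):

    -module(critical).
    -export([hot_loop/1]).
    -on_load(init/0).

    %% Load the (hypothetical) C implementation at module load time.
    init() ->
        erlang:load_nif("./critical_nif", 0).

    %% Replaced by the C function when the NIF loads; this stub is
    %% only the fallback if loading fails.
    hot_loop(_Data) ->
        erlang:nif_error(nif_not_loaded).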
Another approach is to use core-oblivious algorithms, which attempt
to use as much concurrency as they can discover. There is a similar
approach to CPU cache usage, cache-oblivious algorithms
(http://en.wikipedia.org/wiki/Cache_oblivious), which essentially do
recursive dataset splitting until the data reaches a size that fits
inside your cache. The reassembly of the data will make maximal use
of the cache on various platforms without specifying platform-
specific parameters (you assume that, for example, 64 bytes of data
will definitely fit in the cache and bottom out with an optimal
implementation of the 64-byte case).
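A hedged sketch of that recursive-splitting shape (the 64-element
threshold and the names are illustrative only, and a real cache-
oblivious algorithm bottoms out on bytes, not list elements):

    -module(split_reduce).
    -export([reduce/2]).

    -define(BASE_SIZE, 64).

    %% F must be an associative binary fun; List must be non-empty.
    %% Base case: small enough to assume it sits in cache, so use a
    %% direct serial fold.
    reduce(F, List) when length(List) =< ?BASE_SIZE ->
        lists:foldl(F, hd(List), tl(List));
    %% Recursive case: split in half, solve each half, combine.
    reduce(F, List) ->
        {Left, Right} = lists:split(length(List) div 2, List),
        F(reduce(F, Left), reduce(F, Right)).

For example, split_reduce:reduce(fun erlang:'+'/2,
lists:seq(1, 100000)) sums the list by recursive halving.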
It sounds like you could use a core-oblivious algorithm with cache-
oblivious algorithms in each process, with autodetection (hopefully
by performance observation rather than by poking the hardware, since
there may be hyper-threading, etc.) causing you to adjust the number
of core processes.
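One way to do that observation-based autodetection, under my own
naming (nothing here comes from OTP beyond timer:tc/2): time the
same job at several worker counts and keep the fastest, perhaps
starting near erlang:system_info(schedulers_online).

    -module(autotune).
    -export([best_worker_count/2]).

    %% Work is a fun taking a worker count and running the job with
    %% that many processes; Candidates is a list of counts to try,
    %% e.g. [1, 2, 4, 8, 16].
    best_worker_count(Work, Candidates) ->
        Timed = [{element(1, timer:tc(Work, [N])), N}
                 || N <- Candidates],
        {_Micros, Best} = lists:min(Timed),
        Best.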
Future:
Compilers / runtimes should do the work described above. Java
HotSpot analyzes execution and reorders instructions. We should be
able to detect message queue backups, or parallel maps and pipelines
that have slow spots, and rearrange the code to do something
equivalent in a more efficient way. The code should be written once,
and the runtime should adapt to the hardware as best it can through
observation, measurement and tuning in real time.
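Until runtimes do this for us, the observation half is easy enough
to do by hand today; a minimal sketch (the threshold and names are
mine):

    -module(observe).
    -export([queue_backed_up/2]).

    %% True if Pid's mailbox has backed up past Threshold messages,
    %% a hint that its producers should batch their sends or that
    %% processes should be merged.
    queue_backed_up(Pid, Threshold) ->
        case erlang:process_info(Pid, message_queue_len) of
            {message_queue_len, Len} -> Len > Threshold;
            undefined                -> false  % process has exited
        end.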
Theory:
All this practical work, plus a proper classification reference (even
if not truly hierarchical), could provide the basis for a new
approach to the VM, new methods of expressing data or processes, and
new techniques for implementing concurrency vs. serial execution (I
imagine something like an accordion that runs serially when
compressed, runs concurrently when stretched, and whose feedback
measurements press the keys to change the tune).
Nicolai Waniek wrote:
> Describing a parallel pattern should always mention that one has to
> figure out if the problem will be best solved in parallel.
I think a guide to concurrency should be just that; the reader is
expected to reference other guides to consider alternatives. The OO
Patterns book doesn't ask whether the problem can be expressed as
OO; it assumes that path has been chosen and then provides the
alternatives.
jay