[erlang-questions] Re: Classification of concurrency patterns
Jay Nelson
jay@REDACTED
Wed May 19 17:29:33 CEST 2010
On May 19, 2010, at 7:07 AM, Vlad Dumitrescu wrote:
> I see, I think there are two issues that got mixed up (at least in my
> mind they did). You started the discussion asking about identifying
> patterns which are abstract and general. This is different than the
> latter issue about pmap and number of cores, I think.
Indeed!
I thought the thread was about the theoretical and practical task of
identifying and describing concurrency patterns. I had wanted to
start a thread on that topic but wasn't ready to put the thought
into it, so it was a fortunate occurrence when Joe asked about it.
The issues related to the classification scheme are:
1) Identifying and organizing the types of concurrency
2) Expressing the patterns in a standard way so they can be referenced
3) Coming up with illustrative problems and code solutions
4) Identifying the factors which affect the performance of each pattern
A second topic cropped up as Joe thought about his problem in
relation to all the abstract patterns:
> There's no point in expressing the concurrency in an algorithm if the
> hardware can't do it anyway
> (or perhaps there is?)
This question has both theoretical and practical considerations. If
you are interested in defining and describing concurrency, then it
should be expressed as several models along with the performance
tradeoffs each implies (can it scale up or down, how does it deal
with failure, partial success and recovery, is it manageable as a
system...). If, however, you are interested in raw, practical
performance, concurrency is not the goal of the problem but rather
one possible solution.
There is another side to this, and it relates to how one develops
software. In the past, we expected compilers and serial performance
to improve over time: code was written and just automatically ran
faster on newer technology. Just because we are writing concurrent
software, we shouldn't abandon this strategy. Software written for 4
cores today should run fast enough, but I would hope for it to run
faster on 8, 16 or more cores if the problem can scale up. It may
run faster for the same size problem if the granularity is high, but
it should definitely run faster for larger problems if you can
easily increase the granularity with problem size, which is often
done by replicating tasks across nodes as map/reduce does.
[By granularity I mean the number of concurrent subtasks, or
equivalently the inverse of the size of computation for each
subtask. Think grains of sand passing through an hourglass rather
than pebbles.]
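To make that concrete with numbers of my own (just a shell-level
illustration, not anything from the thread): for a problem of size L
split into chunks of size S, the granularity is the ceiling of L/S.

    %% Granularity for a problem of size L in chunks of size S
    %% (ceiling of L / S). Small S = sand, large S = pebbles.
    Granularity = fun(L, S) -> (L + S - 1) div S end.
    Granularity(1000000, 64).      %% 15625 fine-grained subtasks
    Granularity(1000000, 250000).  %% 4 coarse-grained subtasks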
Finally, Joe's real motivation is a problem that he can't solve fast
enough right now on 4 cores:
> I have to map the problem onto a set of concurrent processes that
> run on the given hardware.
There are several issues tied up in this statement: the need for a
solution now, the problem of what to do in a future case of the same
dilemma, and whether theory helps at all.
Practical:
I would find as much concurrency as possible, implement that, and
measure. Coalesce concurrency where it introduces bottlenecks (too
many messages in a queue: send them as a block, or merge two
processes to eliminate the message queue; parallel map => chunked
map => serial map). Iterate, measuring and trying alternatives.
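As a minimal sketch of that coalescing progression (the module and
function names are mine, not from any library):

    -module(coalesce).
    -export([pmap/2, chunked_pmap/3]).

    %% One process per element: maximal concurrency, maximal
    %% message traffic.
    pmap(F, List) ->
        Parent = self(),
        Pids = [spawn(fun() -> Parent ! {self(), F(X)} end)
                || X <- List],
        [receive {Pid, Result} -> Result end || Pid <- Pids].

    %% One process per chunk: coarser granularity, fewer messages.
    %% ChunkSize = length(List) degenerates to a serial map.
    chunked_pmap(F, List, ChunkSize) ->
        Chunked = pmap(fun(Chunk) -> lists:map(F, Chunk) end,
                       chunks(List, ChunkSize)),
        lists:append(Chunked).

    chunks([], _N) -> [];
    chunks(List, N) when length(List) =< N -> [List];
    chunks(List, N) ->
        {Chunk, Rest} = lists:split(N, List),
        [Chunk | chunks(Rest, N)].

For example, coalesce:chunked_pmap(fun(X) -> X*X end,
lists:seq(1, 1000), 100) runs ten worker processes instead of a
thousand.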
If it all coalesces to a serial algorithm, get a faster serial
computer or redefine the problem so that results may emerge from
several places in the system rather than only after all computation
completes. Another alternative is to pay very careful attention to
CPU cache usage and reformulate the problem to eliminate cache-line
misses in the low-level computation. This can have a dramatic effect
on performance. It may require using NIFs to drop into C for
critical sections, where proper cache usage can improve things
100-fold.
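The Erlang side of such a NIF might look something like this (a
hedged sketch; critical_nif.c and all the names here are
hypothetical, and the actual cache work would live in the C code):

    -module(critical).
    -export([hot_loop/1]).
    -on_load(init/0).

    %% Load the (hypothetical) C implementation at module load time.
    init() ->
        erlang:load_nif("./critical_nif", 0).

    %% Replaced by the C function when the NIF loads; this stub is
    %% only the fallback if loading fails.
    hot_loop(_Data) ->
        erlang:nif_error(nif_not_loaded).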
Another approach is to use core-oblivious algorithms, which attempt
to use as much concurrency as they can discover. There is a similar
approach to CPU cache usage, cache-oblivious algorithms
(http://en.wikipedia.org/wiki/Cache_oblivious), which essentially do
recursive dataset splitting until the data reaches a size that fits
inside your cache. The reassembly of the data will make maximal use
of the cache on various platforms without specifying platform-
specific parameters (you assume that, for example, 64 bytes of data
will definitely fit in the cache and bottom out with an optimal
implementation of the 64-byte case).
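A hedged sketch of that recursive-splitting shape (the 64-element
threshold and the names are illustrative only, and a real cache-
oblivious algorithm bottoms out on bytes, not list elements):

    -module(split_reduce).
    -export([reduce/2]).

    -define(BASE_SIZE, 64).

    %% F must be an associative binary fun; List must be non-empty.
    %% Base case: small enough to assume it sits in cache, so use a
    %% direct serial fold.
    reduce(F, List) when length(List) =< ?BASE_SIZE ->
        lists:foldl(F, hd(List), tl(List));
    %% Recursive case: split in half, solve each half, combine.
    reduce(F, List) ->
        {Left, Right} = lists:split(length(List) div 2, List),
        F(reduce(F, Left), reduce(F, Right)).

For example, split_reduce:reduce(fun erlang:'+'/2,
lists:seq(1, 100000)) sums the list by recursive halving.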
It sounds like you could use a core-oblivious algorithm with cache-
oblivious algorithms in each process, with autodetection (hopefully
by performance observation rather than by poking the hardware, since
there may be hyper-threading, etc.) causing you to adjust the number
of core processes.
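One way to do that observation-based autodetection, under my own
naming (nothing here comes from OTP beyond timer:tc/2): time the
same job at several worker counts and keep the fastest, perhaps
starting near erlang:system_info(schedulers_online).

    -module(autotune).
    -export([best_worker_count/2]).

    %% Work is a fun taking a worker count and running the job with
    %% that many processes; Candidates is a list of counts to try,
    %% e.g. [1, 2, 4, 8, 16].
    best_worker_count(Work, Candidates) ->
        Timed = [{element(1, timer:tc(Work, [N])), N}
                 || N <- Candidates],
        {_Micros, Best} = lists:min(Timed),
        Best.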
Future:
Compilers / runtimes should do the work described above. Java
HotSpot analyzes execution and reorders instructions. We should be
able to detect message queue backups, or parallel maps and pipelines
that have slow spots, and rearrange the code to do something
equivalent in a more efficient way. The code should be written once,
and the runtime should adapt to the hardware as best it can through
observation, measurement and tuning in real time.
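Until runtimes do this for us, the observation half is easy enough
to do by hand today; a minimal sketch (the threshold and names are
mine):

    -module(observe).
    -export([queue_backed_up/2]).

    %% True if Pid's mailbox has backed up past Threshold messages,
    %% a hint that its producers should batch their sends or that
    %% processes should be merged.
    queue_backed_up(Pid, Threshold) ->
        case erlang:process_info(Pid, message_queue_len) of
            {message_queue_len, Len} -> Len > Threshold;
            undefined                -> false  % process has exited
        end.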
Theory:
All this practical work, plus a proper classification reference (even
if not truly hierarchical), could provide the basis for a new
approach to the VM, new methods of expressing data or processes, and
new techniques for implementing concurrency vs. serial execution (I
imagine something like an accordion that runs serially when
compressed, runs concurrently when stretched, and whose feedback
measurements press the keys to change the tune).
Nicolai Waniek wrote:
> Describing a parallel pattern should always mention that one has to
> figure out if the problem will be best solved in parallel.
I think a guide to concurrency should be just that; the reader is
expected to reference other guides to consider alternatives. The OO
Patterns book doesn't ask whether the problem can be expressed as
OO; it assumes that path has been chosen and then provides the
alternatives.
jay