[erlang-questions] Re: Classification of concurrency patterns

Sean Hernandez seanhern@REDACTED
Wed May 19 21:34:29 CEST 2010


> Another approach is to use core-oblivious algorithms which attempt to use as much concurrency as they can discover.

> Future:

> Compilers / runtimes should do the work of above.  Java Hotspot analyzes execution and reorders instructions.  We should be able to detect message queue backups or parallel maps or pipelines that have slow spots and rearrange the code to do something equivalent in a more efficient way.  The code should be written once and the runtime should adapt to the hardware as best it can through observation, measurement and tuning in real time.

I think these are really two separate things: (1) discovering a machine's
processor, core and memory topology and (2) scheduling instructions
within a core.

Discovering a machine's processor, core and memory topology is
important because of the asymmetry of latencies when accessing memory
that is local to a core or processor vs. memory that is remote (i.e.
NUMA). Once you've minimized the probability of accessing remote
memory, you can then minimize instruction cache misses (you can't do
anything for the data cache) using an opportunistic scheduler that
"reuses" instructions that have previously executed (resulting in
out-of-order execution).

These are really about (1) discovering the core topology of the
machine the runtime is executing on, taking into account latencies,
and (2) affinitizing schedulers to that topology.


On Wed, May 19, 2010 at 10:29 AM, Jay Nelson <jay@REDACTED> wrote:
>
> On May 19, 2010, at 7:07 AM, Vlad Dumitrescu wrote:
>
>> I see, I think there are two issues that got mixed up (at least in my
>> mind they did). You started the discussion asking about identifying
>> patterns which are abstract and general. This is different than the
>> latter issue about pmap and number of cores, I think.
>
> Indeed!
>
> I thought the thread was about the theoretical and practical task of identifying and describing concurrent patterns.  I had wanted to start a thread on that topic, but wasn't ready to put the thought into it so it was a fortunate occurrence when Joe asked about it.  The issues related to the classification scheme are:
>
> 1) Identifying and organizing the types of concurrency
> 2) Expressing the patterns in a standard way so they can be referenced
> 3) Coming up with illustrative problems and code solutions
> 4) Identifying the factors which affect the performance of each
>
> A second topic cropped up as Joe thought about his problem in relation to all the abstract patterns:
>
> > There's no point in expressing the concurrency in an algorithm if the
> > hardware can't do it anyway
> > (or perhaps there is?)
>
> This question has both theoretical and practical considerations.  If you are interested in defining and describing concurrency, then it should be expressed as several models with the tradeoffs in performance that each implies (can it scale up or down, how does it deal with failure, partial success and recovery, is it manageable as a system...); however, if you are interested in raw, practical performance, concurrency is not the goal of the problem, but rather one possible solution.
>
> There is another side of this and it relates to how one develops software.  In the past, we expected compilers and serial performance to improve over time.  Code is written and just automatically runs faster with newer technology.  Just because we are writing concurrent software, we shouldn't abandon this strategy.  Software written for 4 cores today should run fast enough, but I would hope for it to run faster on 8, 16 or more cores if the problem can scale up (it may run faster for the same size problem if the granularity is high, but should definitely run faster for larger problems if you can easily increase the granularity with problem size, which is often done by replicating tasks across nodes as map/reduce does).
>
> [By granularity I mean the number of concurrent subtasks, or inversely the size of the computation for each subtask.  Think grains of sand getting through an hourglass, rather than pebbles.]
>
> Finally, Joe's real motivation is a problem that he can't solve fast enough right now on 4 cores:
>
> > I have to map the problem onto a set of concurrent processes that
> > run on the given hardware.
>
> There are several issues tied up in this statement: the need for a solution now, the problem of what to do in a future case of the same dilemma and whether theory helps at all.
>
> Practical:
>
> I would find as much concurrency as is possible, implement that and measure.  Coalesce concurrency where it introduces bottlenecks (too many messages in queue, send as a block or merge two processes to eliminate message queue; parallel map => chunked map => serial map).  Iterate measuring and trying alternatives.
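
A minimal Erlang sketch of that coalescing knob: a pmap with an explicit
chunk size, so ChunkSize = 1 is a fully parallel map, larger chunks merge
messages into blocks, and ChunkSize = length(List) degenerates to a serial
map. Module and function names are mine, not from the thread:

```erlang
-module(chunked_pmap).
-export([pmap/3]).

%% Parallel map over List, ChunkSize elements per worker.
%% Each worker applies F serially to its chunk and sends one
%% message back; receiving in Pid order reassembles results
%% in the original order (selective receive).
pmap(F, List, ChunkSize) ->
    Parent = self(),
    Chunks = split(List, ChunkSize),
    Pids = [spawn(fun() ->
                      Parent ! {self(), [F(X) || X <- Chunk]}
                  end) || Chunk <- Chunks],
    lists:append([receive {Pid, Res} -> Res end || Pid <- Pids]).

split([], _N) -> [];
split(List, N) when length(List) =< N -> [List];
split(List, N) ->
    {Chunk, Rest} = lists:split(N, List),
    [Chunk | split(Rest, N)].
```

Tuning then becomes a matter of measuring with different chunk sizes
rather than rewriting the algorithm.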
>
> If all coalesces to a serial algorithm, get a faster serial computer or redefine the problem so that results may emerge from several places in the system rather than after all computation.  Another alternative is to pay special attention to CPU cache usage and reformulate the problem to eliminate cache-line faults in low-level computation.  This can have a dramatic effect on performance.  It may require using NIFs to drop into C for critical sections where you can improve things 100-fold with proper cache usage.
>
> Another approach is to use core-oblivious algorithms which attempt to use as much concurrency as they can discover.  There is a similar approach to CPU Cache usage http://en.wikipedia.org/wiki/Cache_oblivious which essentially does recursive dataset splitting until a size is obtained which fits inside your cache.  The reassembly of the data will make maximal use of the cache on various platforms without specifying specific parameters (you make an assumption that, for example, 64 bytes of data will definitely fit in the cache and bottom out with an optimal implementation of the 64-byte case).
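
The recursive-splitting shape of a cache-oblivious algorithm is easy to
show in Erlang, with the caveat that Erlang lists are not contiguous in
memory, so here the cache benefit is only notional and the pattern is the
point. The base-case size of 64 is my assumed "definitely fits" threshold:

```erlang
-module(oblivious).
-export([msort/1]).

%% Assumed base-case size: small enough to "definitely fit".
-define(BASE, 64).

%% Recursively halve the input until the chunk is at or below
%% the base size, handle the base case with an optimized serial
%% routine, then reassemble by merging.
msort(L) when length(L) =< ?BASE ->
    lists:sort(L);
msort(L) ->
    {A, B} = lists:split(length(L) div 2, L),
    lists:merge(msort(A), msort(B)).
```

No cache parameters appear anywhere; only the bottom-out size is assumed.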
>
> It sounds like you could use a core-oblivious algorithm with cache-oblivious algorithms in each process, with autodetection (hopefully performance observation rather than poking the hardware, since there may be hyper-threading, etc) causing you to adjust the number of core-processes.
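
As a starting point for that autodetection, the VM already exposes its
own view of how many schedulers are usable (hyper-threading and external
CPU restrictions are folded into the number), and a measuring loop could
adjust the worker count up or down from this baseline. A one-line sketch,
module name mine:

```erlang
-module(core_detect).
-export([nworkers/0]).

%% Initial worker count for a core-oblivious algorithm: ask the
%% runtime rather than poking the hardware. Re-queried each call,
%% so it tracks runtime changes via system_flag(schedulers_online, N).
nworkers() ->
    erlang:system_info(schedulers_online).
```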
>
> Future:
>
> Compilers / runtimes should do the work of above.  Java Hotspot analyzes execution and reorders instructions.  We should be able to detect message queue backups or parallel maps or pipelines that have slow spots and rearrange the code to do something equivalent in a more efficient way.  The code should be written once and the runtime should adapt to the hardware as best it can through observation, measurement and tuning in real time.
>
> Theory:
>
> All this practical work, plus a proper classification reference (even if not truly hierarchical), could provide the basis for a new approach to the VM, new methods of expressing data or processes, and new techniques for implementing concurrency vs. serial execution (I imagine something like an accordion that runs serial when compressed, runs concurrently when stretched and feedback measurements press the keys to change the tune).
>
>
> Nicolai Waniek wrote:
>
> > Describing a parallel pattern should always mention that one has to
> > figure out if the problem will be best solved in parallel.
>
> I think a guide to concurrency should be just that.  The reader is expected to reference other guides to consider alternatives.  The OO Patterns book doesn't ask if the problem can be expressed as OO, it assumes the path chosen and then provides the alternatives.
>
> jay
>
>
> ________________________________________________________________
> erlang-questions (at) erlang.org mailing list.
> See http://www.erlang.org/faq.html
> To unsubscribe; mailto:erlang-questions-unsubscribe@REDACTED
>


More information about the erlang-questions mailing list