[erlang-questions] Parallel Shootout & a style question
Thu Sep 4 20:20:32 CEST 2008
--- On Thu, 9/4/08, Gleb Peregud <> wrote:
> <> wrote:
> > Gleb Peregud wrote:
> >> It would be nice to include plists (or similar) into stdlib
> >> permanently. It will allow users of Erlang/OTP fine-tune their
> >> applications for SMP systems just by adding "p" (and Malt) where
> >> necessary. I think this issue should be in the hands of the
> >> programmer.
> > that the competent programmer will always be able to tune her
> > application to however many cores she's targeting is a truism. the
> > issue at hand is what strategy the OTP libraries should use when the
> > programmer does not/cannot provide hints (or when running legacy
> > code).
> >> Automatic "parallelization" of lists:* will not be efficient, at
> >> least without extensive compile-time (or even run-time) analysis of
> >> the code (if this is possible/feasible at all).
> > "will not" is strong wording. can you provide some evidence to back
> > that up?
> You are right, I've used too strong words here. I'm not able to back
> them up.
Well, you just need to look back to the 80s and early 90s to be vindicated :-)
Lots and lots of examples there of too fine-grained parallelization making the hardware crawl along. In Erlang, we have the basic issue that each spawned task requires a couple of hundred words of memory (I think), so the overhead for a job that itself requires only a few tens of cycles is huge, not a mere 10%.
Simple cases can sometimes be handled by manual annotation, but there are lots of pitfalls. And at the bottom of the pit you not only choke on huge run queues, you also run out of memory, etc.
For example, consider a parallel map operation. For what inputs will parallel evaluation beat sequential evaluation? Clearly it depends on the map operation F, the length of the list, and, in the end, even the actual list elements. (In Erlang, we also have higher-order functions to handle, including nested parallel maps passed at runtime. Composing these operations is non-trivial.) Presumably, it might also depend on the overall state of the system, e.g., whether you can afford to spawn more jobs given current memory etc.
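To make the trade-off concrete, here is a minimal sketch of a naive parallel map (module and function names are mine, not from plists or stdlib): it spawns one process per element, so for a cheap F the per-spawn cost and the message round-trip dominate the actual work.

```erlang
-module(pmap_naive).
-export([pmap/2]).

%% Naive parallel map: one process per list element.
%% Each worker is tagged with a unique ref so replies cannot be
%% confused, and results are collected in the original list order.
pmap(F, List) ->
    Parent = self(),
    Refs = [begin
                Ref = make_ref(),
                spawn(fun() -> Parent ! {Ref, F(X)} end),
                Ref
            end || X <- List],
    [receive {Ref, Result} -> Result end || Ref <- Refs].
```

Note also what the sketch leaves out: if a worker crashes, the receive blocks forever. Monitors, timeouts and so on add yet more per-element overhead.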
So, when should we spawn a task to evaluate E in parallel rather than running it sequentially? In general, we end up with this problem: for an arbitrary expression E, cheaply and precisely predict the cost of evaluating it (sequentially and, for extra credit, taking into consideration dynamic parallelization). There has been some work on doing this automatically, but it's hard, to say the least.
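The usual manual workaround is to dodge the prediction problem by making the programmer pick the granularity, which is what plists' Malts amount to. A hedged sketch (names are mine): spawn one process per chunk rather than per element, and stay sequential below the cutoff.

```erlang
-module(pmap_chunked).
-export([pmap/3]).

%% Parallel map with programmer-chosen granularity: one process per
%% chunk of at most ChunkSize elements, so each process does enough
%% work to amortize its creation cost. At or below one chunk, run
%% sequentially.
pmap(F, List, ChunkSize) when length(List) =< ChunkSize ->
    lists:map(F, List);
pmap(F, List, ChunkSize) ->
    Parent = self(),
    Refs = [begin
                Ref = make_ref(),
                spawn(fun() -> Parent ! {Ref, lists:map(F, Chunk)} end),
                Ref
            end || Chunk <- chunk(List, ChunkSize)],
    lists:append([receive {Ref, Results} -> Results end || Ref <- Refs]).

%% Split a list into sublists of at most N elements.
chunk([], _N) -> [];
chunk(L, N) when length(L) =< N -> [L];
chunk(L, N) ->
    {Head, Tail} = lists:split(N, L),
    [Head | chunk(Tail, N)].
```

Of course, the right ChunkSize still depends on F, the list, and the machine, so this only relocates the cost estimate from the compiler to the programmer.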
(Also, tuning your code to the number of cores is a nightmare to maintain, but everyone knows this, right?)
I would thus expect a great deal of difficulty in attaining high speedups. At a guess, only embarrassingly parallel applications -- happily, Erlang has lots of those -- will utilize 100s of cores successfully. (One successful example of this is the throughput-hungry realm of graphics. Cf. CUDA and others, whose model appears to be successive passes of SIMD operations over big arrays. And even there, you can get your nose bloodied when trying to predict performance.) Why? Because bottlenecks tend to turn up much earlier than that. Serious tuning seems likely to be needed.
In conclusion, it means good times for the experts :-)