[erlang-questions] 1000 cores on a chip

Jesper Louis Andersen jesper.louis.andersen@REDACTED
Sat Jan 8 02:42:54 CET 2011


On Fri, Jan 7, 2011 at 04:30, Banibrata Dutta <banibrata.dutta@REDACTED> wrote:

> However, that many cores in the GPU world isn't new. CUDA has shown that to us
> for a while. People are increasingly making use of GPUs for general computation,
> and a while back I came across a startup that was using the
> extreme parallelism (in a box) offered by CUDA for video analytics and video
> encoding/decoding. Not sure how well FPGAs really fare w.r.t. power
> consumption in such setups, but the CUDA GPU units do consume a good bit of
> power.. and generate a good bit of heat.

The most important difference is that GPUs tend to be SIMD
architectures with almost no branching instructions. This 1000-core
beast is much closer to a thousand cores each running its own
individual program, which makes it a better fit for Erlang as it is now.
Suppose you have

case Foo of
  true -> do_x();
  false -> do_y()
end

In (most) GPUs this is compiled into:

case Foo of
  true -> do_x();
  false -> nop
end,
case Foo of
  true -> nop;
  false -> do_y()
end

So essentially the code runs twice: once for the true branch and once
for the false branch. Why? Because in a SIMD architecture all cores
are locked into executing the same instruction, so you mask out the
operation with a 'nop' for one branch at a time. The more complicated
your branching tree is, the more expensive this gets. Many of the
problems where you see GPUs smite the competition are those where you
have relatively few branches and can process data in a streamed
fashion, more or less. Video decoding is an excellent example.

Modern GPUs can load more than one program, so only part of the GPU
has to run in lockstep, but this is by no means easy to exploit and
even harder to automate in a compiler. Furthermore, modern GPUs
increasingly accept computational error. If the output goes to a
screen, a single-pixel error for 1/60th of a second isn't noticeable.
It has been speculated that you could still use the GPU for more exact
computation in the future, though: if you can bound the error by some
measurement, you can run the computation on the GPU and then fix up
the small errors on the CPU afterwards. I've seen this trick used in
another setting: approximate integer calculations on FP hardware
(SSE3), with the integer unit then fixing up the cases where
computational error might have crept in.
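To make that trick concrete, here is a minimal Erlang sketch of the
same pattern (a hypothetical approx_div module, not the actual SSE3
code): do the bulk of an integer division in floating point, then
correct any rounding error with exact integer arithmetic.

-module(approx_div).
-export([div_fixup/2]).

%% Fast, possibly inexact guess using FP division, followed by an
%% exact integer fixup of whatever error the FP step introduced.
div_fixup(N, D) when is_integer(N), is_integer(D), N >= 0, D > 0 ->
    Q0 = trunc(N / D),
    fixup(N, D, Q0).

fixup(N, D, Q) when Q * D > N        -> fixup(N, D, Q - 1);
fixup(N, D, Q) when (Q + 1) * D =< N -> fixup(N, D, Q + 1);
fixup(_, _, Q)                       -> Q.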


The key difference between Erlang and SIMD is that while SIMD lends
itself to parallel computation, it does not lend itself to concurrency
that well. In Erlang's case we would like parallelism and concurrency
at the same time, which is to some extent easier to pull off on MIMD
architectures.
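For contrast, here is a minimal Erlang sketch of the MIMD style the
language assumes (illustrative do_x/do_y stand-ins, not real
workloads): each spawned process evaluates its own case expression and
takes its own branch, with no lockstep between processes.

-module(mimd_demo).
-export([run/1]).

%% Spawn N processes; each branches independently of the others and
%% reports its result back to the parent.
run(N) ->
    Parent = self(),
    [spawn(fun() ->
               R = case X rem 2 of
                       0 -> do_x(X);
                       1 -> do_y(X)
                   end,
               Parent ! {X, R}
           end) || X <- lists:seq(1, N)],
    [receive {X, R} -> R end || X <- lists:seq(1, N)].

do_x(X) -> {took_true_branch, X}.
do_y(X) -> {took_false_branch, X}.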

In other words: SIMD is very cool (and blazingly fast!) for some
problems. Yet it probably does not play to Erlang's forte, and it
would take considerable machinery to add while it is not clear the
addition would provide any benefit.



-- 
J.

