[erlang-questions] 1000 cores on a chip

Banibrata Dutta banibrata.dutta@REDACTED
Mon Jan 10 05:38:11 CET 2011


Thanks, Jesper. That was very informative and enlightening.

Given your description, it clearly looks like Erlang (on general-purpose
multi-core CPUs) combined with specific algorithms running on CUDA(-like)
architectures could make for a very powerful mix.

BTW, a slightly off-topic note that many on this list may already have
seen: Amazon has started offering CUDA GPU processing power as part of its
AWS offering.

On Sat, Jan 8, 2011 at 7:12 AM, Jesper Louis Andersen <
jesper.louis.andersen@REDACTED> wrote:

> On Fri, Jan 7, 2011 at 04:30, Banibrata Dutta <banibrata.dutta@REDACTED>
> wrote:
>
> > However, that many cores isn't new in the GPU world; CUDA has shown
> > that to us for a while. People are increasingly using GPUs for
> > general computation, and a while back I came across a startup that
> > was using the extreme parallelism (in a box) offered by CUDA for
> > video analytics and video encoding/decoding. I'm not sure how well
> > FPGAs really fare w.r.t. power consumption in such setups, but the
> > CUDA GPU units do consume a good bit of power.. and generate a good
> > bit of heat.
>
> The most important difference is that GPUs tend to be SIMD
> architectures with almost no branching instructions. This 1000-core
> beast is much closer to a thousand cores each running their own
> individual program, which makes it a better fit for Erlang as it is
> now.
> Suppose you have
>
> case Foo of
>  true -> do_x();
>  false -> do_y()
> end
>
> In (most) GPUs this is compiled into:
>
> case Foo of
>  true -> do_x();
>  false -> nop
> end,
> case Foo of
>  true -> nop;
>  false -> do_y()
> end
>
> So essentially the code runs twice: once for the true branch and once
> for the false branch. Why? Because in a SIMD architecture all cores
> are locked to execute the same instruction, so you mask out the
> operation with a 'nop' for one branch at a time. The more complicated
> your branching tree is, the more expensive it is to do this. Many of
> the problems where you see GPUs smite the competition are those where
> you have relatively few branches and can process data in a streamed
> fashion, more or less. Video decoding is an excellent example.
>
> Modern GPUs can load more than one program, so only parts of the GPU
> have to run in lockstep with each other, but this is by no means easy
> to carry out and even harder to make automatic by compilation.
> Furthermore, modern GPUs will to a greater and greater extent accept
> computational error. If the output goes to a screen, a single-pixel
> error for 1/60th of a second isn't noticeable. It has been speculated
> that you could still use the GPU for more exact computation in the
> future, though: if you can bound the error by some measurement, you
> can run the computation and then fix up the smaller errors on the CPU
> afterwards. I've seen this trick used in another setting: approximate
> integer calculations on FP hardware (SSE3), with a fixup using the
> integer unit in the cases where computational error might have crept
> in.
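>
> A rough sketch of that fixup idea in Erlang (illustrative only, not
> the SSE3 trick itself): IEEE doubles represent integers exactly up to
> 2^53, so as long as both operands are small enough that the product
> cannot reach that bound, the float multiply is provably exact, and
> everything else takes the exact bignum path.
>
> -define(SAFE, 67108864).  %% 2^26, so products stay well below 2^53
>
> %% Multiply via (fast) float arithmetic when provably exact; fix up
> %% everything else with Erlang's exact integer arithmetic.
> mul(A, B) when abs(A) < ?SAFE, abs(B) < ?SAFE ->
>     trunc(float(A) * float(B));
> mul(A, B) ->
>     A * B.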
>
>
> The key difference between Erlang and SIMD is that while SIMD lends
> itself to parallel computation, it does not lend itself to concurrency
> that well. In Erlang's case we would like parallelism and concurrency
> at the same time, which is to some extent easier to pull off on MIMD
> architectures.
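>
> A minimal sketch of that MIMD style, reusing the hypothetical
> do_x/do_y functions from above (here taking an argument): each
> process runs its own program and branches freely, with no lockstep
> and no nop masking.
>
> Parent = self(),
> Pids = [spawn(fun() ->
>             %% each process picks its own branch, concurrently
>             Result = case X rem 2 of
>                          0 -> do_x(X);
>                          1 -> do_y(X)
>                      end,
>             Parent ! {self(), Result}
>         end) || X <- lists:seq(1, 1000)],
> %% collect the answers, one per process
> [receive {Pid, Result} -> Result end || Pid <- Pids].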
>
> In other words: SIMD is very cool (and blazingly fast!) for some
> problems. Yet it is probably not a good fit for Erlang's forte, and
> it would take considerable machinery to add, while it is not clear
> the addition would provide any benefit.
>
>
>
> --
> J.
>



-- 
regards,
Banibrata
http://www.linkedin.com/in/bdutta
http://twitter.com/edgeliving

