[erlang-questions] Intel Quad CPUs (NVIDIA Tesla & Tilera TILE64)

G Bulmer gbulmer@REDACTED
Tue Sep 4 00:01:05 CEST 2007


> Date: Mon, 3 Sep 2007 18:43:08 +0800
> From: "Hugh Perkins" <hughperkins@REDACTED>
> Subject: Re: [erlang-questions] Intel Quad CPUs
>
> Almost on topic, what could be the benefits and the challenges of
> getting Erlang working on an nVidia  Tesla card?
>
> These cards have between 8 and 128 cores, depending on how you look at
> it. (16 multiprocessors, each running warps of up to 32 threads, every
> 2 clock cycles).

I've been working with CUDA on NVIDIA 8800s, which use the same core
technology as the NVIDIA Tesla.
I don't believe Erlang is a good fit for the underlying hardware.
Erlang processes map nicely onto independent instruction streams, but
they are control-stream oriented, whereas the 8800 is data-stream
oriented.

An NVIDIA 8800GTX has 128 ALUs, but it concurrently executes only 16
instruction streams, one per 'multi-processor' (8 ALUs per multi-processor).
Threads and warps are an abstraction on top of the hardware, created
by the CUDA language and run-time to make it practical to extract
most of the hardware's performance.
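
For concreteness, here is roughly what that abstraction looks like from
the CUDA side. This is a minimal sketch of my own; the kernel, names and
launch sizes are illustrative, not taken from NVIDIA's examples.

#include <cuda_runtime.h>

/* Minimal sketch (sizes and names are illustrative): each block of
 * threads is scheduled onto one multi-processor, and the block's
 * threads are issued by the hardware in warps of 32. */
__global__ void scale(float *data, float factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   /* one element per thread */
    if (i < n)
        data[i] *= factor;
}

int main(void)
{
    const int n = 16 * 256;                 /* 16 blocks x 256 threads */
    float *d;
    cudaMalloc((void **)&d, n * sizeof(float));
    cudaMemset(d, 0, n * sizeof(float));

    /* 16 blocks (one per multi-processor on an 8800GTX) of 256 threads
     * each, i.e. 8 warps of 32 per block. */
    scale<<<16, 256>>>(d, 2.0f, n);
    cudaThreadSynchronize();

    cudaFree(d);
    return 0;
}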

IMHO, getting at the full 8800 performance relies on aligning with a
few key hardware constraints and mechanisms (points 1-3 are illustrated
in the small CUDA sketch after the list):
1. All of the code on a multi-processor executes in lock-step; either
all of the ALUs execute the same instruction, or a subset sit idle
(and you want to maximise the number of ALUs performing work).
2. Memory loads for the ALUs must be to sequential addresses to get
the full chip-to-memory bandwidth.
3. 'Shared' memory access should be sequential, and used a lot, to
reach its 1TB/second bandwidth.
4. 'Texture memory' access should be 'spatially localised' to exploit
caching.
5. 'Constant memory' access should be synchronised to exploit caching.
6. 'Blocks' of computation should fit into the register set of a
multi-processor.
7. Limited external communications - the 8800 relies on the host for
network and disk IO, and the 'pipe' isn't as 'fat' as one might like
(though fatter than GigEthernet).
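
To make points 1-3 concrete, here's a small CUDA sketch of my own
(nothing official): a block-wise sum where consecutive threads read
consecutive addresses and the partial sums live in 'shared' memory. It
assumes a launch with 256 threads per block.

#include <cuda_runtime.h>

/* Illustration of points 1-3 (my own sketch, not NVIDIA SDK code).
 * Consecutive threads read consecutive global addresses, so the loads
 * coalesce into wide memory transactions; the partial sums live in the
 * on-chip 'shared' memory; the __syncthreads() barriers keep the
 * lock-step warps in step.  Assumes 256 threads per block. */
__global__ void block_sum(const float *in, float *out, int n)
{
    __shared__ float partial[256];

    int tid = threadIdx.x;
    int i   = blockIdx.x * blockDim.x + tid;      /* sequential addresses */

    partial[tid] = (i < n) ? in[i] : 0.0f;        /* coalesced global load */
    __syncthreads();

    /* Tree reduction in shared memory.  At each step half the threads do
     * no work, which is exactly the 'subset of ALUs idle' cost of point 1. */
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (tid < stride)
            partial[tid] += partial[tid + stride];
        __syncthreads();
    }

    if (tid == 0)
        out[blockIdx.x] = partial[0];             /* one result per block */
}

Break the sequential-address pattern and the same kernel loses most of
its memory bandwidth, which is why point 2 matters so much in practice.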

These are pretty low-level concepts, and I do not see any software-level
(vanilla) Erlang support which could exploit these properties (we could,
of course, add features like those of data-parallel Haskell to Erlang).

So, you could run Erlang with one process per multi-processor, ignoring
7 of the 8 ALUs and some of the hardware 'features'. I have no problem
with that; it's just worth thinking through. I would hope that each
8800GTX run this way would be within +/- 3x the performance of an Intel
quad core: Intel runs at about 2x the ALU clock of a GTX, each core of a
Quad Core has multiple ALUs which can be exploited by ILP in hardware,
and the Quad Core's caching is managed below the instruction level, so
it would 'just work'.
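
As a back-of-envelope check (the roughly 2 ops/cycle from ILP is my
assumption, not a measurement):

  8800GTX, one process per multi-processor: 16 streams x 1 ALU             =  16 ops per GTX clock
  Intel Quad Core:                           4 cores x ~2 (ILP) x ~2x clock ~= 16 ops per GTX clock

So the two land in the same ballpark, which is why I'd only claim
+/- 3x either way.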

IMHO, running Erlang processes on an NVIDIA 8800/Tesla could be done
(while ignoring much of the specialised hardware), but NVIDIA haven't
released full processor details, so we'd be restricted to programming
in CUDA, or the PTX pseudo-assembler, with only partial information.
So, it'd likely be noticeably slower than an Intel Quad Core for
vanilla Erlang code.

Having said all of that, I might be interested in having a crack at it!


IMHO, a much more interesting chip for Erlang is the Tilera TILE64:
http://www.tilera.com/products/processors.php

This seems like a very good fit: lots (64) of independent processors
with a very high-speed mesh interconnect. If anyone would like to buy
me a TILExpress-64 card (http://www.tilera.com/products/boards.php),
I'd be very happy to investigate putting Erlang on it.

Garry



