Erlang on the Niagara

Thu Jun 8 16:50:11 CEST 2006

I've been having fun :-)

How can you parallelise an existing (Erlang) program?

As an experiment I parallelised my document generator, a program
which batch-converts my new wiki-file documentation format into HTML.

As a small test I ran this on a directory of 63 wiki files, which were
converted into HTML files totalling 1.35 MBytes of HTML (i.e. 21.9
KBytes/file on average). This is a very typical batch processing
task, with an extremely simple logical structure.

I usually write batch converters something like this:

        all() ->
            Dir = "../wik",
            %% filelib:wildcard/1 returns the paths matching the pattern
            Files = filelib:wildcard(filename:join(Dir, "*.wik")),
            lists:foreach(fun(I) -> xform_file(I) end, Files),
            ...

Parallelising this is easy. All I needed was a simple "parallel
foreach"; actually I just made a pmap (parallel map) and used that
instead. pmap is easy:

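%% pmap(F, L): apply F to every element of L in parallel,
%% returning the results in the same order as L.
pmap(F, L) ->
    S = self(),
    Pids = lists:map(fun(I) ->
                             spawn(fun() -> do_f(S, F, I) end)
                     end, L),
    gather(Pids).

%% Collect one reply per worker, in the order the workers were spawned.
gather([H|T]) ->
    receive
        {H, Ret} -> [Ret|gather(T)]
    end;
gather([]) ->
    [].

%% Run F(I) in the worker and send the result (or the caught
%% exception) back to the parent, tagged with the worker's pid.
do_f(Parent, F, I) ->
    Parent ! {self(), (catch F(I))}.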

Then I just replaced the call to lists:foreach with a call to pmap.
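
As a rough sketch of what that change looks like, here are the two
versions side by side in one small module. The module name pmap_demo,
the function names all_seq/all_par and the dummy xform_file are made up
for illustration (the real converter is not shown here), and pmap is
written in a compact form that behaves like the one above.

```erlang
-module(pmap_demo).
-export([all_seq/1, all_par/1, pmap/2]).

%% Sequential version: convert one file after another.
all_seq(Files) ->
    lists:foreach(fun(I) -> xform_file(I) end, Files).

%% Parallel version: the only change is foreach -> pmap.
all_par(Files) ->
    pmap(fun(I) -> xform_file(I) end, Files).

%% Hypothetical stand-in for the real converter.
xform_file(File) ->
    {converted, File}.

%% One process per list element; results come back in list order.
pmap(F, L) ->
    S = self(),
    Pids = lists:map(fun(I) ->
                             spawn(fun() -> S ! {self(), (catch F(I))} end)
                     end, L),
    [receive {Pid, Ret} -> Ret end || Pid <- Pids].
```

One difference worth keeping in mind: lists:foreach returns ok, whereas
pmap returns the list of results, and because of the catch a crash in
one worker shows up as an {'EXIT', ...} term in that list instead of
killing the caller.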

Now I could run on the Niagara.

The results were as follows:

        #CPUs   Speedup
        1       0.953
        2       1.855
        3       2.679
        4       3.44
        5       4.012
        6       4.624
        7       5.093
        8       5.46
        9       5.73
        10      6.11
        11      6.108
        12      6.47
        13      6.58
        14      6.8
        15      6.67
        16      7
        17      6.99
        18      7.29
        19      6.97
        20      6.97
        21      6.74
        22      6.86
        23      7.07
        24      6.85

What I'm measuring here is the speedup as a function of the number of
schedulers enabled in the SMP Erlang. The speedup factor is just the
ratio of the foreach run time to the pmap run time - i.e. I did the
conversion twice, once with foreach and a second time with pmap.
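
A measurement like this could be sketched as follows. This is not
necessarily how the numbers above were produced: the module name
speedup_demo and the arguments SeqFun and ParFun are hypothetical, and
timer:tc/1 is simply one standard way to time a fun.

```erlang
-module(speedup_demo).
-export([speedup/2]).

%% Speedup = sequential wall-clock time / parallel wall-clock time.
%% SeqFun and ParFun are zero-argument funs (hypothetical names) that
%% wrap the foreach-based and the pmap-based conversion run.
speedup(SeqFun, ParFun) ->
    {SeqMicros, _} = timer:tc(SeqFun),
    {ParMicros, _} = timer:tc(ParFun),
    %% Guard against a zero reading on very short runs.
    SeqMicros / max(ParMicros, 1).
```

The number of schedulers is fixed when the emulator starts (the +S flag
to erl sets it), so each row of a table like the one above would need
its own run; erlang:system_info(schedulers) reports the current value.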

As you can see, we get near-linear speedup for 1-7 CPUs, then a gradual
tailing off, with a plateau reached at 16 CPUs. Above 16 CPUs we can do
no better.
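
One way to put a number on the tailing off is the parallel efficiency,
i.e. the speedup divided by the number of schedulers. A minimal sketch
(the module and function names are made up for illustration):

```erlang
-module(efficiency_demo).
-export([efficiency/1]).

%% Rows is a list of {Schedulers, Speedup} pairs taken from the table.
%% Efficiency = Speedup / Schedulers; 1.0 would be perfect scaling.
efficiency(Rows) ->
    [{N, S / N} || {N, S} <- Rows].
```

Fed with the rows {1, 0.953}, {7, 5.093}, {16, 7} and {24, 6.85} from
the table, this gives efficiencies of roughly 0.95, 0.73, 0.44 and 0.29.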

Realistically a speedup of 7 was achieved.

Why can't we speed up beyond a factor of 7? Who knows? At some stage
things do get serialised: there is, after all, only one disk on the
machine, and the SMP Erlang has to serialise all disk I/O even if the
programs generating the data run in parallel. Also, the CPU caches will
get filled and have to swap into shared memory.

Nevertheless, I'm encouraged by this result. Why? Because:

        - This kind of program is typical of a large class
          of programs (i.e. a lot of my programs just do boring things
          to large numbers of files).

          Usually batch processes on individual files are sequential;
          often there is not much intrinsic concurrency in a typical
          file In -> file Out program.

          The concurrency comes from the fact that the top loop of
          the program can be changed from a:

                forAllFiles Do ... end

          structure to a:

                forAllFiles DoInParallel   ... end

          structure.

          Now this only requires a ONE LINE change to the program
          (change a foreach to a pmap).

        - My program went 7 times faster.

          At this stage this is very nice - usually when a program is
          complete there is little that can be done to speed it up,
          since I usually choose my data structures and algorithms with
          some care. Sure, I *could* speed things up with messy optimised
          code, but the code would be less beautiful, and more difficult
          to maintain.

          Changing a foreach to a pmap means the program goes faster
          (x 7) AND stays beautiful - the best of both worlds.