[erlang-questions] Parallel List Comprehension

Fri Oct 13 14:10:58 CEST 2006

Sounds sort of like overkill doesn't it?  I want to be able to read in a 
GB-sized binary, chop it into a list of sub-binaries very quickly and 
then filter some subset of the 1M+ elements that result.  The final step 
would be to hang on to the 15 elements I care about, drop all references 
to the original binary and purge the excess from memory.

It would be nice if my shiny new 80-core Intel chip could help me out here!

Granted, if my architecture is such that it requires copying the initial 
data it might not pay off well.  But let's take the current situation 
where I can have 4, 8 or 16 schedulers and presumably can arrange that 
very large binary to be shared.  As long as my elements are big enough, 
they will also be pointing to the original shared binary.  The trick now 
is returning the resulting list.

The 2nd most straight forward approach on four cores would give me (I 
only wrote it this way to show why there are 4 different results that 
are each lists -- Core1 etc represents the first element, Results1 
represents the rest of the list):

[[Core1 | Results1], [Core2 | Results2], [Core3 | Results3], [Core4 | 
Results4]]

(because the 1st most straight forward approach would give me 4 lists 
which would get appended or flattened and result in rewriting those 
millions of cons cells, possibly running out of memory in the process).

Not particularly appealing.  I could accomplish this by slicing the list 
myself, spinning off separate processes and collecting the results.  If 
I could query for the number of schedulers it would be somewhat flexibly 
but still icky.

Now suppose the compiler or runtime could arrange this for me very 
efficiently.  The best I could come up with is an open-cdr coded list 
that gets nconc'ed to reuse the space (hmm, that probably makes no sense 
unless you happen to be an old time lisper).

Suppose the runtime determined there would be a benefit in parallelizing 
the comprehension*.  It could construct an internal "list of open ended 
elements" which was some new internal data type that cannot be generated 
any other way than by a list comprehension.  In the above example, the 
lists are illegal in the following way:

every ResultsX has as its final two elements [LastElement . OpenCdrFlag]

OpenCdrFlag would be a new internal atom that represents a lazy value 
that has not been supplied yet.  A list terminated by OpenCdrFlag can 
only be a member of the internal "list of open ended elements" or a 
return value.

The recipe is like this:

1) Determine how many schedulers to use
2) Split the list into non-overlapping serial sections (could be 
pipelined to each scheduler and have it bite off just the head number of 
elements it needs before passing it on)
3) Run a list comprehension in parallel for each chunk
4) Return each result as an OpenCdrList
5) Collect the results in an internal "list of open ended elements" in 
serial order
6) nconc them together (replace the OpenCdrFlag with a pointer to the 
next open list and null in the last)

An "open ended element" should have a header field that indicates the 
location of the OpenCdrFlag so the append is very fast.

Similar thinking should go into the plans for binary comprehensions.

--------
*Hand waving here...

Ok, initially it might be easiest to let the user decide by using [ 2*X 
||| X <-List]
 where the three vertical bars means do it in parallel.  This would 
allow an experimental implementation that wouldn't impact anyone.  Later 
we could try to make it more automatic.

[The problem is if you have to scan the list to see how big it is 
relative to the number of processors, you might as well not parallelize 
or do the scanning in parallel and abort the comprehension if it will 
take too long.]

---------

Any thoughts?

jay