[erlang-questions] Parallel List Comprehension
Jay Nelson
jay@REDACTED
Fri Oct 13 14:10:58 CEST 2006
Sounds sort of like overkill doesn't it? I want to be able to read in a
GB-sized binary, chop it into a list of sub-binaries very quickly and
then filter some subset of the 1M+ elements that result. The final step
would be to hang on to the 15 elements I care about, drop all references
to the original binary and purge the excess from memory.
It would be nice if my shiny new 80-core Intel chip could help me out here!
Granted, if my architecture is such that it requires copying the initial
data it might not pay off well. But let's take the current situation
where I can have 4, 8 or 16 schedulers and presumably can arrange that
very large binary to be shared. As long as my elements are big enough,
they will also be pointing to the original shared binary. The trick now
is returning the resulting list.
The 2nd most straight forward approach on four cores would give me (I
only wrote it this way to show why there are 4 different results that
are each lists -- Core1 etc represents the first element, Results1
represents the rest of the list):
[[Core1 | Results1], [Core2 | Results2], [Core3 | Results3], [Core4 |
Results4]]
(because the 1st most straight forward approach would give me 4 lists
which would get appended or flattened and result in rewriting those
millions of cons cells, possibly running out of memory in the process).
Not particularly appealing. I could accomplish this by slicing the list
myself, spinning off separate processes and collecting the results. If
I could query for the number of schedulers it would be somewhat flexibly
but still icky.
Now suppose the compiler or runtime could arrange this for me very
efficiently. The best I could come up with is an open-cdr coded list
that gets nconc'ed to reuse the space (hmm, that probably makes no sense
unless you happen to be an old time lisper).
Suppose the runtime determined there would be a benefit in parallelizing
the comprehension*. It could construct an internal "list of open ended
elements" which was some new internal data type that cannot be generated
any other way than by a list comprehension. In the above example, the
lists are illegal in the following way:
every ResultsX has as its final two elements [LastElement . OpenCdrFlag]
OpenCdrFlag would be a new internal atom that represents a lazy value
that has not been supplied yet. A list terminated by OpenCdrFlag can
only be a member of the internal "list of open ended elements" or a
return value.
The recipe is like this:
1) Determine how many schedulers to use
2) Split the list into non-overlapping serial sections (could be
pipelined to each scheduler and have it bite off just the head number of
elements it needs before passing it on)
3) Run a list comprehension in parallel for each chunk
4) Return each result as an OpenCdrList
5) Collect the results in an internal "list of open ended elements" in
serial order
6) nconc them together (replace the OpenCdrFlag with a pointer to the
next open list and null in the last)
An "open ended element" should have a header field that indicates the
location of the OpenCdrFlag so the append is very fast.
Similar thinking should go into the plans for binary comprehensions.
--------
*Hand waving here...
Ok, initially it might be easiest to let the user decide by using [ 2*X
||| X <-List]
where the three vertical bars means do it in parallel. This would
allow an experimental implementation that wouldn't impact anyone. Later
we could try to make it more automatic.
[The problem is if you have to scan the list to see how big it is
relative to the number of processors, you might as well not parallelize
or do the scanning in parallel and abort the comprehension if it will
take too long.]
---------
Any thoughts?
jay
More information about the erlang-questions
mailing list