Binary comprehensions

Fri Aug 13 16:50:57 CEST 2004

Thomas wrote:

> For some special cases, a list comprehension wrapped
> in binary_to_list/list_to_binary would suffice. E.g.,
> 
> list_to_binary([ foo(X) || X <- binary_to_list(Bin)])

Code-wise it reads ok, but if I am a middle-man process
translating from one socket to another it introduces a lot
of overhead.  I would like to receive one binary, transform
and send out the resulting binary in a single step.

> - extracting the bytes from the starting binary: using
> binary_to_list, there is the overhead of an
> intermediate list and the limitation of only
> extracting bytes; better compilation might get rid of
> this. Or maybe just a generalized binary_to_list which
> splits binaries into user-defined chunks. (And a
> library of higher-order operations on binaries?)

I would say standard erlang style says convert to a list
and use the list built-ins, so the only reason to want
binary comprehensions is for speed.  I assumed byte as the
default extraction of elements, but allowed for the size
to be specified as in  || X:4/integer <- Bin  which the
compiler would have to work out.

Either way, I expected the number of elements to be pre-
computable.  As soon as you break the binary into chunks
you need to start allocating the chunks.  In the way I
specified the result is a single binary and I would expect
the original to not be copied, the result to be pre-
allocated and the intermediate transformations to consume
at most one transient binary (although the function call
can do whatever it wants to that binary including unpacking
it building big lists, reassembling and sending back a
result).

> - building the final binary: if we want to directly
> construct the final binary, we have to know an upper
> bound of its size (otherwise we cannot preallocate
> memory for it). Since we can't do that, the general
> case may have to stitch together smaller binaries.
> (Does Erlang/OTP have "heap binaries" these days?)

I made a simplification in my mind that might prove to be
too limiting, but I expect it would cover a lot of useful
situations.  Whatever size of item is extracted on the right
of the || is the size of the transformed elements on the left.
The result binary can be preallocated (even on the stack if
it is thrown away soon after), and then destructive operations
used to set it.  Allow the function on the left to do
whatever it wants but coerce the result to match the size
of the extracted element when constructing the final binary.

This approach seems internally consistent and expected.
For a more general approach, there is no syntactic shortcut --
convert the binary to a list, have at it and generate a
new binary using existing erlang constructs.

> So, my guess is: binary comprehensions can be a
> pleasant notation, but performance wins compared to
> list comprehensions are unclear. It will need a bit of
> compiler work to do well.

> On the other hand, good notation is an end in itself,
> and it might be blazingly fast on selected code :-)

The commonly useful cases are translating binary file
to another format, translating binary streamed data
going from process to process (my pet project), and
manipulating fixed length binary formatted data.

For example (assuming lots of memory, but only 2x
what is needed not more than 3x as might be needed to
convert to a list and back again):

PhotoSize = 1024*1024,
ValidPhotos = << filter(Rec) || Rec:PhotoSize/binary <-
                   ok2(file:read_file("mars_data") >>,
relay_mars_data_from_moon(ValidData).

[Of course, this should be read a record at a time from
the file in binary mode, but you get the idea.]

Luke wrote:

> Speaking of comprehensions, does anyone else think
> that the way multiple generators in a list comprehension
> works is weird?

Definitely.  Haven't come up with a need for that yet.

> I'm always writing code that walks down two lists in
> parallel and very rarely wanting all the combinations.

Common idiom.  Making it easy and clear to specify might
make it more common.  I might even want to change a design
if it made the result clearer.  I've really gotten to like
comprehensions so much that I don't even use lists:foreach
(if I don't pattern-match to a variable it is clear I am
only using it for side-effects).

> Syntactic sugar should optimize for the most common
> cases, right?

I guess that's what I'm advocating in the binary comprehension.
In this case it could clarify code, but the main reason is
for super fast channeling of data in a functional language.

jay