[erlang-questions] Some comments on EEP 9 (binary: module)

Jay Nelson jay@REDACTED
Sat Mar 8 20:33:32 CET 2008


 > >  7.  "nth" is a very strange name for a substring operation.
 > >      I would prefer
 > >         subbinary(Binary, Offset[, Length])
 > >         0 =< Offset =< byte_size(Binary)
 > >         0 =< Length =< byte_size(Binary) - Offset
 > >      which would make this compatible with the existing
 > >         split_binary(Binary, Offset)
 > >      function.

 > Agreed, I was not pleased with the name myself. It is a leftover from
 > the first version where it picked out one byte at the nth position
 > (similar to lists:nth/2 operating on a string). I am planning (as per
 > the suggestions here on the mailing list) to split the module into
 > two, where one should be called binary_string or similar. In that
 > module it should be named and behave like string:sub_string/2,3.


Be careful in choosing the name for the function which extracts a
subsequence from a binary.  Erlang already has the concept of a
'subbinary' (I am not sure whether it is written with a dash or an
underscore, but I think the docs and code run it together as one
word): a minimal structure which points into another binary (or
subbinary), so that no contiguous elements are copied to create a
new binary.

Whatever nomenclature is chosen, the semantics of subbinary should be  
preserved (possibly even to the point of having a separate module  
called subbinary which guarantees to operate on them in an efficient  
manner and identifies explicitly when a binary is returned and when a  
subbinary is returned).

So, I am proposing that if a function such as binary:subbinary/2,3 is
provided, it be documented and guaranteed that it doesn't copy the
binary elements in constructing a result.  If a new binary with copy
semantics is the desired result, a different function name (for
example, binary:copy_slice/2,3 or binary:copy_subseq/2,3) should be
provided.

Likewise binary_string:subbinary/2,3 would not copy, while  
binary_string:sub_string/2,3 would.

I'm not sure whether the copying distinction requires two sets of
functions, or whether a single binary:copy_binary/1 function could
do the dirty work when needed, so that all subseq operations return
a real 'subbinary' and the user copies explicitly when desired.  In
general, I think the latter approach is best, since it keeps the
module signature to a minimal set of functions.
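
To make the distinction concrete, here is a minimal sketch.  It assumes
an OTP release whose binary module provides part/3, copy/1 and
referenced_byte_size/1; those are not the names proposed above, only
the closest existing calls I am sure of.  The module name
subbinary_sketch and the function demo/0 are made up for illustration.

```erlang
-module(subbinary_sketch).
-export([demo/0]).

%% Returns the size and the referenced size of a slice taken from a
%% large binary, before and after an explicit copy.
demo() ->
    %% A 100000-byte binary, large enough to live off the process heap.
    Big = binary:copy(<<"x">>, 100000),
    %% A 1000-byte slice starting at offset 10; this is expected to
    %% share storage with Big rather than copy the bytes.
    Sub = binary:part(Big, 10, 1000),
    %% An explicit copy into fresh memory, independent of Big.
    Copy = binary:copy(Sub),
    [{sub, byte_size(Sub), binary:referenced_byte_size(Sub)},
     {copy, byte_size(Copy), binary:referenced_byte_size(Copy)}].
```

If the slice really is a subbinary, its referenced size should be that
of Big, while the copy should reference only its own 1000 bytes.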

------

A scenario I currently use is to read a text file as a single binary,  
scan it to create a list of subbinaries (for example, of all the  
configuration terms and values), then filter for some subset of the  
list which I want to continue using.  I would then like to discard
the large binary and all the unused subbinaries.  It is almost an
explicit garbage collection of one structure using
application-specific knowledge.

A BIF should be provided which guarantees the return of a deep list  
of fresh copies of the binaries passed to it in a deep list:

binary:copy_binaries(DeepListOfBinaries)

The application code would be something like this:

get_config_params() ->
    BigBin = load_binary(...),
    ParsedBins = parse_binary(BigBin),
    Keepers = filter_config_params(ParsedBins),
    FreshBins = binary:copy_binaries(Keepers).

On return, the references to BigBin and all subbinaries parsed out  
are dropped, so only the FreshBins will be kept on the next garbage  
collection sweep.  The key is to guarantee all references to BigBin  
are eliminated by copying the subbinaries to fresh memory.
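
For illustration only, the proposed function could be approximated in
plain Erlang rather than as a BIF.  This sketch assumes binary:copy/1
is available; the module name copy_binaries_sketch is hypothetical, and
it handles only proper (possibly nested) lists whose leaves are all
binaries.

```erlang
-module(copy_binaries_sketch).
-export([copy_binaries/1]).

%% Walk a deep list and replace every binary with a fresh copy, so
%% that the result holds no reference to the binaries it was cut from.
%% Anything other than a binary or a proper list raises function_clause.
copy_binaries(Bin) when is_binary(Bin) ->
    binary:copy(Bin);
copy_binaries([]) ->
    [];
copy_binaries([Head | Tail]) ->
    [copy_binaries(Head) | copy_binaries(Tail)].
```

The shape of the deep list is preserved, so Keepers and the result can
be traversed in exactly the same way by the calling code.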

jay



