[erlang-questions] Some comments on EEP 9 (binary: module)
Jay Nelson
jay@REDACTED
Sat Mar 8 20:33:32 CET 2008
> > 7. "nth" is a very strange name for a substring operation.
> > I would prefer
> > subbinary(Binary, Offset[, Length])
> > 0 =< Offset =< byte_size(Binary)
> > 0 =< Length =< byte_size(Binary) - Offset
> > which would make this compatible with the existing
> > split_binary(Binary, Offset)
> > function.
> Agreed, I was not pleased with the name myself. It is a leftover from
> the first version where it picked out one byte at the nth position
> (similar to lists:nth/2 operating on a string). I am planning (as per
> the suggestions here on the mailing list) to split the module into
> two, where one should be called binary_string or similar. In that
> module it should be named and behave like string:sub_string/2,3.
Be careful in choosing the name for the function which extracts a
subsequence from a binary. Erlang already has the concept of a
'subbinary' (I'm not sure whether there is a dash or an underscore,
but I think the docs and code write it as a single word). It
specifically denotes a minimal structure which points into another
binary (or subbinary), so that no contiguous elements are copied to
create a new binary.
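
A minimal sketch of how that sharing can be observed. It assumes
binary:copy/2 and binary:referenced_byte_size/1 from the binary module
as it later shipped in Erlang/OTP, which postdates this thread; the
module and function names below are hypothetical.

```erlang
-module(subbinary_demo).
-export([shares_storage/0]).

%% Returns {byte_size(Sub), BytesOfTheBinaryThatSubPointsInto}.
%% Assumes binary:copy/2 and binary:referenced_byte_size/1 are available.
shares_storage() ->
    %% 1000 bytes is large enough to be stored off-heap as a
    %% reference-counted binary.
    Big = binary:copy(<<"x">>, 1000),
    %% Matching out a 200-byte segment is expected to build a subbinary
    %% that points into Big rather than copying the 200 bytes.
    <<_:100/binary, Sub:200/binary, _/binary>> = Big,
    {byte_size(Sub), binary:referenced_byte_size(Sub)}.
```

If Sub really is a subbinary, the second element of the tuple is the
size of Big rather than the size of Sub, which is exactly why holding
on to a small subbinary can keep a large binary alive.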
Whatever nomenclature is chosen, the semantics of subbinary should be
preserved (possibly even to the point of having a separate module
called subbinary which guarantees to operate on them in an efficient
manner and identifies explicitly when a binary is returned and when a
subbinary is returned).
So, I am proposing that if a function such as binary:subbinary/2,3 is
provided, it be documented and guaranteed that it doesn't copy the
binary elements in constructing a result. If a new binary with copy
semantics is the desired result, a different function name (for
example, binary:copy_slice/2,3 or binary:copy_subseq/2,3) be provided.
Likewise binary_string:subbinary/2,3 would not copy, while
binary_string:sub_string/2,3 would.
I'm not sure whether the copying distinction requires two sets of
functions, or whether a single binary:copy_binary/1 function could do
the dirty work when needed, so that all subseq operations return a
real 'subbinary' and the user copies explicitly when desired. In
general, I think the latter approach would be best, since it keeps
the module's interface to a minimal set of functions.
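
A rough sketch of that minimal interface, assuming the Offset and
Length bounds quoted above. The module name subseq_sketch is
hypothetical, subbinary/2,3 and copy_binary/1 are the names floated in
this thread rather than an existing API, and copy_binary/1 simply
delegates to binary:copy/1 from the binary module as it later shipped
in Erlang/OTP.

```erlang
-module(subseq_sketch).
-export([subbinary/2, subbinary/3, copy_binary/1]).

%% Hypothetical non-copying slice from Offset to the end of Binary.
subbinary(Binary, Offset) when is_binary(Binary), is_integer(Offset) ->
    subbinary(Binary, Offset, byte_size(Binary) - Offset).

%% Bounds as proposed in the quoted text:
%%   0 =< Offset =< byte_size(Binary)
%%   0 =< Length =< byte_size(Binary) - Offset
%% Anything outside those bounds fails with function_clause.
subbinary(Binary, Offset, Length)
  when is_binary(Binary), is_integer(Offset), is_integer(Length),
       Offset >= 0, Length >= 0,
       Offset + Length =< byte_size(Binary) ->
    %% Bit-syntax matching normally yields a subbinary pointing into
    %% Binary; whether it does is up to the runtime, not this code.
    <<_:Offset/binary, Sub:Length/binary, _/binary>> = Binary,
    Sub.

%% Explicit copy for callers who want storage independent of the source.
copy_binary(Binary) when is_binary(Binary) ->
    binary:copy(Binary).
```

With this split, subseq_sketch:subbinary(<<"abcdefg">>, 2, 3) returns
<<"cde">>, and a caller who wants to let go of the source binary wraps
the result in copy_binary/1.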
------
A scenario I currently use is to read a text file as a single binary,
scan it to create a list of subbinaries (for example, of all the
configuration terms and values), then filter for some subset of the
list which I want to continue using. I then would like to discard
the large binary and all the unused subbinaries. It is almost an
explicit garbage collection of one structure, driven by
application-specific knowledge.
A BIF should be provided which guarantees the return of a deep list
of fresh copies of the binaries passed to it in a deep list:
binary:copy_binaries(DeepListOfBinaries)
The application code would be something like this:
    get_config_params() ->
        BigBin = load_binary(...),
        ParsedBins = parse_binary(BigBin),
        Keepers = filter_config_params(ParsedBins),
        FreshBins = binary:copy_binaries(Keepers).
On return, the references to BigBin and all subbinaries parsed out
are dropped, so only the FreshBins will be kept on the next garbage
collection sweep. The key is to guarantee all references to BigBin
are eliminated by copying the subbinaries to fresh memory.
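
For illustration, the proposed copy_binaries could be approximated in
plain Erlang rather than as a BIF. This is only a sketch: the module
name copy_bins_sketch is hypothetical, and it relies on binary:copy/1
from the binary module as it later shipped in Erlang/OTP to produce
the fresh copies.

```erlang
-module(copy_bins_sketch).
-export([copy_binaries/1]).

%% Walks a deep list of binaries and returns a list of the same shape
%% in which every binary has been replaced by a fresh copy.
%% Anything that is neither a list nor a binary fails with function_clause.
copy_binaries(Bin) when is_binary(Bin) ->
    %% binary:copy/1 is assumed to allocate new storage for Bin.
    binary:copy(Bin);
copy_binaries([Head | Tail]) ->
    [copy_binaries(Head) | copy_binaries(Tail)];
copy_binaries([]) ->
    [].
```

A call such as copy_bins_sketch:copy_binaries(Keepers) would stand in
for binary:copy_binaries(Keepers) in the function above. A real BIF
could presumably do the same walk without building intermediate terms
in Erlang.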
jay