[erlang-questions] Some comments on EEP 9 (binary: module)

Sat Mar 8 17:56:37 CET 2008

Thanks for looking at the EEP. Please see comments below:

On Fri, Mar 7, 2008 at 4:32 AM, Richard A. O'Keefe <ok@REDACTED> wrote:
> 1.  Given that the module for lists is called 'lists', not 'list',
>      it is rather confusing that the module for binaries is called
>      'binary', instead of the expected 'binaries'.
>
>      On the other hand, given that the module for strings is called
>      "string", maybe it's 'lists' that has the wrong name.  Something
>      needs to be done about naming consistency in modules for data
>      types.

I have noted the same thing and decided to make it like string, queue
and array rather than like lists or sets. Binary/binary_string has the
added advantage of also being slightly shorter than
binaries/binary_strings.

>  2.  What do you return if you look for something and it isn't there?
>      For some reason, people seem to like returning an out-of-range
>      index at the wrong end.  BASIC does this, Smalltalk does it, but
>      that does not make it right.  Life gets *so* much simpler if
>      match(Haystack, Needle) returns an index past the *right* end of
>      the haystack.  Suppose, for example, we have input with an
>      optional comment; we want to remove it.
>      [Mind you, EEP 9 is handicapped by starting from a system where
>       binary slicing uses just about the worst possible convention,
>       but that is another and sadder story.]
>      [Oh yes, the documentation for the erlang: module gets
>       erlang:split_binary/2 wrong.  It says that the range for Pos is
>       1..size(Bin), but 0 is *rightly* allowed.  Pos is actually the
>       size of the first part, which is just right.]
>
>      Example.  Suppose we are given a line of text from some
>      configuration file as a binary.  It might contain a # comment or
>      it might not.  Our only interest is in getting rid of it.  In a
>      rational design, where
>         match(Haystack, Needle)
>      returns the length of the longest prefix of Haystack *not*
>      containing Needle, we just do
>         {Wanted,_} = split_binary(Given, match(Given, <<"#">>))
>      With the scheme actually proposed, we have to do
>         case match(Given, <<"#">>)
>            of 0 -> Wanted = Given
>             ; N -> {Wanted,_} = split_binary(Given, N-1)
>          end

There are already two different return values in two different library
modules for the "not found" scenario.

regexp:match/2 -> nomatch
string:[r]str/2 -> 0
string:[r]chr/2 -> 0

I am afraid that adding a third way of marking "not found" in a third
library is only going to add to the confusion.
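
To make the two conventions concrete, here is a minimal sketch of both.
Everything in it is hypothetical and only for illustration: the module
name match_demo and the helpers prefix_len/2 and index_or_zero/2 are not
part of the EEP. prefix_len/2 follows the "length of the prefix before
the needle" idea quoted above, and index_or_zero/2 follows the
string:str/2 style of returning 0. The search is deliberately naive and
assumes a non-empty needle.

```erlang
-module(match_demo).
-export([prefix_len/2, index_or_zero/2,
         strip_comment/1, strip_comment_zero/1]).

%% Hypothetical helper: number of bytes of Haystack that precede the
%% first occurrence of Needle, or byte_size(Haystack) if Needle does
%% not occur at all.
prefix_len(Haystack, Needle) ->
    prefix_len(Haystack, Needle, 0).

prefix_len(Haystack, Needle, Skipped) ->
    NSize = byte_size(Needle),
    case Haystack of
        %% Needle is already bound, so this compares bytes at Skipped.
        <<_:Skipped/binary, Needle:NSize/binary, _/binary>> ->
            Skipped;
        %% At least one more byte left: move one position forward.
        <<_:Skipped/binary, _, _/binary>> ->
            prefix_len(Haystack, Needle, Skipped + 1);
        %% Ran off the end: Skipped is now byte_size(Haystack).
        _ ->
            Skipped
    end.

%% Hypothetical helper in the style of string:str/2: 1-based index of
%% the first occurrence, or 0 if there is none.
index_or_zero(Haystack, Needle) ->
    case prefix_len(Haystack, Needle) of
        N when N + byte_size(Needle) =< byte_size(Haystack) -> N + 1;
        _ -> 0
    end.

%% Comment stripping with the "past the right end" convention.
strip_comment(Given) ->
    {Wanted, _} = split_binary(Given, prefix_len(Given, <<"#">>)),
    Wanted.

%% The same job with the "0 means not found" convention.
strip_comment_zero(Given) ->
    case index_or_zero(Given, <<"#">>) of
        0 -> Given;
        N -> {Wanted, _} = split_binary(Given, N - 1),
             Wanted
    end.
```

Both strip_comment/1 and strip_comment_zero/1 give the same result for
lines with and without a comment; the difference is only in how much
case analysis the caller has to write.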

>  3.  I appreciate that slicing binaries is supposed to be cheap, but I
>      still think it would be nice if match had a 3rd argument, saying
>      how many bytes at the beginning of Haystack to skip.  If it
>      weren't 4pm on a Friday with my office floor still to tidy up, I
>      could give examples of why this can make life simpler.

I am not sure I understand. There is already a match function in the
EEP where you can specify where to start the matching.

<eep>
match(Haystack, Needles, {StartIndex, EndIndex}) -> Return
</eep>

>  4.  I agree that the proposed binary:split/2 function is useful, but
>      the name is far too close to split_binary/2 for comfort.  A
>      longer name such as
>         binaries:split_with_separator(Binary, Separator_Binary)
>      might make for less confusion.  Better still, why not make this
>      like string:tokens/2, which really has exactly the same purpose
>      except for the data type it applies to?

The proposed split function is actually more like regexp:split/2 than
string:tokens/2. As you know, string:tokens/2 takes a list of separator
chars and splits on any of them, which I usually find is *not* what I
want. Naming the new function tokens would therefore give the wrong
associations, since string:tokens/2 already works differently.

string:tokens("cat and dog", "and").
["c","t "," ","og"]
regexp:split("cat and dog", "and").
{ok,["cat "," dog"]}

Then again, as the function is specified, you could mimic the behaviour
of tokens by giving binary:split/2 a list of binaries [<<"a">>,
<<"n">>, <<"d">>], although you would end up with some empty binaries
as well, just like the empty lists in regexp:split/2:

regexp:split("cat and dog","a|n|d").
{ok,["c","t ",[],[]," ","og"]}

Maybe split_with/2 or split_by/2 to keep it reasonably short?
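
As an illustration of the "split on the whole separator" behaviour
(the regexp:split/2 style rather than the string:tokens/2 style), here
is a rough sketch for binaries. The module name split_demo is made up,
split_by/2 is just one of the candidate names used as a placeholder,
and the code is not the EEP's implementation. It assumes a single,
non-empty separator binary and keeps empty parts.

```erlang
-module(split_demo).
-export([split_by/2]).

%% Hypothetical split_by/2: splits Bin at every occurrence of the whole
%% separator Sep (assumed non-empty) and keeps empty parts, much as
%% regexp:split/2 does for strings.
split_by(Bin, Sep) when is_binary(Bin), byte_size(Sep) > 0 ->
    split_by(Bin, Sep, 0, []).

split_by(Bin, Sep, Pos, Acc) ->
    SepSize = byte_size(Sep),
    case Bin of
        %% Separator found at Pos: emit the part before it and continue
        %% with whatever follows the separator.
        <<Part:Pos/binary, Sep:SepSize/binary, Rest/binary>> ->
            split_by(Rest, Sep, 0, [Part | Acc]);
        %% No separator here, but at least one byte left: step forward.
        <<_:Pos/binary, _, _/binary>> ->
            split_by(Bin, Sep, Pos + 1, Acc);
        %% End of input: the remainder is the last part.
        _ ->
            lists:reverse([Bin | Acc])
    end.
```

With this sketch, split_demo:split_by(<<"cat and dog">>, <<"and">>)
gives [<<"cat ">>,<<" dog">>], matching the regexp:split/2 result above
rather than the string:tokens/2 one.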

>  5.  Ever since I met SNOBOL 4, I have known the operation of removing
>      outer blanks from a string as trimming.  It's a little odd to
>      find it called stripping.  By analogy with ecdysiasis (hem hem) I
>      would expect stripping to remove visible outer stuff.  I wish
>      string:strip/[1,2,3] could be renamed.

This is analogous to string:strip/1. While I agree that trim is a
better name, I think that calling it trim in one library and strip in
another will only add confusion and strange questions to
erlang-questions in the future.

>
>  6.  In the functions
>         unsigned_to_bin/1
>         bin_to_unsigned/1
>      why is the word "binary" abbreviated to "bin"?

Agreed. I will change bin to binary.

>  7.  "nth" is a very strange name for a substring operation.
>      I would prefer
>         subbinary(Binary, Offset[, Length])
>         0 =< Offset =< byte_size(Binary)
>         0 =< Length =< byte_size(Binary) - Offset
>      which would make this compatible with the existing
>         split_binary(Binary, Offset)
>      function.

Agreed, I was not pleased with the name myself. It is a leftover from
the first version where it picked out one byte at the nth position
(similar to lists:nth/2 operating on a string). I am planning (as per
the suggestions here on the mailing list) to split the module into
two, where one should be called binary_string or similar. In that
module it should be named and behave like string:sub_string/2,3.
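
For comparison, the 0-based variant suggested in the quoted text could
look roughly like the sketch below. The module name subbinary_demo is
invented for the example and nothing here is the final API. Note that
it uses Offset/Length with the bounds quoted above, not the 1-based
Start/Stop arguments of string:sub_string/2,3. Out-of-range arguments
simply fail with a function_clause error.

```erlang
-module(subbinary_demo).
-export([subbinary/2, subbinary/3]).

%% Hypothetical subbinary/3 with the bounds from the quoted suggestion:
%%   0 =< Offset =< byte_size(Binary)
%%   0 =< Length =< byte_size(Binary) - Offset
subbinary(Binary, Offset, Length)
  when is_binary(Binary),
       is_integer(Offset), Offset >= 0,
       is_integer(Length), Length >= 0,
       Offset + Length =< byte_size(Binary) ->
    <<_:Offset/binary, Sub:Length/binary, _/binary>> = Binary,
    Sub.

%% Two-argument form: everything from Offset to the end, i.e. the
%% second element of split_binary(Binary, Offset).
subbinary(Binary, Offset)
  when is_binary(Binary),
       is_integer(Offset), Offset >= 0,
       Offset =< byte_size(Binary) ->
    {_, Tail} = split_binary(Binary, Offset),
    Tail.
```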

BR /Fredrik