[erlang-questions] Some comments on EEP 9 (binary: module)
Fredrik Svahn
fredrik.svahn@REDACTED
Sat Mar 8 17:56:37 CET 2008
Thanks for looking at the EEP. Please see comments below:
On Fri, Mar 7, 2008 at 4:32 AM, Richard A. O'Keefe <ok@REDACTED> wrote:
> 1. Given that the module for lists is called 'lists', not 'list',
>    it is rather confusing that the module for binaries is called
>    'binary', instead of the expected 'binaries'.
>
>    On the other hand, given that the module for strings is called
>    "string", maybe it's 'lists' that has the wrong name. Something
>    needs to be done about naming consistency in modules for data types.
I have noted the same thing and decided to make it like string, queue
and array rather than like lists or sets. Binary/binary_string has the
added advantage of also being slightly shorter than
binaries/binary_strings.
> 2. What do you return if you look for something and it isn't there?
>    For some reason, people seem to like returning an out-of-range index
>    at the wrong end. BASIC does this, Smalltalk does it, but that does
>    not make it right. Life gets *so* much simpler if
>    match(Haystack, Needle) returns an index past the *right* end of the
>    haystack. Suppose, for example, we have input with an optional
>    comment; we want to remove it.
>    [Mind you, EEP 9 is handicapped by starting from a system where
>    binary slicing uses just about the worst possible convention, but
>    that is another and sadder story.]
>    [Oh yes, the documentation for the erlang: module gets
>    erlang:split_binary/2 wrong. It says that the range for Pos is
>    1..size(Bin), but 0 is *rightly* allowed. Pos is actually the size
>    of the first part, which is just right.]
>
>    Example. Suppose we are given a line of text from some configuration
>    file as a binary. It might contain a # comment or it might not. Our
>    only interest is in getting rid of it. In a rational design, where
>    match(Haystack, Needle) returns the length of the longest prefix of
>    Haystack *not* containing Needle, we just do
>        {Wanted,_} = split_binary(Given, match(Given, <<"#">>))
>    With the scheme actually proposed, we have to do
>        case match(Given, <<"#">>)
>          of 0 -> Wanted = Given
>           ; N -> {Wanted,_} = split_binary(Given, N-1)
>        end
There are already two different return values in two different library
modules for the "not found scenario".
regexp:match/2 -> nomatch
string:[r]str/2 -> 0
string:[r]chr/2 -> 0
I am afraid that adding a third way of marking "not found" in a third
library is only going to add to the confusion.
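To make the trade-off concrete, here is a minimal sketch of the comment-stripping example under both conventions. It is only an illustration: prefix_len/2 and the module name are hypothetical and not part of EEP 9, and the sketch assumes the binary:match/2 found in current OTP releases (which returns {Start, Length} or nomatch), not the draft function discussed in this thread.

```erlang
-module(notfound_demo).
-export([prefix_len/2, strip_comment/1, strip_comment_zero/1]).

%% Hypothetical helper (not part of EEP 9): length of the longest prefix
%% of Haystack that does not contain Needle. When Needle is absent this
%% is byte_size(Haystack), i.e. an index past the right end.
prefix_len(Haystack, Needle) ->
    case binary:match(Haystack, Needle) of
        {Start, _Length} -> Start;
        nomatch -> byte_size(Haystack)
    end.

%% With the "past the right end" convention no case analysis is needed.
strip_comment(Line) ->
    {Wanted, _} = split_binary(Line, prefix_len(Line, <<"#">>)),
    Wanted.

%% With a 1-based index and 0 for "not found" (the string:str/2 style),
%% the caller has to branch on the result.
strip_comment_zero(Line) ->
    case string:str(binary_to_list(Line), "#") of
        0 -> Line;
        N -> {Wanted, _} = split_binary(Line, N - 1),
             Wanted
    end.
```

Both functions return the same result for any line; the difference is only in how much the caller has to write around the "not found" value.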
> 3. I appreciate that slicing binaries is supposed to be cheap, but I
>    still think it would be nice if match had a 3rd argument, saying how
>    many bytes at the beginning of Haystack to skip. If it weren't 4pm
>    on a Friday with my office floor still to tidy up, I could give
>    examples of why this can make life simpler.
I am not sure I understand. There is already a match function in the
EEP where you can specify where to start the matching.
<eep>
match(Haystack, Needles, {StartIndex, EndIndex}) -> Return
</eep>
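As a rough illustration of "skip the first N bytes", the following sketch wraps a ranged search. It does not use the EEP's {StartIndex, EndIndex} argument; it assumes the binary:match/3 of current OTP releases, whose {scope, {Start, Length}} option restricts the search to part of the subject. The names match_from/3 and match_from_demo are hypothetical.

```erlang
-module(match_from_demo).
-export([match_from/3]).

%% Hypothetical wrapper: search Haystack for Needle, ignoring the first
%% Skip bytes. Any position returned is a 0-based offset into Haystack
%% itself, not into the part that was searched.
match_from(Haystack, Needle, Skip)
  when is_binary(Haystack), Skip >= 0, Skip =< byte_size(Haystack) ->
    Scope = {Skip, byte_size(Haystack) - Skip},
    binary:match(Haystack, Needle, [{scope, Scope}]).
```

Because the result is still an offset into the whole binary, it can be passed straight to split_binary/2 without adding Skip back.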
> 4. I agree that the proposed binary:split/2 function is useful, but
>    the name is far too close to split_binary/2 for comfort. A longer
>    name such as
>        binaries:split_with_separator(Binary, Separator_Binary)
>    might make for less confusion. Better still, why not make this like
>    string:tokens/2, which really has exactly the same purpose except
>    for the data type it applies to?
The proposed split function is actually more like regexp:split/2 than
string:tokens/2. As you know string:tokens/2 takes a list of separator
chars and splits by any of the chars, which I usually find is *not*
what I want it to do. Thus naming it tokens will give the wrong
associations since it is already implemented in another way in the
string library.
string:tokens("cat and dog", "and").
["c","t "," ","og"]
regexp:split("cat and dog", "and").
{ok,["cat "," dog"]}
Then again, as the function is specified, you could mimic the behaviour
of tokens by giving binary:split/2 a list of binaries [<<"a">>,
<<"n">>, <<"d">>], although you would end up with some empty binaries
as well, just like you have the empty lists in regexp:split():
2> regexp:split("cat and dog","a|n|d").
{ok,["c","t ",[],[]," ","og"]}
Maybe split_with/2 or split_by/2 to keep it reasonably short?
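The difference between the two behaviours can be sketched on binaries as follows. This is an illustration only: it assumes the binary:split/3 of current OTP releases (with the global option), which is not necessarily what the draft specifies, and the function and module names are hypothetical.

```erlang
-module(split_demo).
-export([by_separator/2, by_any_char/2, tokens/2]).

%% Split on every occurrence of the whole separator binary
%% (the regexp:split/2-like behaviour).
by_separator(Bin, Sep) ->
    binary:split(Bin, Sep, [global]).

%% Split on any single byte of Chars by passing a list of one-byte
%% patterns. Adjacent separators produce empty binaries.
%% Chars must be non-empty, since an empty pattern list is rejected.
by_any_char(Bin, Chars) ->
    binary:split(Bin, [<<C>> || <<C>> <= Chars], [global]).

%% Dropping the empty parts gives the string:tokens/2-like behaviour.
tokens(Bin, Chars) ->
    [Part || Part <- by_any_char(Bin, Chars), Part =/= <<>>].
```

For <<"cat and dog">> and <<"and">>, by_separator/2 gives two parts, by_any_char/2 gives six parts including two empty binaries, and tokens/2 gives the four non-empty ones.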
> 5. Ever since I met SNOBOL 4, I have known the operation of removing
>    outer blanks from a string as trimming. It's a little odd to find
>    it called stripping. By analogy with ecdysiasis (hem hem) I would
>    expect stripping to remove visible outer stuff. I wish
>    string:strip/[1,2,3] could be renamed.
This is analogous to string:strip/1. While I agree trim is a better
name, I think that calling it trim in one library and strip in another
will only add confusion and strange questions to erlang-questions in
the future.
>
> 6. In the functions
> unsigned_to_bin/1
> bin_to_unsigned/1
> why is the word "binary" abbreviated to "bin"?
Agreed. I will change bin to binary.
> 7. "nth" is a very strange name for a substring operation.
>    I would prefer
>        subbinary(Binary, Offset[, Length])
>        0 =< Offset =< byte_size(Binary)
>        0 =< Length =< byte_size(Binary) - Offset
>    which would make this compatible with the existing
>        split_binary(Binary, Offset)
>    function.
Agreed, I was not pleased with the name myself. It is a leftover from
the first version where it picked out one byte at the nth position
(similar to lists:nth/2 operating on a string). I am planning (as per
the suggestions here on the mailing list) to split the module into
two, where one should be called binary_string or similar. In that
module it should be named and behave like string:sub_string/2,3.
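For reference, the 0-based subbinary(Binary, Offset[, Length]) suggested in point 7 could be sketched with plain binary pattern matching as below. This is only a sketch of that suggestion with the stated bounds; the module name is hypothetical, and it is not the string:sub_string-style function planned for the binary_string module.

```erlang
-module(subbinary_demo).
-export([subbinary/2, subbinary/3]).

%% Everything from Offset to the end, i.e. the second element of
%% split_binary(Bin, Offset).
subbinary(Bin, Offset) when is_binary(Bin) ->
    subbinary(Bin, Offset, byte_size(Bin) - Offset).

%% Length bytes starting at the 0-based Offset. The guard enforces
%% 0 =< Offset and 0 =< Length =< byte_size(Bin) - Offset; anything
%% else fails with function_clause.
subbinary(Bin, Offset, Length)
  when is_binary(Bin), Offset >= 0, Length >= 0,
       Offset + Length =< byte_size(Bin) ->
    <<_:Offset/binary, Sub:Length/binary, _/binary>> = Bin,
    Sub.
```

With these bounds an Offset equal to byte_size(Bin) is allowed and yields an empty binary, which matches how split_binary/2 treats its position argument.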
BR /Fredrik