[eeps] EEP 31

Mon Dec 14 10:01:01 CET 2009

Hi!

Obviously I was getting some things very wrong! I'm commenting below!

On Fri, 11 Dec 2009, Robert Virding wrote:

> Comments follow.
>
> 2009/12/10 Patrik Nyblom <pan@REDACTED>
>
>>
>> - I understand that giving list of binaries for a pattern means that they
>>> are alternatives. I could not see where this was stated and I wonder if it
>>> not a little confusing?
>>>
>>
>>
>> I can't really see what they would otherwise be - what would be the
>> alternative interpretation? I could clarify that of course.
>>
>
> Well, the only (?) other place where you interpret lists of binaries is in
> io_lists, and there it
> means the concatenation of the binaries. This is what I thought of when I
> first saw it, though I couldn't understand why you didn't just have one
> binary. :-)
>

Ah! Yes of course - you're right. Didn't think of that interpretation. 
I'll clarify that!
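
To make the two readings concrete, here is a minimal sketch. It assumes
the binary:match/2 interface as proposed in the EEP; the module name
alt_demo and its function names are made up for illustration only.

```erlang
-module(alt_demo).
-export([as_iolist/0, as_alternatives/0]).

%% iolist reading: the list is one piece of data, the concatenation
%% of its elements. This returns <<"abcd">>.
as_iolist() ->
    iolist_to_binary([<<"ab">>, <<"cd">>]).

%% Pattern reading (assumed EEP behaviour): each element is an
%% alternative search string, and the result is {Start, Length} of
%% the first hit in the subject. <<"cd">> is found at offset 2.
as_alternatives() ->
    binary:match(<<"xxcdab">>, [<<"ab">>, <<"cd">>]).
```

With the concatenation reading the second call would instead have to
search for <<"abcd">>, which does not occur in the subject at all.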

> - Does 're' always return the first *shortest* match? I thought that it
>>> returned the first matching alternative irrespective of length. Could be
>>> wrong though. Always returning the shortest is very restrictive.
>>>
>>
>> If you are to return non-overlapping matches and want to return the first
>> match, you could either select the longest or the shortest at the first
>> matching point. Without an option to control that, you of course have to
>> select one and be consistent. The "shortest" behaviour is consistent with
>> re, but not with regexp, which uses the other strategy. The regexp module
>> does therefore not return the largest number of non overlapping matches
>> possible, as opposed to re. I think the re behaviour is the best and have
>> chosen that for the binary module.
>>
>
> No, no, no. First 'regexp' follows POSIX and always returns the longest
> match where possible; there is no way of affecting this. The 're' module
> does the same as Perl, being PCRE, and it *does* take the lexically first
> alternative in a match pattern. This gives you some control, Friedl shows
> how you can use it. So from your example:
>
> 1> re:run("abc","ab|abc",[global]).
> {match,[[{0,2}]]}
> 2> re:run("abc","abc|ab",[global]).
> {match,[[{0,3}]]}                                 <===
> 3> re:run("abc","ab|abc|c",[global]).
> {match,[[{0,2}],[{2,1}]]}
> 4> re:run("abc","abc|ab|c",[global]).
> {match,[[{0,3}]]}                                 <===
>
> I think that if you are going to have alternative binaries then it would be
> best to choose the same way. There will be a difficulty in explaining how
> you mean first of the first, pretty much like trying to explain how you pick
> messages off the message queue in a receive with multiple patterns.
>
> - This is the first time we have a copy function.

Yes! You are oh so right and I'm oh so wrong! I now realize that the
regexp module follows the awk convention, and now that I've looked into
it more, I also think that would be the "right" way to do it. Sorry I
doubted you - I will try to not make the same mistake again :)

Surprisingly enough (?), I didn't realize that re used the lexical order;
I hadn't examined it thoroughly enough, it seems. That's the problem with
adopting someone else's library, I suppose. It would have helped if I didn't
loathe Perl too... I'll write "Don't assume - make sure!" 1000 times on my
whiteboard!!!

So, I have been thinking of the patterns as sets rather than lists of
alternatives. This makes (in my head at least) the lexical order awkward.
I'd like to rethink re compatibility in this case. If I choose the longest
match instead, what would the implications be? You obviously don't get all
possible non-overlapping matches, and you're not really getting the same
matches as with re, but you *will* get the same as in regexp and as one
got from the "ancient masters". I think that would be a better choice.
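
A minimal sketch of the difference, assuming the binary:matches/2
interface from the EEP combined with the longest-match rule suggested
here. The module name longest_demo and its function names are made up
for illustration.

```erlang
-module(longest_demo).
-export([lexical/0, longest/0]).

%% re (PCRE): the lexically first alternative that matches wins, so
%% the order in the pattern matters. Per the shell session quoted
%% earlier, these give {match,[[{0,2}]]} and {match,[[{0,3}]]}.
lexical() ->
    {re:run("abc", "ab|abc", [global]),
     re:run("abc", "abc|ab", [global])}.

%% Set semantics (assumed): at the first matching position the longest
%% alternative wins, so the order of the list is insignificant. Both
%% calls should give [{1,4}]; the shorter <<"bc">> would have left
%% room for one more match, <<"de">>, but is not chosen.
longest() ->
    {binary:matches(<<"abcde">>, [<<"bc">>, <<"bcde">>, <<"de">>]),
     binary:matches(<<"abcde">>, [<<"bcde">>, <<"bc">>, <<"de">>])}.
```

The second function also shows the cost mentioned above: with the
longest match you get one match where the shortest would give two.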

>>>
>>
>>
>> Yes. The usage is twofold and not really attractive. The need for such a
>> "cloning" function is however obvious from discussions on erlang-questions.
>>
>
> One thing that worries me with having a "cloning" function like copy, and
> the other functions like referenced_byte_size is that with them you try to
> make global decisions with local data. If you get it wrong it can go so
> horribly wrong.
>

Yes - I agree. However, the need to avoid sharing in certain situations is
pressing... I don't like this kind of interface, but it sure seems needed,
and I prefer them to be spelled out instead of implicit in some function
like list_to_binary...
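
As a rough sketch of the intended use, assuming the copy/1, copy/2 and
referenced_byte_size/1 functions as proposed in the EEP (the module
name copy_demo and its function name are made up for illustration):

```erlang
-module(copy_demo).
-export([keep_small_part/0]).

%% Keep a 10-byte piece of a large binary without holding on to the
%% whole thing. Returns {Size, Before, After}, where Before and After
%% are the referenced sizes reported for the slice and for its copy.
keep_small_part() ->
    Big = binary:copy(<<1>>, 1000),            % 1000 bytes of data
    <<Small:10/binary, _/binary>> = Big,       % slice, may share Big
    Before = binary:referenced_byte_size(Small),
    Copy = binary:copy(Small),                 % explicit, unshared copy
    After = binary:referenced_byte_size(Copy),
    {byte_size(Copy), Before, After}.
```

The actual numbers reported for Before and After depend on how the
runtime represents the slice, which is exactly why a decision based on
them is a local guess about a global situation.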

>
>>
>>> - I am really serious about numbering from 0.
>>>
>>
>>
>> Understood. I won't however change that unless you buy Ericsson or OTP and
>> enforce new design rules ;) ... In which case I would not change it anyway
>> because you would fire me on the instant I suspect :D
>
>
> Well, be careful I have a few shares and I will start saving money so I can
> buy up the company! :-)

:D

>
>
>> To summarize:
>>
>> - I'll clarify that multiple search strings mean they are alternatives.
>> - I'll think about an option for getting the longest matches instead of
>>  the shortest when having multiple alternatives if that is really
>>  interesting to anyone. What do others think? - Any suggestion for other
>> interface/interfaces than the copy-functions?
>>  Including the spelled out twofold functionality?
>>
>
> Here take the first alternative as in 're' which takes the lexically first,
> see example.
>

I'd rather take the longest, as I wrote earlier - I would like the 
alternatives to be viewed as a set.

>
>> - Zero based indices will not change, although we fully understand the
>>  objection.
>>
>
> Then I think you should have as a goal to make all binary/bitstring
> functions be 0-based. One way is to deprecate the BIFs in 'erlang' and
> replace them, either there or in 'binary'.
>
> The first to go should be binary_to_list/3 which uses positions so you can't
> extract a zero length binary. Have bin_to_list(Binary, Start, Length)
> instead!

Yes. I fully and wholeheartedly agree! binary:bin_to_list/3 might be
better than binary:to_list/3, as one might easily expect
binary:to_list to be an alias for (erlang:)binary_to_list...
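
A small sketch of the two conventions side by side, assuming the
bin_to_list(Binary, Start, Length) signature suggested above ends up in
the binary module. The module name index_demo and its function names
are made up for illustration.

```erlang
-module(index_demo).
-export([one_based/0, zero_based/0, empty/0]).

%% Existing BIF: 1-based start and stop positions, both inclusive.
%% Bytes 2..4 of "abcdef" are "bcd".
one_based() ->
    binary_to_list(<<"abcdef">>, 2, 4).

%% Suggested function: 0-based start plus a length. Same "bcd".
zero_based() ->
    binary:bin_to_list(<<"abcdef">>, 1, 3).

%% With start plus length, a zero length part is simply [].
empty() ->
    binary:bin_to_list(<<"abcdef">>, 1, 0).
```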

>
>
>> - The suggested list conversion functions will be added (if noone
>>  objects strongly, with good, valid and sound arguments :)).
>> - Unless a large part of the community is for a change of the name sub
>>  (for instead of replace i suppose you mean, not instead of part?), I
>>  will not change that.
>>
>
> In this case I was thinking more in the lines of subbin instead of part. But
> I would prefer sub/gsub instead of replace as well. :-)

Yes, that would be nice - the problem with "sub" is that it can stand for so
many things, and that it's so tightly coupled to the implementation concept
of sub-binaries.

Anyone else have some suggestions? - I hate naming issues :)

>
> Robert
>

So - new summary:
- I'll definitely clarify that the list of binaries is not a 
concatenation, but a *set* of alternatives.
- I would like to change to longest match, not lexically first, as I would
like the order of the binaries in the list to be insignificant. I think
they should be viewed as a set. This fits with the old regexp module and
the original regular expression applications (like awk) too.
- I'll add the list conversion functions in this module (as before), but 
will deprecate the old binary_to_list/3 as well, as zero-based indexing 
is the rule.
- I think the intrusive functions (copy and referenced_byte_size) need to 
be there, but a large caution blob in the manual page is needed.

Thanks for really valuable input!

Cheers,
/Patrik