[erlang-bugs] Bug with named subpatterns in re module

Thu Mar 28 17:13:25 CET 2013

On 03/28/2013 12:59 PM, Sergei Golovan wrote:
> Hi!
>
> On Thu, Mar 28, 2013 at 3:35 PM, Patrik Nyblom <pan@REDACTED> wrote:
>> I'm unsure of the nature of this bug. What are you actually expecting as a
>> return when you use duplicate names and named capture? Both instances of the
>> name, "the right instance" of the name or a badarg?
> At least the results should not depend on the pattern names.
No, definitely not - the results now are more or less random, so
something needs to be done.
>
> When I run the following Perl script:
>
> #! /usr/bin/perl
>
> $var = 'bar';
> $var =~ m/^(?<a>foo)(?<b>bla)$|^(?<a>[[:word:]]+)$/;
> pplus();
> $var =~ m/^(?<b>foo)(?<a>bla)$|^(?<b>[[:word:]]+)$/;
> pplus();
>
> sub pplus {
>      foreach (keys %+) {
>          print "$_: $+{$_}\n";
>      }
> }
>
> It prints the following:
>
> a: bar
> b: bar
>
> Which means that it captures the only matching pattern. Perl docs say
> that in case of duplicate names the leftmost matched one is captured.
> I would say that the less the difference in behavior in re and the
> original Perl regexp the better.
Okay, thanks for explaining!

Leftmost matching might be doable: pcre_get_stringtable_entries can be
used, and we could then extract the first entry for that name that is
bound. We now use pcre_get_stringnumber, which gives an arbitrary
instance of that name and should not be used with dupnames.
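
A minimal sketch of that selection, written in Python rather than the C
of the re driver, with the name table and the capture vector modelled as
plain lists. The function name leftmost_bound and the data layout are
assumptions for illustration only; they are not the PCRE API, and
"leftmost" is taken here to mean the lowest-numbered instance that is set.

```python
from typing import List, Optional, Tuple

# Hypothetical model: the name table is a list of (name, group_number)
# pairs, one per named group. The capture vector holds one (start, end)
# pair per group, with (-1, -1) for a group that did not take part.
NameTable = List[Tuple[str, int]]
Captures = List[Tuple[int, int]]


def leftmost_bound(name: str, table: NameTable, captures: Captures,
                   subject: str) -> Optional[str]:
    """Return the text of the lowest-numbered set group called `name`."""
    for group in sorted(n for entry, n in table if entry == name):
        start, end = captures[group]
        if start >= 0:  # this instance took part in the match
            return subject[start:end]
    return None  # no instance of the name is bound


# Pattern from this thread: ^(?<b>foo)(?<a>bla)$|^(?<b>[[:word:]]+)$
table = [("b", 1), ("a", 2), ("b", 3)]
# Subject "bar": only group 3 is set; index 0 is the whole match.
captures = [(0, 3), (-1, -1), (-1, -1), (0, 3)]

result_a = leftmost_bound("a", table, captures, "bar")
result_b = leftmost_bound("b", table, captures, "bar")
```

With this rule the name b resolves to the second alternative's group, and
a stays unbound, whichever names the pattern happens to use.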

What about 'all' then? It returns all bound indexes and will possibly
return the duplicate name's binding twice, once as [] and once as "bar"
(in your example). Should it skip a binding where the same name is bound
later, or should it return them all, as it does now? 'all' kind of means
"all indexes" rather than "all names". Should we add "all_names" to get
the behavior that you demonstrate in your Perl program? Or maybe just
let 'all' be as is and only fix the case where you specifically list
names... Hmmm - thoughts?
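
To make the two readings concrete, here is a small Python model (not the
re implementation). The helpers all_indexes and all_names are
hypothetical, the whole-match entry is left out, and "first set instance
wins" is only the assumed Perl-like rule.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical model of one match of
# ^(?<a>foo)(?<b>bla)$|^(?<a>[[:word:]]+)$ against "bar":
# (group_number, name, captured_text), with None for an unset group.
Group = Tuple[int, str, Optional[str]]
groups: List[Group] = [
    (1, "a", None),
    (2, "b", None),
    (3, "a", "bar"),
]


def all_indexes(gs: List[Group]) -> List[Optional[str]]:
    """One result per group index, so a duplicated name shows up twice."""
    return [text for _, _, text in gs]


def all_names(gs: List[Group]) -> Dict[str, Optional[str]]:
    """One result per distinct name; the first set instance wins."""
    out: Dict[str, Optional[str]] = {}
    for _, name, text in gs:
        if out.get(name) is None:
            out[name] = text
    return out


by_index = all_indexes(groups)
by_name = all_names(groups)
```

Here by_index has three entries (two of them for a), while by_name has one
entry per name.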
>
>> I.e. would you like
>>
>>
>> re:run("bar", "^(?<b>foo)(?<a>bla)$|^(?<b>[[:word:]]+)$",[dupnames,
>> {capture, [a, b], list}]).
>>
>> to give the same result as:
>>
>> re:run("bar", "^(?<b>foo)(?<a>bla)$|^(?<c>[[:word:]]+)$",[dupnames,
>> {capture, [a, b, c], list}]).
>>
>> ? Or return the second instance if that matches, but the first instance if
>> that one matches? Or should we simply not allow it? The thing is that even
>> with dupnames, you have a varying amount of subexpressions. Capturing 'all'
>> (or rather 'all_but_first') will show you that this call returns three
>> distinct subexpressions, of which two happen to have the same name
>> (regardless of the names). If the part before | matches, the result is only
>> two subexpressions, as the first two subexpressions match. No duplicate
>> naming will change this. There is no real "select the one that matches"
>> functionality in giving two subexpressions the same name.
>>
>> PCRE just picks one of the occurrences of a name when you ask for it - in
>> your last example the occurrence you were not expecting, but that's more or
>> less random, the first example would give unexpected results if the first
>> part matched. PCRE has no functionality to pick all occurrences of a name,
>> but that could of course be changed if there was some understandable
>> semantics that should be implemented. I think badarg exception is the way to
>> go though...
> Well, the re manpage says that dupnames is helpful when it's
> certain that two subpatterns with the same name can't be matched
> simultaneously. Fortunately, the regexp under consideration falls in this
> category. So, I guess that either dupnames has to be removed altogether,
> or something should be done with it.
Funny that I wrote that, when I very well knew that the PCRE APIs I
used did not work with dupnames :)

Well, removing dupnames might be the easiest, but as there are Perl
semantics we can imitate, I think we should give it a try!
>
> Cheers!
Cheers,
/Patrik