[erlang-bugs] Bug with named subpatterns in re module

Sergei Golovan sgolovan@REDACTED
Thu Mar 28 12:59:04 CET 2013


Hi!

On Thu, Mar 28, 2013 at 3:35 PM, Patrik Nyblom <pan@REDACTED> wrote:
>
> I'm unsure of the nature of this bug. What are you actually expecting as a
> return when you use duplicate names and named capture? Both instances of the
> name, "the right instance" of the name or a badarg?

At the very least, the results should not depend on which names the
subpatterns are given.

When I run the following Perl script:

#! /usr/bin/perl

$var = 'bar';
$var =~ m/^(?<a>foo)(?<b>bla)$|^(?<a>[[:word:]]+)$/;
pplus();
$var =~ m/^(?<b>foo)(?<a>bla)$|^(?<b>[[:word:]]+)$/;
pplus();

sub pplus {
    foreach (keys %+) {
        print "$_: $+{$_}\n";
    }
}

It prints the following:

a: bar
b: bar

This means that Perl captures whichever subpattern actually took part in
the match. The Perl docs say that when names are duplicated, the leftmost
matched one is captured. I would say that the smaller the difference in
behavior between re and the original Perl regexps, the better.
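
To make the "leftmost matched one" rule concrete, here is a minimal sketch
in Python. It is only an emulation: Python's re module rejects duplicate
group names, so the aliases a1, a2 and b1 and the helper names below are
hypothetical stand-ins for the duplicated names, and \w+ stands in for
[[:word:]]+. It is not how re or PCRE is implemented.

```python
import re

# Python's re rejects duplicate group names, so each duplicate gets a
# hypothetical alias (a1, a2). ALIASES lists them in left-to-right order.
PATTERN = re.compile(r"^(?P<a1>foo)(?P<b1>bla)$|^(?P<a2>\w+)$")
ALIASES = {"a": ["a1", "a2"], "b": ["b1"]}


def leftmost_matched(match, aliases):
    """Return {name: text}, taking the leftmost alias that took part in the match."""
    result = {}
    for name, group_names in aliases.items():
        for group_name in group_names:
            value = match.group(group_name)
            if value is not None:  # None means the group did not participate
                result[name] = value
                break
    return result


def capture(subject):
    """Match subject against PATTERN and resolve the duplicated names."""
    match = PATTERN.search(subject)
    return leftmost_matched(match, ALIASES) if match else None
```

With this rule, capture("bar") yields only the name a, bound to "bar", and
renaming the groups changes the keys of the result but not the captured text.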

>
> I.e. would you like
>
>
> re:run("bar", "^(?<b>foo)(?<a>bla)$|^(?<b>[[:word:]]+)$",[dupnames,
> {capture, [a, b], list}]).
>
> to give the same result as:
>
> re:run("bar", "^(?<b>foo)(?<a>bla)$|^(?<c>[[:word:]]+)$",[dupnames,
> {capture, [a, b, c], list}]).
>
> ? Or return the second instance if that matches, but the first instance if
> that one matches? Or should we simply not allow it? The thing is that even
> with dupnames, you have a varying amount of subexpressions. Capturing 'all'
> (or rather 'all_but_first') will show you that this call returns three
> distinct subexpressions, of which two happen to have the same name
> (regardless of the names). If the part before | matches, the result is only
> two subexpressions, as the first two subexpressions match. No duplicate
> naming will change this. There is no real "select the one that matches"
> functionality in giving two subexpressions the same name.
>
> PCRE just picks one of the occurrences of a name when you ask for it - in
> your last example the occurrence you were not expecting, but that's more or
> less random, the first example would give unexpected results if the first
> part matched. PCRE has no functionality to pick all occurrences of a name,
> but that could of course be changed if there was some understandable
> semantics that should be implemented. I think badarg exception is the way to
> go though...

Well, the re manpage says that dupnames is useful when it is certain
that two subpatterns with the same name can never match simultaneously.
Fortunately, the regexp under discussion falls into this category. So I
guess that either dupnames has to be removed altogether, or it has to be
made to work for this case.
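
The precondition that same-named subpatterns never match simultaneously can
be spot-checked on sample subjects. The sketch below is in Python and is an
assumption-laden illustration only: Python's re does not accept duplicate
names, so each duplicate is written with a hypothetical alias (a1, a2), and
the function name is made up for this example. Passing the check on a few
subjects does not prove the property for all inputs.

```python
import re


def simultaneous_duplicates(pattern, aliases, subjects):
    """Map each logical name to the subjects where more than one alias matched.

    aliases maps a logical name to the hypothetical alias groups that
    stand in for its duplicates, e.g. {"a": ["a1", "a2"]}.
    """
    compiled = re.compile(pattern)
    clashes = {}
    for subject in subjects:
        match = compiled.search(subject)
        if match is None:
            continue  # no match at all, nothing to check
        for name, group_names in aliases.items():
            # Groups that did not participate report None.
            hits = [g for g in group_names if match.group(g) is not None]
            if len(hits) > 1:
                clashes.setdefault(name, []).append(subject)
    return clashes
```

For the alternation discussed in this thread the result is empty for every
subject, because the two branches cannot match at the same time. A pattern
such as (?P<a1>\w)(?P<a2>\w) would be reported as clashing on "xy".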

Cheers!
-- 
Sergei Golovan


