[erlang-questions] run strange behaviour

Thu Oct 24 18:01:38 CEST 2013

Thanks for the explanation.

2013/10/24 Vyacheslav Levytskyy <v.levytskyy@REDACTED>

>  Agreed, this is similar to what I wrote: the initial regex was the
> problem. The exact value of the limit is not so important; from the
> documentation we know that one exists ("maximum number of permitted
> times"), and in any case it is the regex that should be fixed.
>
> Vyacheslav
>
>
> On 24.10.2013 16:36, Erik Søe Sørensen wrote:
>
>  Being greedy or not shouldn't change whether the regex matches or not.
>  I believe the issue is something else...:
>  It's bad practise to have repetitions of something that matches the empty
> string - such as (\w* *)* - because that could be repeated any number of
> times.
>  Indeed, the original regex runs pretty slowly:
>
> 8> timer:tc(re, run, [<<"foo bar is a foo bar is a big yellow boat or">>,
> <<"^foo (\\w+(\\w* *)*) is a (\\w+(\\w* *)*)">>, [global, {capture, [1,3],
> binary}]]).
> {1052818,
>  {match,[[<<"bar is a foo bar">>,<<"big yellow boat or">>]]}}
>
>  My guess, therefore, is that the regexp times out/reaches some
> run-duration limit on the longer input string.
>
>  Fixing the regex to not have repetitions of something that matches ""
> helps both the run time:
>
> 9> timer:tc(re, run, [<<"foo bar is a foo bar is a big yellow boat or">>,
> <<"^foo (\\w+ *(\\w+ *)*) is a (\\w+ *(\\w+ *)*)">>, [global, {capture,
> [1,3], binary}]]).
> {14315,
>  {match,[[<<"bar is a foo bar">>,<<"big yellow boat or">>]]}}
>  % 74x faster :-)
>
>  and the result for the longer input string:
>
> 10> timer:tc(re, run,
> [<<"foo bar is a foo bar is a big yellow boat or sub">>,
> <<"^foo (\\w+ *(\\w+ *)*) is a (\\w+ *(\\w+ *)*)">>,
> [global, {capture, [1,3], binary}]]).
> {49911,
>  {match,[[<<"bar is a foo bar">>,
>           <<"big yellow boat or sub">>]]}}
>
> These are good results.

BTW, I found a more efficient regex:

1> timer:tc(fun() -> re:run(
     <<"foo bar is a foo bar is a big yellow boat or sub">>,
     <<"^foo (\\w+(\\w| )*) is a (\\w+(\\w| )*)$">>,
     [global, {capture, [1,3], binary}]) end).
{27190,
 {match,[[<<"bar is a foo bar">>, <<"big yellow boat or sub">>]]}}

2> timer:tc(fun() -> re:run(
     <<"foo bar is a foo bar is a big yellow boat or sub">>,
     <<"^foo (\\w+(\\w| )*) is a (\\w+(\\w| )*)$">>,
     [global, {capture, [1,3], binary}]) end).
{143,
 {match,[[<<"bar is a foo bar">>, <<"big yellow boat or sub">>]]}}

3> timer:tc(fun() -> re:run(
     <<"foo bar is a foo bar is a big yellow boat or">>,
     <<"^foo (\\w+(\\w| )*) is a (\\w+(\\w| )*)$">>,
     [global, {capture, [1,3], binary}]) end).
{138,
 {match,[[<<"bar is a foo bar">>,<<"big yellow boat or">>]]}}
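
Erik's guess that the nomatch comes from an internal limit, rather than from
a genuine failure to match, can be checked with the documented report_errors
option of re:run/3. A minimal sketch for the shell follows; the variable
names Subject, Slow and Opts are only illustrative, and no output is shown
because it may depend on the OTP release:

```erlang
%% Paste into the Erlang shell. Subject, Slow and Opts are illustrative names.
Subject = <<"foo bar is a foo bar is a big yellow boat or sub">>.
Slow = <<"^foo (\\w+(\\w* *)*) is a (\\w+(\\w* *)*)">>.
Opts = [global, {capture, [1,3], binary}].

%% Without report_errors, a run that hits an internal limit is reported
%% as a plain nomatch, indistinguishable from a real non-match.
re:run(Subject, Slow, Opts).

%% With report_errors, run-time errors come back as an error tuple,
%% e.g. {error, match_limit} or {error, match_limit_recursion}.
re:run(Subject, Slow, [report_errors | Opts]).
```

If the second call returns an error tuple where the first returned nomatch,
the limit explanation holds for that input.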

>
>  /Erik
>
>
>  2013/10/24 Vyacheslav Levytskyy <v.levytskyy@REDACTED>
>
>>  Hello,
>>
>> According to the 're' module documentation, "the quantifiers are "greedy",
>> that is, they match as much as possible (up to the maximum number of
>> permitted times)". This seems to be the problem in your case. The regex you
>> are using is problematic because it forces 're' into exhaustive repetition.
>>
>> Alternatively, you can use the 'ungreedy' option and make only some of the
>> quantifiers greedy by following them with "?". See for example:
>> re:run(<<"foo bar is a foo bar is a big yellow boat or sub">>,
>> <<"^foo (\\w(\\w+| )*) is a (\\w(\\w+?| )*?)">>,
>> [ungreedy, global, {capture, [1,3], binary}]).
>> {match,[[<<"bar">>,
>>          <<"foo bar is a big yellow boat or sub">>]]}
>>
>> Best regards,
>> Vyacheslav Levytskyy
>>
>>
>> On 23.10.2013 22:26, Alexander Petrovsky wrote:
>>
>>  Hi!
>>
>>  I have the regex "^foo (\\w+(\\w* *)*) is a (\\w+(\\w* *)*)", and I
>> get strange behaviour when I do:
>>
>>  1> re:run(<<"foo bar is a foo bar is a big yellow boat or">>,
>> <<"^foo (\\w+(\\w* *)*) is a (\\w+(\\w* *)*)">>,
>> [global, {capture, [1,3], binary}]).
>> {match,[[<<"bar is a foo bar">>,<<"big yellow boat or">>]]}
>>
>>  2> re:run(<<"foo bar is a foo bar is a big yellow boat or sub">>,
>> <<"^foo (\\w+(\\w* *)*) is a (\\w+(\\w* *)*)">>, [global, {capture, [1,3],
>> binary}]).
>> nomatch
>>
>>  I tested this regexp in clojure and python:
>>
>>  => (re-matches #"foo (\w+(\w* *)*) is a (\w+(\w* *)*)"
>> "foo bar is a foo bar is a big yellow boat or")
>> ["foo bar is a foo bar is a big yellow boat or" "bar is a foo bar" ""
>> "big yellow boat or" ""]
>>
>>  => (re-matches #"foo (\w+(\w* *)*) is a (\w+(\w* *)*)"
>> "foo bar is a foo bar is a big yellow boat or sub")
>> ["foo bar is a foo bar is a big yellow boat or sub" "bar is a foo bar" ""
>> "big yellow boat or sub" ""]
>>
>>  >>> import re
>> >>> p = re.compile('foo (\w+(\w* *)*) is a (\w+(\w* *)*)')
>> >>> p.match("foo bar is a foo bar is a big yellow boat or")
>> <_sre.SRE_Match object at 0x100293c00>
>> >>> p.match("foo bar is a foo bar is a big yellow boat or sub")
>> <_sre.SRE_Match object at 0x100293ab0>
>>
>>  Can someone explain why I get nomatch on the second string, "foo bar
>> is a foo bar is a big yellow boat or sub"? Is this a bug?
>>
>>
>>  --
>> Петровский Александр / Alexander Petrovsky,
>>
>> Skype: askjuise
>> Jabber: juise@REDACTED
>> Phone: +7 914 8 820 815 (irkutsk)
>>
>>
>>
>> _______________________________________________
>> erlang-questions mailing list
>> erlang-questions@REDACTED
>> http://erlang.org/mailman/listinfo/erlang-questions
>>
>>
>
>

-- 
Петровский Александр / Alexander Petrovsky,

Skype: askjuise
Jabber: juise@REDACTED
Phone: +7 914 8 820 815 (irkutsk)