[erlang-questions] run strange behaviour
Alexander Petrovsky
askjuise@REDACTED
Thu Oct 24 18:01:38 CEST 2013
Thanks for the explanation.
2013/10/24 Vyacheslav Levytskyy <v.levytskyy@REDACTED>
> I agree, it is similar to what I wrote: the initial regex was the
> problem. The exact value of the limit is not so important; from the
> documentation we know that it exists ("maximum number of permitted times"),
> and in any case it is the regex that should be fixed.
>
> Vyacheslav
>
>
> On 24.10.2013 16:36, Erik Søe Sørensen wrote:
>
> Being greedy or not shouldn't change whether the regex matches or not.
> I believe the issue is something else...:
> It's bad practice to have repetitions of something that matches the empty
> string - such as (\w* *)* - because that could be repeated any number of
> times.
> Indeed, the original regex runs pretty slowly:
>
> 8> timer:tc(re, run, [<<"foo bar is a foo bar is a big yellow boat or">>,
> <<"^foo (\\w+(\\w* *)*) is a (\\w+(\\w* *)*)">>, [global, {capture, [1,3],
> binary}]]).
> {1052818,
> {match,[[<<"bar is a foo bar">>,<<"big yellow boat or">>]]}}
>
> My guess, therefore, is that the regexp times out/reaches some
> run-duration limit on the longer input string.
>
> Fixing the regex to not have repetitions of something that matches ""
> helps both the run time:
>
> 9> timer:tc(re, run, [<<"foo bar is a foo bar is a big yellow boat or">>,
> <<"^foo (\\w+ *(\\w+ *)*) is a (\\w+ *(\\w+ *)*)">>, [global, {capture,
> [1,3], binary}]]).
> {14315,
> {match,[[<<"bar is a foo bar">>,<<"big yellow boat or">>]]}}
> % 74x faster :-)
>
> and the result for the longer input string:
>
> 10> timer:tc(re, run, [<<"foo bar is a foo bar is a big yellow boat or sub">>,
> <<"^foo (\\w+ *(\\w+ *)*) is a (\\w+ *(\\w+ *)*)">>,
> [global, {capture, [1,3], binary}]]).
> {49911,
> {match,[[<<"bar is a foo bar">>,
> <<"big yellow boat or sub">>]]}}
>
> These are good results.
BTW, I found a more efficient regex:
1> timer:tc(fun() -> re:run(<<"foo bar is a foo bar is a big yellow boat or sub">>,
<<"^foo (\\w+(\\w| )*) is a (\\w+(\\w| )*)$">>,
[global, {capture, [1,3], binary}]) end).
{27190,
{match,[[<<"bar is a foo bar">>, <<"big yellow boat or sub">>]]}}
2> timer:tc(fun() -> re:run(<<"foo bar is a foo bar is a big yellow boat or sub">>,
<<"^foo (\\w+(\\w| )*) is a (\\w+(\\w| )*)$">>,
[global, {capture, [1,3], binary}]) end).
{143,
{match,[[<<"bar is a foo bar">>, <<"big yellow boat or sub">>]]}}
3> timer:tc(fun() -> re:run(<<"foo bar is a foo bar is a big yellow boat or">>,
<<"^foo (\\w+(\\w| )*) is a (\\w+(\\w| )*)$">>,
[global, {capture, [1,3], binary}]) end).
{138,
{match,[[<<"bar is a foo bar">>,<<"big yellow boat or">>]]}}
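
To check the guess about a run-duration limit, one could ask re:run/3 to
report run-time errors instead of folding them into nomatch. This is only a
minimal sketch for the Erlang shell, assuming the report_errors option of
re:run/3 is available in the release in use; the variable names are just for
illustration, and the subject and pattern are the ones from the original
question.

```erlang
%% Subject and pattern from the original question (the slow regex).
Subject = <<"foo bar is a foo bar is a big yellow boat or sub">>.
Slow = <<"^foo (\\w+(\\w* *)*) is a (\\w+(\\w* *)*)">>.
Opts = [global, {capture, [1,3], binary}].

%% Without report_errors, a run-time error inside the regex engine
%% is returned as a plain nomatch.
re:run(Subject, Slow, Opts).

%% With report_errors, such errors are returned as {error, Reason},
%% so a limit problem can be told apart from a genuine nomatch.
re:run(Subject, Slow, [report_errors | Opts]).
```

If the guess is right, the last call should return an error tuple rather than
nomatch. Either way, rewriting the regex is still the real fix.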
>
> /Erik
>
>
> 2013/10/24 Vyacheslav Levytskyy <v.levytskyy@REDACTED>
>
>> Hello,
>>
>> According to the 're' module documentation, "the quantifiers are "greedy",
>> that is, they match as much as possible (up to the maximum number of
>> permitted times)". This seems to be the problem in your case. The regex you
>> are using is a bit problematic, forcing 're' into exhaustive repetition.
>>
>> As an option, you can use the 'ungreedy' option, which makes quantifiers
>> non-greedy by default, so that only those followed by "?" are greedy.
>> See for example:
>> re:run(<<"foo bar is a foo bar is a big yellow boat or sub">>,
>> <<"^foo (\\w(\\w+| )*) is a (\\w(\\w+?| )*?)">>,
>> [ungreedy, global, {capture, [1,3], binary}]).
>> {match,[[<<"bar">>,
>> <<"foo bar is a big yellow boat or sub">>]]}
>>
>> Best regards,
>> Vyacheslav Levytskyy
>>
>>
>> On 23.10.2013 22:26, Alexander Petrovsky wrote:
>>
>> Hi!
>>
>> I have the regex "^foo (\\w+(\\w* *)*) is a (\\w+(\\w* *)*)", and I
>> get strange behaviour when I do:
>>
>> 1> re:run(<<"foo bar is a foo bar is a big yellow boat or">>,
>> <<"^foo (\\w+(\\w* *)*) is a (\\w+(\\w* *)*)">>,
>> [global, {capture, [1,3], binary}]).
>> {match,[[<<"bar is a foo bar">>,<<"big yellow boat or">>]]}
>>
>> 2> re:run(<<"foo bar is a foo bar is a big yellow boat or sub">>,
>> <<"^foo (\\w+(\\w* *)*) is a (\\w+(\\w* *)*)">>, [global, {capture, [1,3],
>> binary}]).
>> nomatch
>>
>> I tested this regexp in Clojure and Python:
>>
>> => (re-matches #"foo (\w+(\w* *)*) is a (\w+(\w* *)*)"
>> "foo bar is a foo bar is a big yellow boat or")
>> ["foo bar is a foo bar is a big yellow boat or" "bar is a foo bar" ""
>> "big yellow boat or" ""]
>>
>> => (re-matches #"foo (\w+(\w* *)*) is a (\w+(\w* *)*)"
>> "foo bar is a foo bar is a big yellow boat or sub")
>> ["foo bar is a foo bar is a big yellow boat or sub" "bar is a foo bar" ""
>> "big yellow boat or sub" ""]
>>
>> >>> import re
>> >>> p = re.compile('foo (\w+(\w* *)*) is a (\w+(\w* *)*)')
>> >>> p.match("foo bar is a foo bar is a big yellow boat or")
>> <_sre.SRE_Match object at 0x100293c00>
>> >>> p.match("foo bar is a foo bar is a big yellow boat or sub")
>> <_sre.SRE_Match object at 0x100293ab0>
>>
>> Can someone explain why I get nomatch on the second string, "foo bar is
>> a foo bar is a big yellow boat or sub"? Is this a bug?
>>
>>
>> --
>> Петровский Александр / Alexander Petrovsky,
>>
>> Skype: askjuise
>> Jabber: juise@REDACTED
>> Phone: +7 914 8 820 815 (irkutsk)
>>
>>
>>
>> _______________________________________________
>> erlang-questions mailing list
>> erlang-questions@REDACTED
>> http://erlang.org/mailman/listinfo/erlang-questions
>>
>>
>
>
--
Петровский Александр / Alexander Petrovsky,
Skype: askjuise
Jabber: juise@REDACTED
Phone: +7 914 8 820 815 (irkutsk)