[eeps] EEP 9

Fredrik Svahn fredrik.svahn@REDACTED
Wed Mar 5 23:42:43 CET 2008


>  i'll backpedal a bit here. what i really think should be avoided is
> backtracking. quoting from Cox (http://swtch.com/~rsc/regexp/regexp1.html);
>  "Regular expression matching can be simple and fast, using finite
> automata-based techniques that have been known for decades. In contrast,
> Perl, PCRE, Python, Ruby, Java, and many other languages have regular
> expression implementations based on recursive backtracking that are
> simple but can be excruciatingly slow."

This is indeed an excellent article, and I was quite concerned when I
read it as well. After looking a bit further into PCRE I realised that
PCRE seems to come with two alternative matching functions, an NFA
based and a DFA based

The NFA based clearly exhibits exponential behaviour as expected for
the nasty search string used as an example in the article. It will
eventually return an error -8 which means the backtracking limit has
been reached:

$ time ./pcredemo "(a?){19}a{19}" "aaaaaaaaaaaaaaaaaaa"
Match succeeded at offset 0
real 0m0.270s user 0m0.270s sys 0m0.000s
$ time ./pcredemo "(a?){20}a{20}" "aaaaaaaaaaaaaaaaaaaa"
Match succeeded at offset 0
real 0m0.519s user 0m0.520s sys 0m0.000s
$ time ./pcredemo "(a?){21}a{21}" "aaaaaaaaaaaaaaaaaaaaa"
Match succeeded at offset 0
real 0m1.039s user 0m1.030s sys 0m0.000s
$ time ./pcredemo "(a?){22}a{22}" "aaaaaaaaaaaaaaaaaaaaaa"
Matching error -8
real 0m1.652s user 0m1.650s sys 0m0.000s

The DFA based function, however, does not seem to exhibit any
exponential behaviour:

$ time ./pcredemo_dfa "(a?){21}a{21}" "aaaaaaaaaaaaaaaaaaaaa"
Match succeeded at offset 0
real 0m0.005s user 0m0.000s sys 0m0.000s
$ time ./pcredemo_dfa "(a?){22}a{22}" "aaaaaaaaaaaaaaaaaaaaaa"
Match succeeded at offset 0
real 0m0.005s user 0m0.000s sys 0m0.000s
$ time ./pcredemo_dfa "(a?){23}a{23}" "aaaaaaaaaaaaaaaaaaaaaaa"
Match succeeded at offset 0
real 0m0.006s user 0m0.010s sys 0m0.000s
$ time ./pcredemo_dfa "(a?){30}a{30}" "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaa"
Match succeeded at offset 0
real 0m0.005s user 0m0.000s sys 0m0.010s

Maybe there should be an option { dfa | nfa } making it possible to
indicate which of the algorithms you want to use? The NFA one has more
features according to the manual pages: http://www.pcre.org/pcre.txt

BR /Fredrik



More information about the eeps mailing list