[erlang-questions] Fast regular expression implementation
Gaspar Chilingarov
nm@REDACTED
Thu Dec 21 09:26:42 CET 2006
Yariv Sadan wrote:
> Hi Gaspar,
>
> Have you ran any benchmarks comparing your implementation to the OTP
> regexp and/or the revised on on trapexit? Also, can you please give us
> a hint as to what makes your implementation faster?
Well. I would describe in short - because it's really too much changes
there.
First of all - after parsing RE to tree I begin to optimize it's
structure, joining consequential chars to strings,
char_class, comp_class optimizations - there are separate cases for
1 char - it is stored as just 1 integer
1 range of symbols - it is stored as a tuple
until 30 non adjacent symbols -- it is stored as a list
more than 30 non adjacent symbols in a char class -- they are converted
to tuple of size 256 with true/false values - for faster lookups
there is also some internal transformations of the RE, so after
transform it is a list with following elements:
{const, "string"} -- match a constant string
{kclosure, RE} -- match RE 0 or more times (* quantifier)
{kclosure_ng, RE} -- match RE 0 or more times -- non greedy (*? quantifier)
{closure, {0, Max}, RE} -- match RE from 0 to Max times ({N,M}
quantifier -- it's quite useful and implemented in perl regexps)
{char_class, Class}, {comp_class, Class} - character class and it's
negation (for instance [a-z] and [^a-z0-9] classes)
{match, RE} - extract submatch corresponding to embedded RE
any_symbol - special case for . symbol (matches anything but \n)
all RE embedded in a commands are also lists of same commands.
This is more convenient structure to process, than tree which is
constructed with nested tuples.
Second part is a RE interpreter optimization:
First speed up was that regexp/gregexp do practically linear scan of
incoming string, in case if you can just exactly to some position --
i.e. if you have constant string in the begin of the RE, you can
directly just to it's first match and try to match whole RE there. Old
modules tried to cut one symbol if RE is not matched and try matching
again. The same is done it the first part of the pattern is a char_class
or comp_class.
char class match optimizations -- I've described above -- for different
char_class sizes different structures are used.
{kclosure, RE_kclos} optimization -- (g)regexp were trying to execute
BOTH paths (with pattern and without pattern) and find maximal match.
This highly recursive call (not a tail recursion, btw) is replaced with
tail recursion call, which tries match one pattern a time and then
decide go apply more RE_kclos patterns or try match remaining RE.
{kclosure_ng, RE_kclos) (non greedy) was the easiest in optimization -
try remaining RE, if we matched -- we are finished, if not -- try apply
RE_kclos, if it matched -- apply RE again, if not -- report that we
cannot match pattern in this position. It's up to previous expressions
in a chain to decide how to change match position.
Same optimization is done for {closure, {0, Max}, RE} expression, which
is matched in a greedy manner. There is no non-gredy implementation now,
but it should be added.
There is also special optimizations for cases:
.*?string
.*?(string)
so, .*? part, which in practice matches any symbol is just jumped over
and not matched by one symbol every time. In this case . means really -
any symbol, even \n, so I'm thinking of changing any_symbol
specification and make it match any symbol.
About benchmars:
regexp/gregexp were hanging on my task :) I where matching
text.*?text1.*?text2.*?text3 pattern to 18 Kb file.
I've tried to use profiler -- fprof just hang after eating several gigs
of trace.log, cover was more successful.
Regexp/gregexp called re_apply_list (main entry function in a re
matching loop) about 3,200,000 times on a 18Kb file.
My regexp implementatin for the same pattern called re_apply_list
for class=g.*?<a\s+class=l\s+href=\"(.*?)\">(.*?)</a> pattern on a same
18Kb file about 160 times)
re.erl now spends most of it's time in the lists:nthtail and strings:str
functions and this is difficult to optimize more.
I had also major slowdown in extracting string matches/submatches (m*g,
m*gg functions), but now it is solved also -- and you are recommended to
use them to extract submatches -- because it will works only on the
remainder of the string, which decreases scan time on the list -- and
thus produce results faster).
That's all :)
PS. I should put copy of this into re.erl as a documentation ;)
/Gaspar
--
Gaspar Chilingarov
System Administrator,
Network security consulting
t +37493 419763 (mob)
i 63174784
e nm@REDACTED
More information about the erlang-questions
mailing list