[erlang-questions] Fast regular expression implementation

Thu Dec 21 09:26:42 CET 2006

Yariv Sadan wrote:
> Hi Gaspar,
> 
> Have you ran any benchmarks comparing your implementation to the OTP
> regexp and/or the revised on on trapexit? Also, can you please give us
> a hint as to what makes your implementation faster?

Well. I would describe in short - because it's really too much changes 
there.

First of all - after parsing RE to tree I begin to optimize it's 
structure, joining consequential chars to strings,

char_class, comp_class optimizations - there are separate cases for
1 char - it is stored as just 1 integer
1 range of symbols - it is stored  as  a tuple
until 30 non adjacent symbols  -- it is stored as a list
more than 30 non adjacent symbols in a char class  -- they are converted 
to tuple of size 256 with true/false values - for faster lookups

there is also some internal transformations of the RE, so after 
transform it is a list with following elements:

{const, "string"} -- match a constant string
{kclosure, RE} -- match RE 0 or more times (* quantifier)
{kclosure_ng, RE} -- match RE 0 or more times -- non greedy (*? quantifier)
{closure, {0, Max}, RE} -- match RE from 0 to Max times ({N,M} 
quantifier -- it's quite useful and implemented in perl regexps)
{char_class, Class}, {comp_class, Class} - character class and it's 
negation (for instance [a-z] and [^a-z0-9] classes)
{match, RE} - extract submatch corresponding to embedded RE
any_symbol - special case for . symbol (matches anything but \n)

all RE embedded in a commands are also lists of same commands.

This is more convenient structure to process, than tree which is 
constructed with nested tuples.

Second part is a RE interpreter optimization:

First speed up was that regexp/gregexp do practically linear scan of 
incoming string, in case if you can just exactly to some position -- 
i.e. if you have constant string in the begin of the RE, you can 
directly just to it's first match and try to match whole RE there. Old 
modules tried to cut one symbol if RE is not matched and try matching 
again. The same is done it the first part of the pattern is a char_class 
or comp_class.

char class match optimizations -- I've described above -- for different 
char_class sizes different structures are used.

{kclosure, RE_kclos} optimization -- (g)regexp were trying to execute 
BOTH paths (with pattern and without pattern) and find maximal match. 
This highly recursive call (not a tail recursion, btw) is replaced with 
tail recursion call, which tries match one pattern a time and then 
decide go apply more RE_kclos patterns or try match remaining RE.

{kclosure_ng, RE_kclos) (non greedy) was the easiest in optimization - 
try remaining RE, if we matched -- we are finished, if not -- try apply 
RE_kclos, if it matched -- apply RE again, if not -- report that we 
cannot match pattern in this position. It's up to previous expressions
in a chain to decide how to change match position.

Same optimization is done for {closure, {0, Max}, RE} expression, which 
is matched in a greedy manner. There is no non-gredy implementation now, 
but it should be added.

There is also special optimizations for cases:
.*?string
.*?(string)

so, .*? part, which in practice matches any symbol is just jumped over 
and not matched by one symbol every time. In this case . means really - 
any symbol, even \n, so I'm thinking of changing any_symbol 
specification and make it match any symbol.

About benchmars:

regexp/gregexp were hanging on my task :) I where matching
text.*?text1.*?text2.*?text3 pattern to 18 Kb file.

I've tried to use profiler -- fprof just hang after eating several gigs 
of trace.log, cover was more successful.

Regexp/gregexp called re_apply_list (main entry function in a re 
matching loop) about 3,200,000 times on a 18Kb file.
My regexp implementatin for the same pattern called re_apply_list
for class=g.*?<a\s+class=l\s+href=\"(.*?)\">(.*?)</a> pattern on a same 
18Kb file about 160 times)

re.erl now spends most of it's time in the lists:nthtail and strings:str 
functions and this is difficult to optimize more.

I had also major slowdown in extracting string matches/submatches (m*g, 
m*gg functions), but now it is solved also -- and you are recommended to 
use them to extract submatches -- because it will works only on the 
remainder of the string, which decreases scan time on the list -- and 
thus produce results faster).

That's all :)

PS. I should put copy of this into re.erl as a documentation ;)

/Gaspar

-- 
Gaspar Chilingarov

System Administrator,
Network security consulting

t +37493 419763 (mob)
i 63174784
e nm@REDACTED