pcre, bifs, drivers and ports

Robert Virding robert.virding@REDACTED
Tue Aug 8 23:17:26 CEST 2006


Willem de Jong skrev:
> I think regexp is too slow on exactly the kind of pattern that was
> used in the 'benchmark': patterns that contain a long(ish) literal
> string, especially if  they start with it.
> 
> Regexp:parse() translates the pattern to a compiled RE that looks like
> this (for ABC): {concat, concat{A, B}, C}. As long as no match is
> found, the algorithm tries to match every character against the
> compiled RE. For every character, it has to traverse the tree down to
> the A. If the pattern is longer, the tree grows 'deeper' (one
> additional level with each additional character), and this process
> takes more time.

This way of parsing the regexp follows the definition: concatenation is 
left associative.

> The solution (I think) is to change the result of regexp:parse to
> something that looks like this: {concat, A, concat{B, C}}. Then it
> will find the A much quicker.

I don't think that transforming it in this way changes the semantics but 
I have not checked it. Does anyone know? If it doesn't then it should 
probably be done directly in the parser.

> I wrote a 'post-processing' module that modifies the compiled RE as
> described. I also changed a few things to the main function that does
> the matching (re_apply), but I am not sure whether that was really
> necessary. (If you want I can send you the code).
> 
> On my PC it is not so easy to measure response-time, but my impression
> is that this improved the result for the benchmark by a factor 3.

Please send me the core and I will check it out. I was surprised by the 
speed-up you mention. If it is of this order then it would be worth the 
effort. Given it doesn't change the semantics.

Robert



More information about the erlang-questions mailing list