pcre, bifs, drivers and ports

Wed Aug 9 19:48:07 CEST 2006

Hello Robert,

I looked again: for the pattern in the 'benchmark' ('con[a-z]*') it is
more than twice as fast. For longer patterns the difference gets
bigger; if the literal string is 5 characters long it is about 3 times
as fast (disclaimer: on my PC response time is difficult to measure
exactly, I had to use a very long string to get meaningful results).

I also tried what how long it would take to search for a match when
using a finite state machine that was written specifically for the
pattern from the benchmark. This was MUCH faster still. It gives me
the impression that it should be possible to make further
improvements.

Regards,
Willem

On 8/8/06, Robert Virding <robert.virding@REDACTED> wrote:
> Willem de Jong skrev:
> > I think regexp is too slow on exactly the kind of pattern that was
> > used in the 'benchmark': patterns that contain a long(ish) literal
> > string, especially if  they start with it.
> >
> > Regexp:parse() translates the pattern to a compiled RE that looks like
> > this (for ABC): {concat, concat{A, B}, C}. As long as no match is
> > found, the algorithm tries to match every character against the
> > compiled RE. For every character, it has to traverse the tree down to
> > the A. If the pattern is longer, the tree grows 'deeper' (one
> > additional level with each additional character), and this process
> > takes more time.
>
> This way of parsing the regexp follows the definition: concatenation is
> left associative.
>
> > The solution (I think) is to change the result of regexp:parse to
> > something that looks like this: {concat, A, concat{B, C}}. Then it
> > will find the A much quicker.
>
> I don't think that transforming it in this way changes the semantics but
> I have not checked it. Does anyone know? If it doesn't then it should
> probably be done directly in the parser.
>
> > I wrote a 'post-processing' module that modifies the compiled RE as
> > described. I also changed a few things to the main function that does
> > the matching (re_apply), but I am not sure whether that was really
> > necessary. (If you want I can send you the code).
> >
> > On my PC it is not so easy to measure response-time, but my impression
> > is that this improved the result for the benchmark by a factor 3.
>
> Please send me the core and I will check it out. I was surprised by the
> speed-up you mention. If it is of this order then it would be worth the
> effort. Given it doesn't change the semantics.
>
> Robert
>

I post-processed the RE using the function below. Obviously the parser
should do this directly, this is just a hack.

-----------------
%% translates {concat, concat{A, B}, C} to
%% {concat, A, concat{B, C}}
%% First make a list ABC, and then form it into the second form.
postProcess(RE) ->
  reconstruct(deconstruct(RE)).

deconstruct({concat, A, B}) ->
  deconstruct(A) ++ deconstruct(B);

deconstruct({Type, A, B}) ->
  [{Type, deconstruct(A), deconstruct(B)}];

deconstruct(A) ->
  [A].

reconstruct({Type, A, B}) ->
  {Type, reconstruct(A), reconstruct(B)};

reconstruct([A, B]) when B /= [] ->
  {concat, reconstruct(A), reconstruct(B)};
reconstruct([A | Tail]) when Tail /= [] ->
  {concat, reconstruct(A), reconstruct(Tail)};

reconstruct([A]) ->
  A;
reconstruct(A) ->
  A.
-------------------
It looks like you can use such a re-written RE with the existing match
function, and improve speed by a factor of around 1.5 (but I didn't
test this extensively).

I modified re_apply as well. In its original form, the 'nomatch'
result is only found if all other patterns fail. It is sort of a
catch-all:

re_apply(_RE, _More, _S, _P) -> nomatch.

I don't know how the matching algorithm works, but I guess that this
is expensive: all other patterns have to be tried first, and this
happens a lot.

So I changed the re_apply() function as shown below. Note that this
may slow down some cases where there are few literal characters in the
RE (but only a bit).

Both changes together give significant improvements for long literal
strings in the RE. (But of course this has to be tested etc...)

------------------------
%% re_apply is the main function.
re_apply(S, St, RE) -> re_apply(RE, [], S, St).

re_apply(epsilon, More, S, P) ->		%This always matches
    re_apply_more(More, S, P);
re_apply({'or',RE1,RE2}, More, S, P) ->
    re_apply_or(re_apply(RE1, More, S, P),
		re_apply(RE2, More, S, P));

%% WdJ:
%% in stead of making the no-match situation the pattern that
%% only matches if all else fails, we look for it.
%% if I remember correctly, the main thing I changed to re_apply is adding
%% the 4 lines below.
re_apply({concat,C,RE2}, More, [C|S], P) ->
    re_apply(RE2, More, S, P+1);
re_apply({concat, C, _}, _More, _, _P) when integer(C) ->
   nomatch;

re_apply({concat,RE1,RE2}, More, S0, P) ->
    re_apply(RE1, [RE2|More], S0, P);
re_apply({kclosure,CE}, More, S, P) ->
    %% Be careful with the recursion, explicitly do one call before
    %% looping.
    re_apply_or(re_apply_more(More, S, P),
		re_apply(CE, [{kclosure,CE}|More], S, P));
re_apply({pclosure,CE}, More, S, P) ->
    re_apply(CE, [{kclosure,CE}|More], S, P);
re_apply({optional,CE}, More, S, P) ->
    re_apply_or(re_apply_more(More, S, P),
		re_apply(CE, More, S, P));
re_apply(bos, More, S, 1) -> re_apply_more(More, S, 1);
re_apply(eos, More, [$\n|S], P) -> re_apply_more(More, S, P);
re_apply(eos, More, [], P) -> re_apply_more(More, [], P);
re_apply({char_class,Cc}, More, [C|S], P) ->
    case in_char_class(C, Cc) of
	true -> re_apply_more(More, S, P+1);
	false -> nomatch
    end;
re_apply({comp_class,Cc}, More, [C|S], P) ->
    case in_char_class(C, Cc) of
	true -> nomatch;
	false -> re_apply_more(More, S, P+1)
    end;
re_apply(C, More, [C|S], P) when integer(C) ->
    re_apply_more(More, S, P+1);
re_apply(_RE, _More, _S, _P) -> nomatch.

%% re_apply_more([RegExp], String, Length) -> re_app_res().

%% WdJ: I believe I made a small change here as well.
re_apply_more([RE|More], S, P) -> re_apply(RE, More, S, P);
re_apply_more([], S, P) -> {match,P,S}.