[erlang-questions] regexp is slow

Mon Nov 6 23:40:06 CET 2006

Thomas Lindgren wrote:
> 
> --- Robert Virding <robert.virding@REDACTED> wrote:
> 
> 
>>2. counting occurring patterns regexp:matches
>>
>>Counting patterns was no problem, that went fast. It
>>was the 
>>substitution part that was taking most of the time.
>>Not finding the 
>>matching parts of the data but doing the actual
>>substitutions. This is 
>>one part of the code which must be improved, it was
>>never considered 
>>that it would process such large amounts of data.
>>Having the data in a 
>>binary would definitely NOT help here, it would
>>result in an enormous 
>>amount of copying.
> 
> 
> I haven't looked at the precise problem, but would it
> help to return a list of binaries instead? Cut out the
> match and put in the substitute instead:
> 
>   [...,
>   <<"before match">>, 
>   <<"substitution">>, 
>   <<"between matches">>, 
>   <<"next subst">>,
>   ...
>   ]
> 
> In the larger scheme of things, it might be nice to
> have some way to stream large binaries (aka map, fold,
> ...) more transparently. Maybe Jay Nelson's paper at
> the 2005 workshop could be a starting point?

In this case it wouldn't help at all as the result from each 
substitution is passed directly into a new subst. You loop over a 
pre-defined set of substitutions threading the result through.
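
The loop described above can be sketched as a fold over the list of 
substitutions. This is only an illustration: it assumes the re module 
from later OTP releases rather than the regexp module discussed in this 
thread, and the names subst_chain and apply_all/2 are made up for the 
example.

```erlang
-module(subst_chain).
-export([apply_all/2]).

%% Thread Data through every {Pattern, Replacement} pair, in order.
%% Each step returns a binary, so the result of one substitution can be
%% passed directly into the next one.
apply_all(Data, Substs) ->
    lists:foldl(
      fun({Pattern, Replacement}, Acc) ->
              re:replace(Acc, Pattern, Replacement,
                         [global, {return, binary}])
      end,
      Data,
      Substs).
```

For example, subst_chain:apply_all(<<"aXbXc">>, [{"X", "--"}, {"--b", "B"}]) 
first produces <<"a--b--c">> and then <<"aB--c">>. Every step builds a 
complete new copy of the data, which is where the copying cost comes from.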

Both the problem and the way it is to be solved are defined:

1. read in the file and pass through a subst which removes all newlines 
and some specific lines.

2. count the number of occurrences of about 10 regexps.

3. do a sequence of substitutions, about 10-12, on the data.

4. write out some statistics.
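
Step 2 in the list above could look roughly like this. It is a sketch 
under assumptions: it uses re:run/3 from the later re module, not 
regexp:matches, and the module name count_matches and both function 
names are hypothetical.

```erlang
-module(count_matches).
-export([count/2, count_all/2]).

%% Number of non-overlapping matches of Pattern in Data.
count(Data, Pattern) ->
    case re:run(Data, Pattern, [global]) of
        {match, Matches} -> length(Matches);
        nomatch          -> 0
    end.

%% Count each pattern separately; returns [{Pattern, Count}].
count_all(Data, Patterns) ->
    [{P, count(Data, P)} || P <- Patterns].
```

Counting only reads the data, so it does not pay the copying cost that 
the substitutions do.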

This means that you either need <format> -> <format> operations 
(lists, binaries, whatever), so that each result can be fed straight 
into the next step, or be able to handle complex combinations of formats.
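
As a rough illustration of the first option, here is a single step that 
accepts any iodata (including a list of binaries like the one suggested 
above) and always hands back a flat binary. It assumes the later re 
module, whose re:replace/4 returns iodata by default; the names 
iodata_step and step/3 are invented for the example.

```erlang
-module(iodata_step).
-export([step/3]).

%% One substitution step: iodata in, flat binary out.
step(Data, Pattern, Replacement) ->
    %% Without a {return, ...} option the result is iodata, i.e. a
    %% possibly nested list of binaries.
    Pieces = re:replace(Data, Pattern, Replacement, [global]),
    %% Flatten so that the next step always sees the same format.
    iolist_to_binary(Pieces).
```

Calling iodata_step:step([<<"foo ">>, <<"bar">>], "o", "0") gives 
<<"f00 bar">>. The flattening is itself a copy, so this keeps the 
formats uniform but does not remove the cost.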

Robert