[erlang-questions] regexp is slow
Robert Virding
robert.virding@REDACTED
Mon Nov 6 23:40:06 CET 2006
Thomas Lindgren wrote:
>
> --- Robert Virding <robert.virding@REDACTED> wrote:
>
>
>>2. counting occurring patterns regexp:matches
>>
>>Counting patterns was no problem, that went fast. It was the
>>substitution part that was taking most of the time. Not finding the
>>matching parts of the data but doing the actual substitutions. This is
>>one part of the code which must be improved, it was never considered
>>that it would process such large amounts of data. Having the data in a
>>binary would definitely NOT help here, it would result in an enormous
>>amount of copying.
>
>
> I haven't looked at the precise problem, but would it
> help to return a list of binaries instead? Cut out the
> match and put in the substitute instead:
>
> [...,
> <<"before match">>,
> <<"substitution">>,
> <<"between matches">>,
> <<"next subst">>,
> ...
> ]
>
> In the larger scheme of things, it might be nice to
> have some way to stream large binaries (aka map, fold,
> ...) more transparently. Maybe Jay Nelson's paper at
> the 2005 workshop could be a starting point?

In this case it wouldn't help at all, as the result of each
substitution is passed directly into the next subst. You loop over a
pre-defined set of substitutions, threading the result through.

Both the problem and the way it is to be solved are defined:

1. read in the file and pass it through a subst which removes all newlines
   and some specific lines.
2. count the number of occurrences of about 10 regexps.
3. do a sequence of substitutions, about 10-12, on the data.
4. write out some statistics.

This means that you either need to have <format> -> <format> operations
(lists, binaries, whatever) or be able to handle complex combinations.
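
As a rough illustration of steps 1 to 3, here is a minimal sketch of threading data through a fixed list of substitutions. It is not the code under discussion: it assumes the `re` module from current Erlang/OTP rather than the `regexp` module this thread is about, and the module name `subst_pipeline`, the function names, and the sample patterns are all made up for the example. The `{return, binary}` option is what keeps each step a binary -> binary operation, so the result of one substitution can be passed directly to the next.

```erlang
-module(subst_pipeline).
-export([strip_newlines/1, count_matches/2, subst_all/2]).

%% Step 1 (simplified): remove all newlines from the input.
%% Removing "some specific lines" is left out of this sketch.
strip_newlines(Data) ->
    re:replace(Data, <<"\n">>, <<>>, [global, {return, binary}]).

%% Step 2: count the occurrences of each pattern in the data.
count_matches(Data, Patterns) ->
    [{P, count(Data, P)} || P <- Patterns].

%% With the global option, re:run/3 returns one entry per match.
count(Data, Pattern) ->
    case re:run(Data, Pattern, [global]) of
        {match, Matches} -> length(Matches);
        nomatch -> 0
    end.

%% Step 3: thread the data through a sequence of substitutions.
%% Substs is a list of {Pattern, Replacement} pairs (hypothetical shape).
%% Each step returns a binary, which becomes the input of the next step.
subst_all(Data, Substs) ->
    lists:foldl(
      fun({Pattern, Replacement}, Acc) ->
              re:replace(Acc, Pattern, Replacement,
                         [global, {return, binary}])
      end,
      Data, Substs).
```

For example, `subst_pipeline:subst_all(<<"aabab">>, [{<<"a+">>, <<"X">>}, {<<"b">>, <<"YY">>}])` gives `<<"XYYXYY">>`. Without the `return` option, `re:replace/4` returns iodata, which is close to the list-of-binaries idea above and would still be accepted as the subject of the next call, but each step would then have to deal with a nested structure rather than a single flat value.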
Robert