[erlang-questions] Parsing with regexp (was regexp is slow)

Jay Nelson jay@REDACTED
Tue Nov 7 04:54:23 CET 2006

Thomas Lindgren wrote:

>/ i haven't looked at the precise problem, but would it
/>/ help to return a list of binaries instead?


/>/ In the larger scheme of things, it might be nice to
/>/ have some way to stream large binaries (aka map, fold,
/>/ ...) more transparently. Maybe Jay Nelson's paper at
/>/ the 2005 workshop could be a starting point?

Robert Virding wrote:

> In this case it wouldn't help at all as the result from each 
> substitution is passed directly into a new subst. You loop over a 
> pre-defined set of substitutions threading the result through.

The goal of the paper was to come up with a very efficient way to
handle exactly these types of situations, where a subset of a large
piece of data needs to be found / extracted and then transformed in
some way.  Using Joe's "inefficient coding style" works for small
problems, but the heap quickly blows up when things get converted
to lists.  If you filter them and garbage collect you might get by
but not if the data source is 10s of MBs.

regexps are concise and convenient in a lot of situations, but there
are two things I don't like: when the pattern gets to be so complicated
that it can no longer be understood, and when it leads to particularly
inefficient handling of data.  As long as the data size isn't too big,
the latter case shouldn't matter, unless it needs to execute many times.

In erlang, a more efficient approach (and the one I tried to embrace
in the paper), was to read the full data source as a single large binary.
Any filtering or extraction was done so that the result was a list of
sub-binaries pointing to the original data (essentially a struct with
a pointer and a length which can be constructed quickly) without creating
any intermediate structures or any garbage to collect.  The filtered
list could then be transformed, resulting in computational effort only
being taken on the elements of the data stream that mattered.

The fact that regexp uses substitution is an artifact of implementation
or of sticking to the user's understanding of the model.  It is not
necessary to solve the problem.

The sub-binary approach led to the natural desire for binary comprehensions.
When that is soon available in the language, this approach will be very clear
and reasonably efficient (although not as efficient as the BIFs, which
really screamed).

I have been working more on patterns, parsing and extraction, but I won't
have anything to say about it publicly for a few more months.


More information about the erlang-questions mailing list