[erlang-questions] Parsing with regexp (was regexp is slow)

Thu Nov 9 02:32:09 CET 2006

OK, a few comments.

The program which was mentioned was specifically for bench-marking 
regexps, so there is not much you can do about that.

You are probably right that in some/many cases regexps are not the best 
solution, but if it is something which people know about then they will 
use them and we must give them a godd implementation.

When you talk about spliting the binary do you mean that this is done 
internally and transparently to the user? Or does the user see the list 
of segments? The origianl implementation of binaries had this capability 
of referencing segments of a large binary. Very good for pulling the 
binary apart. Although you could use it for splicing together if you 
were careful.

Robert

Jay Nelson wrote:
>
> Robert Virding wrote:
> 
> 
>>In this case it wouldn't help at all as the result from each 
>>substitution is passed directly into a new subst. You loop over a 
>>pre-defined set of substitutions threading the result through.
> 
> 
> 
> The goal of the paper was to come up with a very efficient way to
> handle exactly these types of situations, where a subset of a large
> piece of data needs to be found / extracted and then transformed in
> some way.  Using Joe's "inefficient coding style" works for small
> problems, but the heap quickly blows up when things get converted
> to lists.  If you filter them and garbage collect you might get by
> but not if the data source is 10s of MBs.
> 
> regexps are concise and convenient in a lot of situations, but there
> are two things I don't like: when the pattern gets to be so complicated
> that it can no longer be understood, and when it leads to particularly
> inefficient handling of data.  As long as the data size isn't too big,
> the latter case shouldn't matter, unless it needs to execute many times.
> 
> In erlang, a more efficient approach (and the one I tried to embrace
> in the paper), was to read the full data source as a single large binary.
> Any filtering or extraction was done so that the result was a list of
> sub-binaries pointing to the original data (essentially a struct with
> a pointer and a length which can be constructed quickly) without creating
> any intermediate structures or any garbage to collect.  The filtered
> list could then be transformed, resulting in computational effort only
> being taken on the elements of the data stream that mattered.
> 
> The fact that regexp uses substitution is an artifact of implementation
> or of sticking to the user's understanding of the model.  It is not
> necessary to solve the problem.
> 
> The sub-binary approach led to the natural desire for binary comprehensions.
> When that is soon available in the language, this approach will be very clear
> and reasonably efficient (although not as efficient as the BIFs, which
> really screamed).
> 
> I have been working more on patterns, parsing and extraction, but I won't
> have anything to say about it publicly for a few more months.