[erlang-questions] Parsing with regexp (was regexp is slow)

Jay Nelson jay@REDACTED
Thu Nov 9 02:49:19 CET 2006


Robert Virding wrote:
> The program which was mentioned was specifically for benchmarking 
> regexps, so there is not much you can do about that.
True, but a substitution which replaces a substring with nothing can be 
done by excision.
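
A minimal sketch of what excision could look like, assuming the
binary module found in current OTP releases. The module name
excise_sketch and the function excise/2 are hypothetical and exist
only for illustration:

```erlang
-module(excise_sketch).
-export([excise/2]).

%% Remove every occurrence of Pattern from Bin without running a regexp.
%% binary:split/3 with the `global` option returns the pieces that lie
%% between the matches; joining those pieces is the "excision".
excise(Bin, Pattern)
  when is_binary(Bin), is_binary(Pattern), byte_size(Pattern) > 0 ->
    Pieces = binary:split(Bin, Pattern, [global]),
    iolist_to_binary(Pieces).
```

For example, excise_sketch:excise(<<"hello cruel world">>, <<"cruel ">>)
would drop the matched substring and join what is left.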
>
> You are probably right that in some/many cases regexps are not the 
> best solution, but if it is something which people know about then 
> they will use them and we must give them a good implementation.
True, and I use them myself.  I would like to see them apply to strings 
or binaries transparently.
>
> When you talk about splitting the binary do you mean that this is done 
> internally and transparently to the user? Or does the user see the 
> list of segments? The original implementation of binaries had this 
> capability of referencing segments of a large binary. Very good for 
> pulling the binary apart. Although you could use it for splicing 
> together if you were careful.
I have been looking at it several ways, but typically an extraction from 
a big binary results in a list of segments that match the desired 
pattern.  So you would get back a list of binaries.  These could be 
turned into strings at the end, which keeps the greatest performance 
benefit while still giving a more typical result.  It uses the 
segment-referencing technique you described.  My paper last year did 
this with BIFs, saving time by precomputing and allocating all needed 
memory.  I am trying now with pure Erlang because the new binary 
comprehensions may provide enough performance to skip the BIFs, and I 
wanted the flexibility to change the API as I learn what is useful and 
what doesn't work.
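
To illustrate the idea of returning matches as a list of binaries,
here is a rough sketch. It is not the library described here; it
assumes the binary module from current OTP releases, and the names
segments_sketch, segments/2 and to_strings/1 are hypothetical:

```erlang
-module(segments_sketch).
-export([segments/2, to_strings/1]).

%% Return every occurrence of Pattern in Bin as a binary.
%% binary:matches/2 yields {Start, Length} pairs, and binary:part/2
%% turns each pair into a binary covering that region of Bin.
%% Pattern may be a single binary or a list of alternative binaries.
segments(Bin, Pattern) when is_binary(Bin) ->
    [binary:part(Bin, PosLen) || PosLen <- binary:matches(Bin, Pattern)].

%% Convert to strings only once, at the very end, when the caller
%% actually needs lists of characters.
to_strings(Segments) ->
    [binary_to_list(S) || S <- Segments].
```

The caller works with binaries throughout and pays for the list
conversion only if and when to_strings/1 is called.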

My latest effort involves splitting a large binary into a series of 
records efficiently.  This allows for easy use of flat files as data 
sources, or for more convenient parsing of a binary protocol.  I am 
still working on a good API and solid documentation before I release 
this library.
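
As a sketch of what splitting a binary into records might look like
(this is not the unreleased library's API; records_sketch, fixed/2
and lines/1 are hypothetical names, and lines/1 assumes the binary
module from current OTP releases):

```erlang
-module(records_sketch).
-export([fixed/2, lines/1]).

%% Split Bin into fixed-size records of Size bytes each, using a
%% binary generator inside a list comprehension.  Trailing bytes that
%% do not fill a whole record are dropped.
fixed(Bin, Size) when is_binary(Bin), is_integer(Size), Size > 0 ->
    [Rec || <<Rec:Size/binary>> <= Bin].

%% Split a flat file's contents into newline-delimited records.
%% `trim` removes the empty piece left by a trailing newline.
lines(Bin) when is_binary(Bin) ->
    binary:split(Bin, <<"\n">>, [global, trim]).
```

fixed/2 would suit a binary protocol with fixed-width frames, while
lines/1 would suit a flat text file used as a data source.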

If the library stabilizes and proves acceptable to the community, I 
will go back to looking at BIFs that could improve the performance of 
the library.  They would be available for others to use, but most likely 
the library API would be the interface level and the BIFs would be 
hidden and called inside the library.

jay



