[erlang-questions] Parsing with regexp (was regexp is slow)
Jay Nelson
jay@REDACTED
Thu Nov 9 02:49:19 CET 2006
Robert Virding wrote:
> The program which was mentioned was specifically for bench-marking
> regexps, so there is not much you can do about that.
True, but a substitution which replaces a substring with nothing can be
done by excision.
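A rough sketch of what excision could look like, with no regexp
involved: split the binary at each occurrence of the literal substring
and join the pieces that remain. The module name excise and the
function remove/2 are made up for illustration, and the sketch assumes
the binary module that ships with later Erlang/OTP releases.

```erlang
-module(excise).
-export([remove/2]).

%% Remove every occurrence of the literal Pattern from Bin.
%% binary:split/3 with [global] returns the pieces between matches;
%% joining them drops the matched substrings.
%% Pattern must be non-empty, otherwise binary:split/3 raises badarg.
remove(Bin, Pattern) when is_binary(Bin), is_binary(Pattern) ->
    Pieces = binary:split(Bin, Pattern, [global]),
    iolist_to_binary(Pieces).
```

For example, excise:remove(<<"a--b--c">>, <<"--">>) returns <<"abc">>.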
>
> You are probably right that in some/many cases regexps are not the
> best solution, but if it is something which people know about then
> they will use them and we must give them a good implementation.
True and I use them myself. I would like to see them apply to strings
or binaries transparently.
>
> When you talk about spliting the binary do you mean that this is done
> internally and transparently to the user? Or does the user see the
> list of segments? The original implementation of binaries had this
> capability of referencing segments of a large binary. Very good for
> pulling the binary apart. Although you could use it for splicing
> together if you were careful.
I have been looking at it several ways, but typically an extraction from
a big binary results in a list of segments that match the desired
pattern, so you would get back a list of binaries. These could be
turned into strings only at the end, which keeps most of the performance
benefit while still giving a more typical result. This uses the
segment-referencing technique you described. My paper last year did
this with BIFs, saving time by precomputing and allocating all needed
memory. I am now trying pure Erlang because the new binary
comprehensions may provide enough performance to skip the BIFs, and I
wanted the flexibility to change the API as I learn what is useful and
what doesn't work.
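To make the list-of-segments idea concrete, here is a minimal
pure-Erlang sketch, not the library discussed here. The module name
segments and the functions fields/2 and strings/2 are hypothetical. It
splits a binary on a single delimiter byte using bit-syntax matching,
so the runtime can represent each field as a reference into the
original binary rather than a copy, and it converts to strings only as
a final step.

```erlang
-module(segments).
-export([fields/2, strings/2]).

%% Split Bin on the delimiter byte Delim (an integer such as $,).
%% Each field is produced by binary matching, i.e. as a sub-binary.
fields(Bin, Delim) when is_binary(Bin), is_integer(Delim) ->
    fields(Bin, Delim, 0, []).

fields(Bin, Delim, Len, Acc) ->
    case Bin of
        %% Delimiter found after Len bytes: emit the field, restart on Rest.
        <<Field:Len/binary, Delim, Rest/binary>> ->
            fields(Rest, Delim, 0, [Field | Acc]);
        %% Some other byte at offset Len: keep scanning.
        <<_:Len/binary, _, _/binary>> ->
            fields(Bin, Delim, Len + 1, Acc);
        %% End of input: whatever is left is the last field.
        _ ->
            lists:reverse([Bin | Acc])
    end.

%% Convert to strings only at the end, once the segments are known.
strings(Bin, Delim) ->
    [binary_to_list(F) || F <- fields(Bin, Delim)].
```

For example, segments:fields(<<"a,b,,c">>, $,) returns
[<<"a">>,<<"b">>,<<>>,<<"c">>], and segments:strings/2 on the same input
returns ["a","b","","c"].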
My latest effort involves splitting a large binary into a series of
records efficiently. This allows for easy use of flat files as data
sources, or for more convenient parsing of a binary protocol. I am
still working on a good API and solid documentation before I release
this library.
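Since that library is not released, the following is only a guess at
the general shape of record splitting, with made-up names (module
flat_records, functions fixed/2 and lines/1). It shows fixed-width
records pulled out with a binary generator in a comprehension, and
newline-delimited records as in a flat file. The lines/1 function
assumes the binary module from later Erlang/OTP releases.

```erlang
-module(flat_records).
-export([fixed/2, lines/1]).

%% Fixed-width records: the binary generator walks Bin Size bytes at a
%% time. Trailing bytes that do not fill a whole record are dropped.
fixed(Bin, Size) when is_binary(Bin), is_integer(Size), Size > 0 ->
    [Rec || <<Rec:Size/binary>> <= Bin].

%% Newline-delimited records; the trim option drops empty trailing
%% parts, such as the one after a final newline.
lines(Bin) when is_binary(Bin) ->
    binary:split(Bin, <<"\n">>, [global, trim]).
```

For example, flat_records:fixed(<<"aaabbbcc">>, 3) returns
[<<"aaa">>,<<"bbb">>], and flat_records:lines(<<"a\nb\n">>) returns
[<<"a">>,<<"b">>].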
If the library stabilizes and proves acceptable to the community, I
will go back to looking at BIFs that could improve the performance of
the library. They would be available for others to use, but most likely
the library API would be the interface level and the BIFs would be
hidden and called inside the library.
jay