[erlang-questions] Regular expression library (was Not an Erlang fan)

Robert Virding rvirding@REDACTED
Wed Oct 3 23:32:11 CEST 2007


That's not too difficult to do at all. The speed you get would depend on
which regexp family you do it for.

Robert

On 27/09/2007, Yariv Sadan <yarivsadan@REDACTED> wrote:
>
> I was wondering if it would be possible/beneficial to write a regexp
> compiler that takes a regexp string and compiles it into an Erlang
> module.
>
> You would use it as follows:
>
> regexp2:compile("f*o+o?", myregexp, [native]),
> true = myregexp:matches("foo").
>
> This could be a very fast regexp engine.
>
> What do you think?
>
> Yariv
>
> On 9/26/07, Robert Virding <rvirding@REDACTED> wrote:
> > The regular expression module which is included in the distribution is a
> > simple rather naïve one written in Erlang which uses a back-tracking
> > algorithm directly on the AST. It has all the problems Russ Cox
> mentioned in
> > his paper. I wrote it long ago when no one was interested in parsing
> regexps
> > in Erlang. :-)
> >
> > I am rewriting it at the moment, still in Erlang but using the same
> > algorithm, or similar to the one, which Russ describes. It is about the
> same
> > speed as the existing one for simpler regexps but it can handle all the
> > pathological cases as it should. It will handle POSIX syntax and
> semantics
> > (but no back-references*). There will also be a version which will
> implement
> > a subset of PERL regexps with PERL semantics. Both will work on either
> lists
> > or binaries.
> >
> > I hadn't seen the lambda-the-ultimate discussion you mentioned and will
> > check it out.
> >
> > There are a number of regexp packages for Erlang which use PCRE or other
> > libraries which I don't have references to. Look in the "Not an Erlang
> fan"
> > thread here which gives some references to them.
> >
> > An interesting alternative would be to write the regexp parser/compiler
> in
> > Erlang and have the basic regexp matcher in the emulator. This actually
> > wouldn't be so big, I think. Get the best of both worlds. An I used to
> > complain (still do in fact) that the emulator is to big. :-)
> >
> > Robert
> >
> > * Actually there are two types of POSIX regexps in the standard, one
> which
> > has back-references and one which doesn't.
> >
> >
> >
> > On 25/09/2007, G Bulmer <gbulmer@REDACTED> wrote:
> > > After reading the original thread, I wondered what implementation of
> > > regular expression matching is used by Erlang?
> > >
> > > I ask because this paper points out that many implementations have
> > > bad pathological behaviour:
> > > http://swtch.com/~rsc/regexp/regexp1.html
> > > An observation they make is:
> > > "Regular expression matching can be simple and fast, using finite
> > > automata-based techniques that have been known for decades. In
> > > contrast, Perl, PCRE, Python, Ruby, Java, and many other languages
> > > have regular expression implementations based on recursive
> > > backtracking that are simple but can be excruciatingly slow."
> > >
> > > This is followed by a discussion at http://lambda-the-ultimate.org/
> > > node/2064
> > > Which seems to come to a working implementation.
> > > Usefully, they note that Thompson's original patent has expired, so
> > > there is no obstacle to using the original, and apparently superior,
> > > algorithm
> > >
> > > I looked at 'man regex' on my Mac, and it describes the POSIX
> > > implementation of regex. It does contain the note "Originally written
> > > by Henry Spencer", which is suggested as the basis/culprit of the
> > > pathological regular expression implementations (I should add, as
> > > Henry Spencer is one of my heroes, that he is also credited with TCL
> > > 8 regex, which works very well according to the graphs in the paper).
> > >
> > > I have tried grep'ing through *.c in otp_src for regex and regcomp,
> > > and came up with nothing!
> > >
> > > Does anyone know what is used in Erlang? It may be a relatively small
> > > amount of work to incorporate a new regex to provide more stable
> > > behaviour.
> > >
> > > GB
> > >
> > > On 25 Sep 2007, at 10:00,
> > erlang-questions-request@REDACTED wrote:
> > >
> > > > Date: Mon, 24 Sep 2007 09:20:08 -0700 (PDT)
> > > > From: Thomas Lindgren < thomasl_erlang@REDACTED>
> > > > Subject: Re: [erlang-questions] Not an Erlang fan
> > > > To: "Erlang-Questions \(E-mail\)" < erlang-questions@REDACTED>
> > > > Message-ID:
> > <761781.89097.qm@REDACTED >
> > > > Content-Type: text/plain; charset=iso-8859-1
> > > >
> > > >
> > > > -
> > > > Note that this does not include the actual regexp
> > > > search of each line ... The code to do that could be
> > > > an FSM + dictionary insert instead of the simple
> > > > check-and-accumulate for newline. Extra cost? Unclear.
> > > >
> > > > One could merge that code into the scanning the binary
> > > > but this should ideally be done by a regexp compiler,
> > > > shouldn't it? Not done at this time AFAIK.
> > > >
> > > > Best,
> > > > Thomas
> > >
> > > > Date: Mon, 24 Sep 2007 19:55:12 +0200
> > > > From: Claes Wikstrom <klacke@REDACTED>
> > > > Subject: Re: [erlang-questions] Not an Erlang fan
> > > > To: Bob Ippolito < bob@REDACTED>
> > > > Cc: "Erlang-Questions \(E-mail\)" < erlang-questions@REDACTED>
> > > > Message-ID: < 46F7FA00.8070507@REDACTED>
> > > > Content-Type: text/plain; charset=ISO-8859-1; format=flowed
> > > >
> > > > Bob Ippolito wrote:
> > > >
> > > >> On 9/24/07, Patrick Logan < patrickdlogan@REDACTED> wrote:
> > > >>
> > > >>>>>>
> > http://www.tbray.org/ongoing/When/200x/2007/09/22/Erlang
> > > >>>>>>
> > > >>>>>> Tim Bray might raise some valid points here, even if he's
> > > >>>>>> slightly
> > > >>>>>> biased by his background.
> > > >>>>>>
> > > >
> > > > No, the only fast way today to process a large file line/by/line is
> to
> > > >
> > > > 1. file:open(Filename, [read, raw])
> > > > 2. In a loop {ok, Bin} = file:read(Fd, BufSize),
> > > > 3. Use a binary regex matcher such as
> > > >     http://yaws.hyber.org/download/posregex-1.0.tgz
> > > >
> > > > (I don't know the state of the regex lib in OTP today, last time
> > > >   I looked it sucked bigtime though)
> > > >
> > > > /klacke
> > > >
> > > _______________________________________________
> > > erlang-questions mailing list
> > > erlang-questions@REDACTED
> > > http://www.erlang.org/mailman/listinfo/erlang-questions
> > >
> >
> >
> > _______________________________________________
> > erlang-questions mailing list
> > erlang-questions@REDACTED
> > http://www.erlang.org/mailman/listinfo/erlang-questions
> >
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://erlang.org/pipermail/erlang-questions/attachments/20071003/eca639e3/attachment.htm>


More information about the erlang-questions mailing list