[erlang-questions] Regular expression library (was Not an Erlang fan)

Yariv Sadan <>
Thu Sep 27 01:07:08 CEST 2007


I was wondering if it would be possible/beneficial to write a regexp
compiler that takes a regexp string and compiles it into an Erlang
module.

You would use it as follows:

regexp2:compile("f*o+o?", myregexp, [native]),
true = myregexp:matches("foo").

This could be a very fast regexp engine.

What do you think?

Yariv

On 9/26/07, Robert Virding <> wrote:
> The regular expression module which is included in the distribution is a
> simple rather naïve one written in Erlang which uses a back-tracking
> algorithm directly on the AST. It has all the problems Russ Cox mentioned in
> his paper. I wrote it long ago when no one was interested in parsing regexps
> in Erlang. :-)
>
> I am rewriting it at the moment, still in Erlang but using the same
> algorithm, or similar to the one, which Russ describes. It is about the same
> speed as the existing one for simpler regexps but it can handle all the
> pathological cases as it should. It will handle POSIX syntax and semantics
> (but no back-references*). There will also be a version which will implement
> a subset of PERL regexps with PERL semantics. Both will work on either lists
> or binaries.
>
> I hadn't seen the lambda-the-ultimate discussion you mentioned and will
> check it out.
>
> There are a number of regexp packages for Erlang which use PCRE or other
> libraries which I don't have references to. Look in the "Not an Erlang fan"
> thread here which gives some references to them.
>
> An interesting alternative would be to write the regexp parser/compiler in
> Erlang and have the basic regexp matcher in the emulator. This actually
> wouldn't be so big, I think. Get the best of both worlds. An I used to
> complain (still do in fact) that the emulator is to big. :-)
>
> Robert
>
> * Actually there are two types of POSIX regexps in the standard, one which
> has back-references and one which doesn't.
>
>
>
> On 25/09/2007, G Bulmer <> wrote:
> > After reading the original thread, I wondered what implementation of
> > regular expression matching is used by Erlang?
> >
> > I ask because this paper points out that many implementations have
> > bad pathological behaviour:
> > http://swtch.com/~rsc/regexp/regexp1.html
> > An observation they make is:
> > "Regular expression matching can be simple and fast, using finite
> > automata-based techniques that have been known for decades. In
> > contrast, Perl, PCRE, Python, Ruby, Java, and many other languages
> > have regular expression implementations based on recursive
> > backtracking that are simple but can be excruciatingly slow."
> >
> > This is followed by a discussion at http://lambda-the-ultimate.org/
> > node/2064
> > Which seems to come to a working implementation.
> > Usefully, they note that Thompson's original patent has expired, so
> > there is no obstacle to using the original, and apparently superior,
> > algorithm
> >
> > I looked at 'man regex' on my Mac, and it describes the POSIX
> > implementation of regex. It does contain the note "Originally written
> > by Henry Spencer", which is suggested as the basis/culprit of the
> > pathological regular expression implementations (I should add, as
> > Henry Spencer is one of my heroes, that he is also credited with TCL
> > 8 regex, which works very well according to the graphs in the paper).
> >
> > I have tried grep'ing through *.c in otp_src for regex and regcomp,
> > and came up with nothing!
> >
> > Does anyone know what is used in Erlang? It may be a relatively small
> > amount of work to incorporate a new regex to provide more stable
> > behaviour.
> >
> > GB
> >
> > On 25 Sep 2007, at 10:00,
>  wrote:
> >
> > > Date: Mon, 24 Sep 2007 09:20:08 -0700 (PDT)
> > > From: Thomas Lindgren < >
> > > Subject: Re: [erlang-questions] Not an Erlang fan
> > > To: "Erlang-Questions \(E-mail\)" < >
> > > Message-ID:
> < >
> > > Content-Type: text/plain; charset=iso-8859-1
> > >
> > >
> > > -
> > > Note that this does not include the actual regexp
> > > search of each line ... The code to do that could be
> > > an FSM + dictionary insert instead of the simple
> > > check-and-accumulate for newline. Extra cost? Unclear.
> > >
> > > One could merge that code into the scanning the binary
> > > but this should ideally be done by a regexp compiler,
> > > shouldn't it? Not done at this time AFAIK.
> > >
> > > Best,
> > > Thomas
> >
> > > Date: Mon, 24 Sep 2007 19:55:12 +0200
> > > From: Claes Wikstrom <>
> > > Subject: Re: [erlang-questions] Not an Erlang fan
> > > To: Bob Ippolito < >
> > > Cc: "Erlang-Questions \(E-mail\)" < >
> > > Message-ID: < >
> > > Content-Type: text/plain; charset=ISO-8859-1; format=flowed
> > >
> > > Bob Ippolito wrote:
> > >
> > >> On 9/24/07, Patrick Logan < > wrote:
> > >>
> > >>>>>>
> http://www.tbray.org/ongoing/When/200x/2007/09/22/Erlang
> > >>>>>>
> > >>>>>> Tim Bray might raise some valid points here, even if he's
> > >>>>>> slightly
> > >>>>>> biased by his background.
> > >>>>>>
> > >
> > > No, the only fast way today to process a large file line/by/line is to
> > >
> > > 1. file:open(Filename, [read, raw])
> > > 2. In a loop {ok, Bin} = file:read(Fd, BufSize),
> > > 3. Use a binary regex matcher such as
> > >     http://yaws.hyber.org/download/posregex-1.0.tgz
> > >
> > > (I don't know the state of the regex lib in OTP today, last time
> > >   I looked it sucked bigtime though)
> > >
> > > /klacke
> > >
> > _______________________________________________
> > erlang-questions mailing list
> > 
> > http://www.erlang.org/mailman/listinfo/erlang-questions
> >
>
>
> _______________________________________________
> erlang-questions mailing list
> 
> http://www.erlang.org/mailman/listinfo/erlang-questions
>



More information about the erlang-questions mailing list