That's not too difficult to do at all. The speed you get would depend on which regexp family you do it for.<br><br>Robert<br><br><div><span class="gmail_quote">On 27/09/2007, <b class="gmail_sendername">Yariv Sadan</b>

 <<a href="mailto:yarivsadan@gmail.com">yarivsadan@gmail.com</a>> wrote:</span><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">I was wondering if it would be possible/beneficial to write a regexp

compiler that takes a regexp string and compiles it into an Erlang module. You would use it as follows: regexp2:compile("f*o+o?", myregexp, [native]), true = myregexp:matches("foo").

<br><br>This could be a very fast regexp engine.<br><br>What do you think?<br><br>Yariv<br><br>On 9/26/07, Robert Virding <<a href="mailto:rvirding@gmail.com">rvirding@gmail.com</a>> wrote:<br>> The regular expression module which is included in the distribution is a

<br>> simple rather naīve one written in Erlang which uses a back-tracking<br>> algorithm directly on the AST. It has all the problems Russ Cox mentioned in<br>> his paper. I wrote it long ago when no one was interested in parsing regexps

<br>> in Erlang. :-)<br>><br>> I am rewriting it at the moment, still in Erlang but using the same<br>> algorithm, or similar to the one, which Russ describes. It is about the same<br>> speed as the existing one for simpler regexps but it can handle all the

<br>> pathological cases as it should. It will handle POSIX syntax and semantics<br>> (but no back-references*). There will also be a version which will implement<br>> a subset of PERL regexps with PERL semantics. Both will work on either lists

<br>> or binaries.<br>><br>> I hadn't seen the lambda-the-ultimate discussion you mentioned and will<br>> check it out.<br>><br>> There are a number of regexp packages for Erlang which use PCRE or other

<br>> libraries which I don't have references to. Look in the "Not an Erlang fan"<br>> thread here which gives some references to them.<br>><br>> An interesting alternative would be to write the regexp parser/compiler in

<br>> Erlang and have the basic regexp matcher in the emulator. This actually<br>> wouldn't be so big, I think. Get the best of both worlds. An I used to<br>> complain (still do in fact) that the emulator is to big. :-)

<br>><br>> Robert<br>><br>> * Actually there are two types of POSIX regexps in the standard, one which<br>> has back-references and one which doesn't.<br>><br>><br>><br>> On 25/09/2007, G Bulmer <

<a href="mailto:gbulmer@gmail.com">gbulmer@gmail.com</a>> wrote:<br>> > After reading the original thread, I wondered what implementation of<br>> > regular expression matching is used by Erlang?<br>> >

<br>> > I ask because this paper points out that many implementations have<br>> > bad pathological behaviour:<br>> > <a href="http://swtch.com/~rsc/regexp/regexp1.html">http://swtch.com/~rsc/regexp/regexp1.html

</a><br>> > An observation they make is:<br>> > "Regular expression matching can be simple and fast, using finite<br>> > automata-based techniques that have been known for decades. In<br>> > contrast, Perl, PCRE, Python, Ruby, Java, and many other languages

<br>> > have regular expression implementations based on recursive<br>> > backtracking that are simple but can be excruciatingly slow."<br>> ><br>> > This is followed by a discussion at <a href="http://lambda-the-ultimate.org/">

http://lambda-the-ultimate.org/</a><br>> > node/2064<br>> > Which seems to come to a working implementation.<br>> > Usefully, they note that Thompson's original patent has expired, so<br>> > there is no obstacle to using the original, and apparently superior,

<br>> > algorithm<br>> ><br>> > I looked at 'man regex' on my Mac, and it describes the POSIX<br>> > implementation of regex. It does contain the note "Originally written<br>> > by Henry Spencer", which is suggested as the basis/culprit of the

> > pathological regular expression implementations (I should add, as > > Henry Spencer is one of my heroes, that he is also credited with TCL > > 8 regex, which works very well according to the graphs in the paper).

<br>> ><br>> > I have tried grep'ing through *.c in otp_src for regex and regcomp,<br>> > and came up with nothing!<br>> ><br>> > Does anyone know what is used in Erlang? It may be a relatively small

<br>> > amount of work to incorporate a new regex to provide more stable<br>> > behaviour.<br>> ><br>> > GB<br>> ><br>> > On 25 Sep 2007, at 10:00,<br>> <a href="mailto:erlang-questions-request@erlang.org">

erlang-questions-request@erlang.org</a> wrote:<br>> ><br>> > > Date: Mon, 24 Sep 2007 09:20:08 -0700 (PDT)<br>> > > From: Thomas Lindgren < <a href="mailto:thomasl_erlang@yahoo.com">thomasl_erlang@yahoo.com

</a>><br>> > > Subject: Re: [erlang-questions] Not an Erlang fan<br>> > > To: "Erlang-Questions \(E-mail\)" < <a href="mailto:erlang-questions@erlang.org">erlang-questions@erlang.org</a>>

<br>> > > Message-ID:<br>> <<a href="mailto:761781.89097.qm@web38802.mail.mud.yahoo.com">761781.89097.qm@web38802.mail.mud.yahoo.com</a> ><br>> > > Content-Type: text/plain; charset=iso-8859-1<br>

> > ><br>> > ><br>> > > -<br>> > > Note that this does not include the actual regexp<br>> > > search of each line ... The code to do that could be<br>> > > an FSM + dictionary insert instead of the simple

<br>> > > check-and-accumulate for newline. Extra cost? Unclear.<br>> > ><br>> > > One could merge that code into the scanning the binary<br>> > > but this should ideally be done by a regexp compiler,

<br>> > > shouldn't it? Not done at this time AFAIK.<br>> > ><br>> > > Best,<br>> > > Thomas<br>> ><br>> > > Date: Mon, 24 Sep 2007 19:55:12 +0200<br>> > > From: Claes Wikstrom <

<a href="mailto:klacke@hyber.org">klacke@hyber.org</a>><br>> > > Subject: Re: [erlang-questions] Not an Erlang fan<br>> > > To: Bob Ippolito < <a href="mailto:bob@redivi.com">bob@redivi.com</a>>

<br>> > > Cc: "Erlang-Questions \(E-mail\)" < <a href="mailto:erlang-questions@erlang.org">erlang-questions@erlang.org</a>><br>> > > Message-ID: < <a href="mailto:46F7FA00.8070507@hyber.org">

46F7FA00.8070507@hyber.org</a>><br>> > > Content-Type: text/plain; charset=ISO-8859-1; format=flowed<br>> > ><br>> > > Bob Ippolito wrote:<br>> > ><br>> > >> On 9/24/07, Patrick Logan < 

<a href="mailto:patrickdlogan@gmail.com">patrickdlogan@gmail.com</a>> wrote:<br>> > >><br>> > >>>>>><br>> <a href="http://www.tbray.org/ongoing/When/200x/2007/09/22/Erlang">http://www.tbray.org/ongoing/When/200x/2007/09/22/Erlang

</a><br>> > >>>>>><br>> > >>>>>> Tim Bray might raise some valid points here, even if he's<br>> > >>>>>> slightly<br>> > >>>>>> biased by his background.

<br>> > >>>>>><br>> > ><br>> > > No, the only fast way today to process a large file line/by/line is to<br>> > ><br>> > > 1. file:open(Filename, [read, raw])<br>

> > > 2. In a loop {ok, Bin} = file:read(Fd, BufSize),<br>> > > 3. Use a binary regex matcher such as<br>> > >     <a href="http://yaws.hyber.org/download/posregex-1.0.tgz">http://yaws.hyber.org/download/posregex-1.0.tgz

</a><br>> > ><br>> > > (I don't know the state of the regex lib in OTP today, last time<br>> > >   I looked it sucked bigtime though)<br>> > ><br>> > > /klacke<br>> > >

<br>> > _______________________________________________<br>> > erlang-questions mailing list<br>> > <a href="mailto:erlang-questions@erlang.org">erlang-questions@erlang.org</a><br>> > <a href="http://www.erlang.org/mailman/listinfo/erlang-questions">

http://www.erlang.org/mailman/listinfo/erlang-questions</a><br>> ><br>><br>><br>> _______________________________________________<br>> erlang-questions mailing list<br>> <a href="mailto:erlang-questions@erlang.org">

erlang-questions@erlang.org</a><br>> <a href="http://www.erlang.org/mailman/listinfo/erlang-questions">http://www.erlang.org/mailman/listinfo/erlang-questions</a><br>><br></blockquote></div><br>