[erlang-questions] Regular expression library (was Not an Erlang fan)

G Bulmer gbulmer@REDACTED
Thu Sep 27 02:41:34 CEST 2007


I have no direct experience of it, but Regal (http:// 
www.cs.queensu.ca/~thurston/ragel/) has a good reputation.
It compiles regular expressions to several different languages, so I  
assume it is designed to support that kind of behaviour.

Regal claims it is used in the Ruby-based web server Mongrel, so any  
one interested in using Erlang to parse HTTP (and I assume HTML)  
might pick that work up too if Regal targeted Erlang!-)

NB: This is *not* an offer to go do the work :-)

Regal compiles a regular expression to a target language. My gossamer- 
thin understanding is Regal has more of a lex feel to it, i.e.  
generate a program that implements the regular expression matcher,  
but with a dynamic language like Erlang, that shouldn't be an issue.

Regal allows actions to be interspersed within the regular expression  
(hence my lex analogy), so a Regal-generated FSM should have the  
advantage that user-actions can be executed before the whole regular  
expression is matched.
With my even sketchier understanding of the core of Tim Bray's  
problem (of scanning continuously generated logs), this sounds like a  
very good match (pardon the pun).

TBH, I am reading my own views into Tim's requirements (but that  
won't stop me:-) I would like to leave a log-file analyser sucking up  
logs continuously, spitting out important stuff for urgent attention  
and maybe filtering the rest into different 'grades', rather than  
waiting for quarter-gig to build up, then picking over that data at a  
later stage.  Apologies Tim, as you can see, I don't understand the  
needs you are reflecting.

GB

On 27 Sep 2007, at 00:07, Yariv Sadan wrote:
> I was wondering if it would be possible/beneficial to write a regexp
> compiler that takes a regexp string and compiles it into an Erlang
> module.
>
> You would use it as follows:
>
> regexp2:compile("f*o+o?", myregexp, [native]),
> true = myregexp:matches("foo").
>
> This could be a very fast regexp engine.
>
> What do you think?
>
> Yariv
>
> On 9/26/07, Robert Virding  wrote:
>> The regular expression module which is included in the  
>> distribution is a
>> simple rather naïve one written in Erlang which uses a back-tracking
>> algorithm directly on the AST. It has all the problems Russ Cox  
>> mentioned in
>> his paper. I wrote it long ago when no one was interested in  
>> parsing regexps
>> in Erlang. :-)
>>
>> I am rewriting it at the moment, still in Erlang but using the same
>> algorithm, or similar to the one, which Russ describes. It is  
>> about the same
>> speed as the existing one for simpler regexps but it can handle  
>> all the
>> pathological cases as it should. It will handle POSIX syntax and  
>> semantics
>> (but no back-references*). There will also be a version which will  
>> implement
>> a subset of PERL regexps with PERL semantics. Both will work on  
>> either lists
>> or binaries.
>>
>> I hadn't seen the lambda-the-ultimate discussion you mentioned and  
>> will
>> check it out.
>>
>> There are a number of regexp packages for Erlang which use PCRE or  
>> other
>> libraries which I don't have references to. Look in the "Not an  
>> Erlang fan"
>> thread here which gives some references to them.
>>
>> An interesting alternative would be to write the regexp parser/ 
>> compiler in
>> Erlang and have the basic regexp matcher in the emulator. This  
>> actually
>> wouldn't be so big, I think. Get the best of both worlds. An I  
>> used to
>> complain (still do in fact) that the emulator is to big. :-)
>>
>> Robert
>>
>> * Actually there are two types of POSIX regexps in the standard,  
>> one which
>> has back-references and one which doesn't.
>>
>>
>>
>> On 25/09/2007, G Bulmer wrote:
>>> After reading the original thread, I wondered what implementation of
>>> regular expression matching is used by Erlang?
>>>
>>> I ask because this paper points out that many implementations have
>>> bad pathological behaviour:
>>> http://swtch.com/~rsc/regexp/regexp1.html
>>> An observation they make is:
>>> "Regular expression matching can be simple and fast, using finite
>>> automata-based techniques that have been known for decades. In
>>> contrast, Perl, PCRE, Python, Ruby, Java, and many other languages
>>> have regular expression implementations based on recursive
>>> backtracking that are simple but can be excruciatingly slow."
>>>
>>> This is followed by a discussion at http://lambda-the-ultimate.org/
>>> node/2064
>>> Which seems to come to a working implementation.
>>> Usefully, they note that Thompson's original patent has expired, so
>>> there is no obstacle to using the original, and apparently superior,
>>> algorithm
>>>
>>> I looked at 'man regex' on my Mac, and it describes the POSIX
>>> implementation of regex. It does contain the note "Originally  
>>> written
>>> by Henry Spencer", which is suggested as the basis/culprit of the
>>> pathological regular expression implementations (I should add, as
>>> Henry Spencer is one of my heroes, that he is also credited with TCL
>>> 8 regex, which works very well according to the graphs in the  
>>> paper).
>>>
>>> I have tried grep'ing through *.c in otp_src for regex and regcomp,
>>> and came up with nothing!
>>>
>>> Does anyone know what is used in Erlang? It may be a relatively  
>>> small
>>> amount of work to incorporate a new regex to provide more stable
>>> behaviour.
>>>
>>> GB
>>>
>>> On 25 Sep 2007, at 10:00,
>> erlang-questions-request@REDACTED wrote:
>>>
>>>> Date: Mon, 24 Sep 2007 09:20:08 -0700 (PDT)
>>>> From: Thomas Lindgren < thomasl_erlang@REDACTED>
>>>> Subject: Re: [erlang-questions] Not an Erlang fan
>>>> To: "Erlang-Questions \(E-mail\)" < erlang-questions@REDACTED>
>>>> Message-ID:
>> <761781.89097.qm@REDACTED >
>>>> Content-Type: text/plain; charset=iso-8859-1
>>>>
>>>>
>>>> -
>>>> Note that this does not include the actual regexp
>>>> search of each line ... The code to do that could be
>>>> an FSM + dictionary insert instead of the simple
>>>> check-and-accumulate for newline. Extra cost? Unclear.
>>>>
>>>> One could merge that code into the scanning the binary
>>>> but this should ideally be done by a regexp compiler,
>>>> shouldn't it? Not done at this time AFAIK.
>>>>
>>>> Best,
>>>> Thomas
>>>
>>>> Date: Mon, 24 Sep 2007 19:55:12 +0200
>>>> From: Claes Wikstrom
>>>> Subject: Re: [erlang-questions] Not an Erlang fan
>>>> To: Bob Ippolito
>>>> Cc: "Erlang-Questions \(E-mail\)" < erlang-questions@REDACTED>
>>>> Message-ID: < 46F7FA00.8070507@REDACTED>
>>>> Content-Type: text/plain; charset=ISO-8859-1; format=flowed
>>>>
>>>> Bob Ippolito wrote:
>>>>
>>>>> On 9/24/07, Patrick Logan wrote:
>>>>>
>>>>>>>>>
>> http://www.tbray.org/ongoing/When/200x/2007/09/22/Erlang
>>>>>>>>>
>>>>>>>>> Tim Bray might raise some valid points here, even if he's
>>>>>>>>> slightly
>>>>>>>>> biased by his background.
>>>>>>>>>
>>>>
>>>> No, the only fast way today to process a large file line/by/line  
>>>> is to
>>>>
>>>> 1. file:open(Filename, [read, raw])
>>>> 2. In a loop {ok, Bin} = file:read(Fd, BufSize),
>>>> 3. Use a binary regex matcher such as
>>>>     http://yaws.hyber.org/download/posregex-1.0.tgz
>>>>
>>>> (I don't know the state of the regex lib in OTP today, last time
>>>>   I looked it sucked bigtime though)
>>>>
>>>> /klacke
>>>>
>>> _______________________________________________
>>> erlang-questions mailing list
>>> erlang-questions@REDACTED
>>> http://www.erlang.org/mailman/listinfo/erlang-questions
>>>
>>
>>
>> _______________________________________________
>> erlang-questions mailing list
>> erlang-questions@REDACTED
>> http://www.erlang.org/mailman/listinfo/erlang-questions
>>




More information about the erlang-questions mailing list