[erlang-questions] Regular expression library (was Not an Erlang fan)

Tue Sep 25 21:27:41 CEST 2007

After reading the original thread, I wondered what implementation of  
regular expression matching is used by Erlang?

I ask because this paper points out that many implementations have  
bad pathological behaviour:
http://swtch.com/~rsc/regexp/regexp1.html
An observation they make is:
"Regular expression matching can be simple and fast, using finite  
automata-based techniques that have been known for decades. In  
contrast, Perl, PCRE, Python, Ruby, Java, and many other languages  
have regular expression implementations based on recursive  
backtracking that are simple but can be excruciatingly slow."

This is followed by a discussion at http://lambda-the-ultimate.org/ 
node/2064
Which seems to come to a working implementation.
Usefully, they note that Thompson's original patent has expired, so  
there is no obstacle to using the original, and apparently superior,  
algorithm

I looked at 'man regex' on my Mac, and it describes the POSIX  
implementation of regex. It does contain the note "Originally written  
by Henry Spencer", which is suggested as the basis/culprit of the  
pathological regular expression implementations (I should add, as  
Henry Spencer is one of my heroes, that he is also credited with TCL  
8 regex, which works very well according to the graphs in the paper).

I have tried grep'ing through *.c in otp_src for regex and regcomp,  
and came up with nothing!

Does anyone know what is used in Erlang? It may be a relatively small  
amount of work to incorporate a new regex to provide more stable  
behaviour.

GB

On 25 Sep 2007, at 10:00, erlang-questions-request@REDACTED wrote:

> Date: Mon, 24 Sep 2007 09:20:08 -0700 (PDT)
> From: Thomas Lindgren <thomasl_erlang@REDACTED>
> Subject: Re: [erlang-questions] Not an Erlang fan
> To: "Erlang-Questions \(E-mail\)" <erlang-questions@REDACTED>
> Message-ID: <761781.89097.qm@REDACTED>
> Content-Type: text/plain; charset=iso-8859-1
>
>
> -
> Note that this does not include the actual regexp
> search of each line ... The code to do that could be
> an FSM + dictionary insert instead of the simple
> check-and-accumulate for newline. Extra cost? Unclear.
>
> One could merge that code into the scanning the binary
> but this should ideally be done by a regexp compiler,
> shouldn't it? Not done at this time AFAIK.
>
> Best,
> Thomas

> Date: Mon, 24 Sep 2007 19:55:12 +0200
> From: Claes Wikstrom <klacke@REDACTED>
> Subject: Re: [erlang-questions] Not an Erlang fan
> To: Bob Ippolito <bob@REDACTED>
> Cc: "Erlang-Questions \(E-mail\)" <erlang-questions@REDACTED>
> Message-ID: <46F7FA00.8070507@REDACTED>
> Content-Type: text/plain; charset=ISO-8859-1; format=flowed
>
> Bob Ippolito wrote:
>
>> On 9/24/07, Patrick Logan <patrickdlogan@REDACTED> wrote:
>>
>>>>>>  http://www.tbray.org/ongoing/When/200x/2007/09/22/Erlang
>>>>>>
>>>>>> Tim Bray might raise some valid points here, even if he's  
>>>>>> slightly
>>>>>> biased by his background.
>>>>>>
>
> No, the only fast way today to process a large file line/by/line is to
>
> 1. file:open(Filename, [read, raw])
> 2. In a loop {ok, Bin} = file:read(Fd, BufSize),
> 3. Use a binary regex matcher such as
>     http://yaws.hyber.org/download/posregex-1.0.tgz
>
> (I don't know the state of the regex lib in OTP today, last time
>   I looked it sucked bigtime though)
>
> /klacke
>