pcre, bifs, drivers and ports

Tue Aug 8 23:57:48 CEST 2006

Seriously I was joking about about back-references, REs are complex 
enough as they are. :-) Also they are easy to implement if you don't 
worry about how other implementations do it.

I managed to find a POSIX RE definition which I think is recent, it is 
from Open Group Base Spec Issue 6 IIE Std 1003.1 2004. Their site 
contained many versions.

Richard, it is as you say AWK REs are Extended REs without the 
back-references. I don't have the book handy so I don't remember if AWK 
bracketed expressions had collating symbols, equivalence classes and 
character class expressions. Regexp doesn't have any of these, nor does 
it have interval expressions. Though why I didn't add interval 
expressions I can't remember. I also notice that I was sometimes lax in 
parsing and allow some undefined constructions, for example the bugs 
mentioned earlier.

Perhaps a better way would be to get rid as many undefined as possible 
and try to make it subset of the standard. Or, perhaps, try to make it 
follow another implementation. Apart from that there are some things I 
would like to add:

1. The missing interval expressions.
2. Accessing sub expressions (but not through back-references) through a 
new match function and in gsub/sub. This would be useful and probably 
efficient. I actually had an implementation that did this but lost it in 
a move. I have copy of the finnish paper and I will see what I can do 
when I understand what they are doing.
3. A compiler. It is not difficult and I already have one in leex that 
just needs re-packaging.

Just some thoughts.

Robert

Richard A. O'Keefe skrev:
> Robert Virding <robert.virding@REDACTED> wrote:
> 	The best way to go continue is to decide what we would like a regexp 
> 	package to be able to do. I used the regexps and interfaces (sub, gsub 
> 	etc) to them from standard AWK which were well defined, stable and 
> 	allowed you to do most things. Perhaps the POSIX definition is the way 
> 	to go.
> 
> HAH!
> 
> This year I've got my 3rd-year Software Engineering class maintaining Mawk.
> Amongst other things this has had me reading the Single Unix Specification,
> 3rd edition (which is pretty close to POSIX) with a microscope.
> 
> One thing we learn from a close reading is that AWK regular expressions
> are one thing (they include "|" but not back-references) and POSIX regular
> expressions are another (or more precisely, two other things: Basic REs
> contain neither "|" not back-references, Extended REs contain both).
> A particularly entertaining (if you like pain) read is the Mac OS X manual
> page for regcomp, which says near the end:
> 
>     There are a number of decisions that IEEE Std 1003.2 ... leaves up to
>     the implementor, either by explicitly saying "undefined" or by virtue
>     of them being forbidden by the RE grammar.
> 
> There then follow (by my count) 13 such decisions.
> 
> I particularly like the BUGS section where we are told:
>    The back-reference code is subtle and doubts linger about its correctness...
>    The regexec function is largely insensitive to RE complexity except
>    that back references are massively expensive. ...
>    Due to a mistake in IEEE STd 1003.2 things like "a)b" are legal ...
>    The standard definition of back references is vague ...
> 
> Since back references turn regexp matching from O(M*N) to NP-hard,
> maybe it's a good thing that AWK regular expressions don't include them
> and Erlang shouldn't either.
> 
>>From the AWK specification, just grepping for 'undefined', we find
>     \0 \00 \000 are all undefined
>     \x is undefined except for a handful of x values;
>        in particular \x is undefined.
>     Matching a string that contains a NUL anywhere is undefined.
>     sub(/regexp/, repl, (x)) is undefined.
> and implicitly, all of the issues identified in the Mac OS X manual
> page apply, and DO vary between AWK implementations.  (Hey, I've been
> writing test cases!)
> 
> 	Most features (for example backreferences) are pretty easy to implement 
> 	when interpreting the regexp or using an NFA but difficult compiled/with 
> 	a DFA. Actually adding backreferences is to regexp is probably easier 
> 	then fixing the bug Fredrik Thulin mentioned. And much more fun. :-)
> 	
> I am reluctant to suggest anything that might break compatiblity, but
> when that compatibility is somewhat tenuous to start with, the difference
> between something that's guaranteed fast (regexps WITHOUT back
> references) and something that's pretty much guaranteed slow *and* unclear
> (regexps WITH back references) begins to matter more.
> 
> I therefore strongly suggest that back references be kept out of Erlang
> regular expressions.  If they really were "pretty easy to implement"
> there would be more agreement about what they are supposed to do.