pcre, bifs, drivers and ports

Richard A. O'Keefe ok@REDACTED
Mon Aug 7 04:09:53 CEST 2006

Robert Virding <robert.virding@REDACTED> wrote:
	The best way to go continue is to decide what we would like a regexp 
	package to be able to do. I used the regexps and interfaces (sub, gsub 
	etc) to them from standard AWK which were well defined, stable and 
	allowed you to do most things. Perhaps the POSIX definition is the way 
	to go.


This year I've got my 3rd-year Software Engineering class maintaining Mawk.
Amongst other things this has had me reading the Single Unix Specification,
3rd edition (which is pretty close to POSIX) with a microscope.

One thing we learn from a close reading is that AWK regular expressions
are one thing (they include "|" but not back-references) and POSIX regular
expressions are another (or more precisely, two other things: Basic REs
contain neither "|" not back-references, Extended REs contain both).
A particularly entertaining (if you like pain) read is the Mac OS X manual
page for regcomp, which says near the end:

    There are a number of decisions that IEEE Std 1003.2 ... leaves up to
    the implementor, either by explicitly saying "undefined" or by virtue
    of them being forbidden by the RE grammar.

There then follow (by my count) 13 such decisions.

I particularly like the BUGS section where we are told:
   The back-reference code is subtle and doubts linger about its correctness...
   The regexec function is largely insensitive to RE complexity except
   that back references are massively expensive. ...
   Due to a mistake in IEEE STd 1003.2 things like "a)b" are legal ...
   The standard definition of back references is vague ...

Since back references turn regexp matching from O(M*N) to NP-hard,
maybe it's a good thing that AWK regular expressions don't include them
and Erlang shouldn't either.

>From the AWK specification, just grepping for 'undefined', we find
    \0 \00 \000 are all undefined
    \x is undefined except for a handful of x values;
       in particular \x is undefined.
    Matching a string that contains a NUL anywhere is undefined.
    sub(/regexp/, repl, (x)) is undefined.
and implicitly, all of the issues identified in the Mac OS X manual
page apply, and DO vary between AWK implementations.  (Hey, I've been
writing test cases!)

	Most features (for example backreferences) are pretty easy to implement 
	when interpreting the regexp or using an NFA but difficult compiled/with 
	a DFA. Actually adding backreferences is to regexp is probably easier 
	then fixing the bug Fredrik Thulin mentioned. And much more fun. :-)
I am reluctant to suggest anything that might break compatiblity, but
when that compatibility is somewhat tenuous to start with, the difference
between something that's guaranteed fast (regexps WITHOUT back
references) and something that's pretty much guaranteed slow *and* unclear
(regexps WITH back references) begins to matter more.

I therefore strongly suggest that back references be kept out of Erlang
regular expressions.  If they really were "pretty easy to implement"
there would be more agreement about what they are supposed to do.

More information about the erlang-questions mailing list