[erlang-questions] regexp sux! (but perhaps less now)

Fri Jun 15 00:32:25 CEST 2007

Bengt Kleberg wrote:
> On 2007-06-04 22:34, Robert Virding wrote:
>> tobbe wrote:
>>> 1> re:match("now/plus42hours/","^now/(plus|minus)(\d{1,2})hours/$").
>>> nomatch
>>> 2> re:smatch("now/plus42hours/","^now/(plus|minus)([[:alnum:]])hours/$").
>>> nomatch
>>> 3> re:smatch("now/plus42hours/","now/(plus|minus)([[:alnum:]])hours/").
>>> nomatch
>> OK:
>> 1) \d is a PERLism and as I wrote I only support POSIX style regexps. As 
>> the regexp is a string it would have to be "\\d" as the '\' needs to be 
> 
> if somebody is interested in something else than ''normal regular 
> expressions'' (where normal is awk, sed, posix, perl, etc) i can recommend
> http://www.scsh.net/docu/html/man-Z-H-7.html#node_idx_1178
> 
> it is regexp for the scheme shell. it has s-expressions instead of 
> strings. i find it easier to use when the regular expression goes beyond 
> that which is possible to do with strstr and friends.

Sorry for taking so long to answer this.

This is definitely interesting. What it describes is along the same lines 
as what Richard O'Keefe was suggesting, defining the regular expression 
with a structure instead of with a string. They wrap the s-expr form 
with a read macro which parses the s-expr and builds an internal 
representation. One interesting point is that when matching it does not 
return an explicit structure with the results of the match, but instead 
an ADT with a set of access functions.

One benefit of doing this is that, as the internal structure of the ADT 
is undefined and the data is only accessible through the access 
functions, you are free to change the internals. The downside is not 
being able to pattern match on the result. What do people feel is the 
best way to go?
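
To make the trade-off concrete, here is a minimal sketch of the ADT 
style. The module name match_adt and the functions run/2, whole/1 and 
group/2 are hypothetical, invented for this illustration. It also 
assumes the PCRE-based re:run/3 from current Erlang/OTP, which is not 
the POSIX-style module discussed in this thread.

```erlang
-module(match_adt).
-export([run/2, whole/1, group/2]).
-export_type([match/0]).

%% Hypothetical ADT: the representation is opaque, so callers are
%% expected to go through whole/1 and group/2 only. That leaves the
%% internals free to change later.
-opaque match() :: {match_adt, [string()]}.

%% Assumption: re:run/3 from current Erlang/OTP (PCRE syntax).
-spec run(string(), string()) -> {ok, match()} | nomatch.
run(Subject, Regexp) ->
    case re:run(Subject, Regexp, [{capture, all, list}]) of
        {match, Captured} -> {ok, {match_adt, Captured}};
        nomatch -> nomatch
    end.

%% The text matched by the whole expression.
-spec whole(match()) -> string().
whole({match_adt, [Whole | _]}) -> Whole.

%% The text matched by the N:th parenthesised group (1-based).
-spec group(pos_integer(), match()) -> string().
group(N, {match_adt, [_Whole | Groups]}) -> lists:nth(N, Groups).
```

With the explicit-structure style the caller could instead write 
{match, [_, Sign, Hours]} = ... and bind everything in one pattern, 
which is exactly what the opaque version gives up.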

I rather like having both the string form for a regular expression and a 
structural representation. It is easier to make it beautiful in Lisp, I 
think. For Erlang we could either use terms directly or have a more 
functional way as Richard described. So instead of "[a-c]*|z+" you could 
have:

{alt,{'*',{cc,"a-c"}},{'+',{c,$z}}}

or

alt('*'(cc("a-c")),'+'($z))

Can't think of better names for the closures right now, using kclosure 
and pclosure seems so long.
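
As a rough sketch of how the two forms could fit together: the 
constructor functions below just build the tuples shown above, and 
to_string/1 renders such a term back into the string form. The module 
name rx_sketch and every function in it are hypothetical; it covers 
only the four constructs used in the example (no concatenation), does 
no escaping of special characters, and uses c($z) rather than a bare 
$z so that the result matches the tuple form.

```erlang
-module(rx_sketch).
-export([alt/2, kclosure/1, pclosure/1, cc/1, c/1, to_string/1]).

%% Hypothetical constructors: each one only builds the matching tuple.
alt(A, B)   -> {alt, A, B}.
kclosure(R) -> {'*', R}.      % Kleene closure, zero or more
pclosure(R) -> {'+', R}.      % positive closure, one or more
cc(Chars)   -> {cc, Chars}.   % character class, e.g. "a-c"
c(Char)     -> {c, Char}.     % single literal character

%% Render a term in the string form. No escaping is attempted.
to_string({alt, A, B}) -> to_string(A) ++ "|" ++ to_string(B);
to_string({'*', R})    -> operand(R) ++ "*";
to_string({'+', R})    -> operand(R) ++ "+";
to_string({cc, Chars}) -> "[" ++ Chars ++ "]";
to_string({c, Char})   -> [Char].

%% An alternation under a closure needs parentheses in string form.
operand({alt, _, _} = R) -> "(" ++ to_string(R) ++ ")";
operand(R)               -> to_string(R).
```

Here rx_sketch:to_string({alt,{'*',{cc,"a-c"}},{'+',{c,$z}}}) gives 
back "[a-c]*|z+", and since the term is plain data it can still be 
pattern matched or built up programmatically.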

Robert