[erlang-questions] regular expressions (again :)

Richard A. O'Keefe ok@REDACTED
Thu Dec 14 02:08:33 CET 2006


Gaspar Chilingarov <nm@REDACTED> wrote:
	Calling regexp:match or similar functions is too ugly.
	
Why?

	what about following syntax?

It is incandescently ugly, that's what.

The major problem is that the semantics is NOT COMPOSITIONAL:
you cannot determine the meaning of a whole matching expression
in your syntax from the meanings of its parts.  This is an
excellent way to trick people into introducing more bugs than
anyone would want to see.

	String = "http://localhost/script?arg&arg2",
	
	_( TextVariable = RE) should check that pattern matches, so

Why should it be a variable?
Why use the binding operator "=" for a test?

	_(String == "https://") will produce false

In the previous line you had '=', now suddenly it's '=='.

	_(String == "http://(.*)/(.*)$") will produce
	     tuple {"localhost", "script?arg&arg2"} -- i.e. tuple with matched 	
	     elements
	
But the test '==' always produces true or false; you are now using it
to produce false or a tuple, which is rather confusing.

	_(TextVariable =:= RE) should do the same thing as perl m/re/g - i.e. 
	return all matches.

But the difference between _(X == Y) and _(X =:= Y) is in no way
related to the difference between X == Y and X =:= Y (which I already
find hard enough to keep track of).

	_(String =:= "([?&]arg.*)") will return [ {"?arg"}, {"&arg2"} ]
	  in case if matching failed false should be returned.
	
	and the last one is substitution - I suggest to use /= and =/=
	so
	
Suddenly NEGATION means SUBSTITUTION?  That's a mice chips dubloon
if ever there was one.  (:-)

	_(String /= RE) will replace only one (first) match
	_(String =/= RE) will replace all matches in the string.
	
Replace them with WHAT?  Replacement says
    "Replace PATTERN with REPLACEMENT in SUBJECT"
as AWK's
    sub(Pattern, Replacement, Subject)
    gsub(Pattern, Replacement, Subject)
In your syntax I see a pattern (RE) and a subject (String), but
where is the replacement?
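
To make the three-argument shape concrete, here is a minimal sketch using
re:replace/4 from the re module, which postdates this message and is an
assumption here rather than anything the proposal mentions.  The module
name replace_demo and the sample strings are made up for illustration.
Note that subject, pattern AND replacement all have to be supplied.

```erlang
-module(replace_demo).
-export([first/0, all/0]).

%% Replace only the first match, like AWK's sub(Pattern, Replacement, Subject).
%% re:replace/4 takes its arguments in the order Subject, Pattern, Replacement.
first() ->
    re:replace("a-b-c", "-", "+", [{return, list}]).

%% Replace every match, like AWK's gsub(Pattern, Replacement, Subject).
all() ->
    re:replace("a-b-c", "-", "+", [global, {return, list}]).
```

Here replace_demo:first() gives "a+b-c" and replace_demo:all() gives
"a+b+c"; neither call could be written with only two operands.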
	
	_() construct is used, because I found no other short
	constructs, which parse correctly to erlang terms.
	Are there any ideas?
	
How about modifying the tokeniser to add some new operators?

    String ~ Pattern		Does String match Pattern (false/tuple)
    String ~~ Pattern		Return all matches for Pattern in String
				(list of tuples; no match => []).
    String ~ Pattern <- Repl	String with the first instance of Pattern
				(if any) replaced by Repl
    String ~~ Pattern <- Repl	String with all instances of Pattern
				replaced by Repl

<- is already a token.  Adding ~ and ~~ would take about 4 more
lines in erl_scan.erl.  Adding ~ and ~~ to the list of tokens in
erl_parse.yrl would take 1 line; recognising them as regex_op
would take
	regex_op -> '~'  : '$1'.
	regex_op -> '~~' : '$1'.
Recognising them at the same level as comparison operators would require

expr_200 -> expr_300 regex_op expr_300
          : mkop('$1', '$2', '$3').
expr_200 -> expr_300 regex_op expr_300 '<-' expr_300
          : mkop('$1', '$2', '$3', '$5').

mkop(S, {Op,Pos}, P, R) -> {op,Pos,Op,S,P,R}.

What are we up to?  12 lines?  Something like that.
Now all that's left is mapping

    S ~  P 		=> {something like} regexp:match(S, P)
    S ~~ P		=> {something like} regexp:matches(S, P)
    S ~  P <- R		=> {something like} regexp:sub(S, P, R)
    S ~~ P <- R		=> {something like} regexp:gsub(S, P, R)

With not very much work we could arrange for the parser to call
regexp:parse(P) at compile time when P is a string literal.
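
One way the four targets of that mapping might look as ordinary functions
is sketched below.  This is only an illustration under stated assumptions:
the module name rx_sketch is hypothetical, and it is built on the later re
module rather than on regexp, so the result shapes (false or a tuple, a
list of tuples, a plain string) are chosen to follow the operator table
above, not taken from any existing library.

```erlang
-module(rx_sketch).
-export([match/2, matches/2, sub/3, gsub/3]).

%% S ~ P: false when P does not match, else a tuple of the captured groups.
match(S, P) ->
    case re:run(S, P, [{capture, all_but_first, list}]) of
        {match, Groups} -> list_to_tuple(Groups);
        match           -> {};     % defensive: success with nothing captured
        nomatch         -> false
    end.

%% S ~~ P: one tuple of captured groups per match; no match => [].
matches(S, P) ->
    case re:run(S, P, [global, {capture, all_but_first, list}]) of
        {match, Ms} -> [list_to_tuple(Groups) || Groups <- Ms];
        nomatch     -> []
    end.

%% S ~ P <- R: S with the first instance of P (if any) replaced by R.
sub(S, P, R) ->
    re:replace(S, P, R, [{return, list}]).

%% S ~~ P <- R: S with all instances of P replaced by R.
gsub(S, P, R) ->
    re:replace(S, P, R, [global, {return, list}]).
```

With String = "http://localhost/script?arg&arg2", the call
rx_sketch:match(String, "http://(.*)/(.*)$") returns
{"localhost","script?arg&arg2"}, and
rx_sketch:matches(String, "([?&]arg[^&]*)") returns [{"?arg"},{"&arg2"}].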

	Writing something like re:r(String == "http://(.*)/(.*)$") is also 
	possible, but requires more typing.
	
It is still remarkably ugly, because it is not compositional.

There is probably a better way, which is why I'm not going to provide
patches to do this, although I easily could.  For example, what would
happen if we tried to assimilate regular expression matching to
pattern matching?  What if we could write patterns like this:

    "https://" ++ _ = String		% Hey, we DO have this!
    "http://" ++ Host ++ "/" ++ Path = String

No, this is NOT a suggestion either.  It's just an observation that
there are more places to look than at the back of the Perl manual.
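
For the record, the first of those two lines works because the left
operand of ++ in a pattern may be a literal string.  The second is not
legal Erlang, since Host would be an unbound variable on the left of ++.
A minimal sketch of the form that does work, with a made-up module name
(prefix_demo) and function name (scheme/1):

```erlang
-module(prefix_demo).
-export([scheme/1]).

%% A literal string prefix followed by ++ is an ordinary Erlang pattern,
%% so the URL scheme can be tested in a function head with no regexp at all.
scheme("https://" ++ _Rest) -> https;
scheme("http://" ++ _Rest)  -> http;
scheme(_Other)              -> unknown.
```

prefix_demo:scheme("http://localhost/script?arg&arg2") returns http.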


