[erlang-questions] some language changes

ok ok@REDACTED
Wed May 23 02:04:45 CEST 2007


On 21 May 2007, at 10:24 pm, Joe Armstrong wrote:
   No - it changes the *syntax* of a string - normally you have to quote
   backslashes

   Suppose I had a *simple* regexp like this:

      [.?!][]\"')}]*\\($\\| $\\|\t\\|  \\)[ \t\n]*

   I'd like to say

      X = regexp "[.?!][]\"')}]*\\($\\| $\\|\t\\|  \\)[ \t\n]*"

   and not the more obvious

      X =  "[.?!][]\\"')}]*\\\\($\\\\| $\\\\|\t\\\\|  \\\\)[ \\t\\n]*"

Let's look at this in two different ways.
First, let's break the regexp up:

      [.?!][]\"')}]*\\($\\| $\\|\t\\|  \\)[ \t\n]*
      AAAAABBBBBBBBBCC(IJJ| KLL|MMOO|RSTT)UUUUUUUU

Note that regular expression syntax has weird quoting of its own.
(One reason I want to write regexps in Erlang AS Erlang!)
It looks as though [].... starts with an empty set, but in fact the
right bracket is an element of the set.  It looks as though there
are lots of backslashes (CC,JJ,LL,OO,TT), but on close
inspection (how HARD this is to read!) this appears to presuppose
a regular expression syntax in which the special meaning of (|) has
to be turned ON with backslashes, instead of the usual syntax where
backslashes turn the special meaning OFF.  (Just more reason NOT to
want this construction.  Which of the many regular expression
syntaxes do we actually get?)  Not only that, instead of \( \| \)
we find doubled backslashes!  I don't know any regexp syntax that
requires \\(...\\|...\\) for an alternation, and if I did, I would
not want to use it.

So here's how I would like to write that:

     X = seq(any(".?!"), span("\"')]}"),
              alt("\n", "\t", seq(" ",any(" \t\n"))),
              span(" \t\n"))

This is BETTER than fancy regexp syntax, because it's just normal
Erlang syntax that can include *any* computations we find useful.
For example,

	Stops = ".?!",
	Closers = "\"')]}",
	BigSpace = "\t\n",
	Space = " "++BigSpace,
	X = seq(any(Stops),span(Closers),
	        alt(any(BigSpace),seq(" ",any(Space))), span(Space))
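
For concreteness, here is a minimal sketch of how such combinators might be
defined as ordinary Erlang functions.  None of this is an existing library:
the module name re_sketch, the list-taking seq/1 and alt/1 (standing in for
the variadic calls above), the reading of span as "zero or more characters
from the set", and the match/2 function are all assumptions made for
illustration.

```erlang
-module(re_sketch).
-export([any/1, span/1, seq/1, alt/1, match/2, demo/0]).

%% Constructors: a pattern is just an Erlang term (hypothetical representation).
any(Chars)  -> {any, Chars}.    % exactly one character from Chars
span(Chars) -> {span, Chars}.   % assumed: zero or more characters from Chars
seq(Ps)     -> {seq, Ps}.       % each pattern in Ps, one after another
alt(Ps)     -> {alt, Ps}.       % any one of the patterns in Ps

%% match(Pattern, String) -> {ok, Rest} | nomatch
%% Matches Pattern against a prefix of String.
match(P, S) ->
    case m(P, S) of
        [Rest | _] -> {ok, Rest};
        []         -> nomatch
    end.

%% m/2 returns every possible remainder, so seq can backtrack.
m({any, Cs}, [C | T]) ->
    case lists:member(C, Cs) of
        true  -> [T];
        false -> []
    end;
m({any, _}, []) ->
    [];
m({span, Cs}, S) ->
    spans(Cs, S);
m({seq, []}, S) ->
    [S];
m({seq, [P | Ps]}, S) ->
    [R2 || R1 <- m(P, S), R2 <- m({seq, Ps}, R1)];
m({alt, Ps}, S) ->
    lists:append([m(P, S) || P <- Ps]);
m(Lit, S) when is_list(Lit) ->
    %% A plain string matches itself literally.
    case lists:prefix(Lit, S) of
        true  -> [lists:nthtail(length(Lit), S)];
        false -> []
    end.

%% Remainders after consuming zero or more characters from Cs, longest run first.
spans(Cs, [C | T] = S) ->
    case lists:member(C, Cs) of
        true  -> spans(Cs, T) ++ [S];
        false -> [S]
    end;
spans(_, []) ->
    [[]].

%% The pattern from the text, built from named pieces.
demo() ->
    Stops = ".?!",
    Closers = "\"')]}",
    BigSpace = "\t\n",
    Space = " " ++ BigSpace,
    X = seq([any(Stops), span(Closers),
             alt([any(BigSpace), seq([" ", any(Space)])]),
             span(Space)]),
    {match(X, "!\")  Next sentence"), match(X, "no stop here")}.
```

With these definitions, re_sketch:demo() returns
{{ok,"Next sentence"},nomatch}.  The point of the sketch is only that the
pattern is built from ordinary values, so pieces such as Stops and Space can
be named, computed, and reused.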

Second, suppose for some reason we don't like function calls, and we
do like regular expression syntax, with all the backslashes *that*
requires.  Let me introduce you to the idea of a preprocessor.

Our input syntax will be

	/<stuff>/

on one line by itself, possibly followed by a comma
or semicolon, possibly followed by a comment.
We want to replace this by

	regexp:compile("<stuff'>")

where <stuff'> is <stuff> with appropriate backslashes added,
putting the comma or semicolon back if there was one.
What do we need to quote?
	- double quotes
	- backslashes
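
The quoting rule in that list is small enough to state as code.  A minimal
sketch in Erlang, where the module name quote_sketch and the functions
quote/1 and wrap/1 are hypothetical names chosen for illustration (only
regexp:compile comes from the text above):

```erlang
-module(quote_sketch).
-export([quote/1, wrap/1]).

%% Put a backslash in front of every double quote and every backslash,
%% so the result can sit inside an Erlang string literal.
quote(Stuff) ->
    lists:flatmap(fun escape/1, Stuff).

escape($\") -> [$\\, $\"];
escape($\\) -> [$\\, $\\];
escape(C)   -> [C].

%% Build the replacement text regexp:compile("<stuff'>") from <stuff>.
wrap(Stuff) ->
    "regexp:compile(\"" ++ quote(Stuff) ++ "\")".
```

Given the four characters a\(b, wrap/1 produces the text
regexp:compile("a\\(b"), in which the one backslash of the input has been
doubled.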

Here we go.

	#!/bin/awk -f

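	# A line holding only /<stuff>/, then an optional , or ; and an optional % comment.
	$0 ~ /^[ \t]*\/.*\/[,;]?[ \t]*(%.*)?$/ {
	    x = $0
	    # Keep the leading whitespace in head and drop the opening slash.
	    match(x, /^[ \t]*\//)
	    head = substr(x, 1, RLENGTH-1)
	    x = substr(x, RLENGTH+1)
	    # Drop trailing whitespace and any % comment.
	    sub(/[ \t]*(%.*)?$/, "", x)
	    # Remember a trailing , or ; so it can be put back.
	    if (x ~ /[,;]$/) {
		tail = substr(x, length(x), 1)
		x = substr(x, 1, length(x) - 1)
	    } else {
		tail = ""
	    }
	    # Drop the closing slash.
	    sub(/\/$/, "", x)
	    # Put a backslash in front of every double quote and backslash.
	    gsub(/["\\]/, "\\\\&", x)
	    print head "regexp:compile(\"" x "\")" tail
	    next
	}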
	{
	    print
	}

I have tested this on some small examples and it seems to work.  It took
some doing, precisely because regular expression syntax is so hard to
work with, compared with normal Erlang syntax.

This preprocessor is just 16 SLOC of AWK.  For *THIS* we are to
make Erlang lexical structure more complicated and to break editor
support for the language?



