[erlang-questions] some language changes

ok <>
Fri Jun 1 05:09:40 CEST 2007

I mentioned the Eiffel verbatim string syntax:
	"{
	<verbatim stuff>
	}"

On 1 Jun 2007, at 9:57 am, Robert Virding wrote:
> I would have no trouble accepting either just as long as you have  
> NO QUOTING at all. Not PHP '' strings where you need to quote both  
> \ and '. The Eiffel way adds extra lines and can break up an  
> expression.

If you want to include chunks of one language inside another, you
have to have SOME kind of quoting.  Take shell 'here' documents as
an example:

	cat <<'EOF'
	data lines, taken exactly as written
	EOF

Every line is taken literally, up to but excluding the EOF line (a line
that is identical to the word following <<).  Use <<-'EOF' instead, and
leading tabs will be stripped from the data lines, so you can indent
the document nicely.  (This is rather like the "{ -vs- "[ distinction
in Eiffel.)  Remove the quotes from 'EOF' (so <<EOF or <<-EOF) and
command and parameter substitution and \ processing are done.
If you want some lines with leading tabs in the data, you have to use
<<'EOF' (or <<EOF) and give up on indentation.  If you want a line in
the data that exactly matches EOF, you are out of luck; you will have
to choose some other end of file magic word.  But *some* magic word
there must be, or all of the rest of the containing file will be taken.

There seem to be only a few ways to get the effect of NO quoting at all.
1.  The Eiffel/sh way: if the EOF string occurs in the data you want you
     are out of luck.  (In Eiffel's case, totally.)
2.  Use a number instead of quoting, like Fortran's late unlamented
     Hollerith literal.  I don't like the idea of writing
	44`[.?!][]\"')}]*\\($\\| $\\|\t\\|  \\)[ \t\n]*
     and I don't suppose anyone else does either.  (Not least because I
     probably counted wrong.)
3.  Use a word-processor-like interface where "string", "code", and
     "comment" are styles (hence not indicated in the text stream at
     all, but in a hidden markup stream).  Of course, this has severe
     trouble when you try to use a source file in language X containing
     a no-quotes string in language Y which contains a no-quotes string
     in language Z.

I don't see "break[ing] up an expression" as a problem at all.
One change I *do* like very much is adding
	<variable> = <constant expression>.
as a kind of top level definition.  So 'no-quotes' strings just plain
should never BE in expressions in the first place.  You should have

	End_Of_Sentence = "{
	[.?!][]\"')}]*\\($\\| $\\|\t\\|  \\)[ \t\n]*
	}"

or something of the sort as a top level definition with a name.
This is also the reason why I don't see "add[ing] extra lines" as a
serious problem.

Bertrand Meyer is clever, but he's not the only clever person, and
it may well be possible to come up with something even better than
Eiffel's "{..}" and "[..]" long string syntax.  I'm not in love with
it myself.  But it IS prior art which DOESN'T involve any kind of
quoting in the body of the string.

> About the regular expression syntax, if I understand you correctly  
> then what you basically want to do is separate specifying/parsing  
> the regexp and applying it.

No.  That's what we have in most languages: re_compile (specify/parse)
+ re_match (apply).  What I want is FOUR things:

     1.  Source form.  There will eventually be MANY of these, each
	imitating closely one of the many many different regular
	expression syntaxes out there (Vim, Emacs, Java, Perl, Tcl,
	AWK, ...).  I would prefer that NONE of these should have
	special syntactic support, because I don't see any reason
	for any one of them to be so privileged.  (POSIX syntax would
	be the obvious candidate for this privilege, except that there
	are two POSIX syntaxes.)  The way these syntaxes should be
	supported is by library packages.  (In Ada they would be
	child packages of a regexp package; with Erlang's Java-wannabe
	flat-name-space-with-dots-in-it scheme there would be no point.)

     2.	Abstract syntax trees.  Just exactly what these _are_ should be
	the private concern of some module, but there should be a single
	set of functions one can use to construct abstract syntax trees.
	This is important because it lets one construct regular
	expressions with NO double or triple quoting, NO worries about
	exactly which syntax one is using, and with the marvellous
	power of functional abstraction available for constructing them.

	Just recently I marked some student code where half the students
	had it easy and half had it very hard.  The half who found it
	easy were working with data structures that represented the
	abstract syntax of their data:  parsing stuff that came from
	a file, unparsing stuff sent to a file, but otherwise working
	on a nice clean exceptionless data structure.  The half who
	found it hard were working with strings that held the external
	form of the data.  Almost every operation was nastily complex
	because of this.  Working with a textual representation of
	regular expressions is a NIGHTMARE.

     3.  A compiled form ready for some kind of execution.  There might
	be more than one of these.  One might be incremental and one
	not, for example.

     4.  Matching.

Given the variation in surface syntax for regular expressions, and
the variation on the ways one might want to compile them (DFAs, NDFAs,
backtracking parsers, ..., or to a data structure, to native code, or
to Erlang source code in the case of something like Leex), what I am
asking for really is the obvious interface:  the bit in the *middle*
that is common to all of them.
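To make point 2 concrete, here is one possible shape for that middle
interface.  The module name, function names, and tuple representation
are all invented for illustration; the whole point is that the
representation would be private and only the constructors public:

```erlang
%% Sketch of an AST-constructor interface for regular expressions.
%% Callers build trees with these functions and never write (or
%% quote) any surface syntax at all.
-module(re_ast).
-export([char/1, seq/1, alt/1, star/1]).

%% Match one literal character.
char(C) when is_integer(C) -> {char, C}.

%% Match a sequence of regexps, one after another.
seq(Rs) when is_list(Rs) -> {seq, Rs}.

%% Match any one of the given alternatives.
alt(Rs) when is_list(Rs) -> {alt, Rs}.

%% Match zero or more repetitions.
star(R) -> {star, R}.
```

A caller could then write (ab|c)* as

	re_ast:star(re_ast:alt([re_ast:seq([re_ast:char($a), re_ast:char($b)]),
	                        re_ast:char($c)]))

with no quoting anywhere, wrap common patterns in ordinary functions,
and hand the resulting tree to whichever back end it likes (DFA, NFA,
or Leex-style code generation).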

> I am almost ready with a new regexp module which will never  
> explode, is hopefully reasonably fast, works directly on binaries  
> as well as strings and can handle subexpressions. This version  
> support POSIX regexps and an interface based on AWK. All that is  
> left is to work out details of the interface and return values. It  
> is internally based on NFAs.

This is great news.
