[erlang-questions] some language changes

Mon May 28 06:12:02 CEST 2007

On 28 May 2007, at 10:41 am, Robert Virding wrote:
> The way I see it there are two problems being discussed here:
>
> 1) another syntax for strings
> 2) another representation for regular expressions
>
> OK, some thoughts:
>
> 1) The basic problem is that there is quoting to allow "strange"  
> characters to be entered.

Note that " has to be counted as a "strange" character here.
If one had
	html "<foo bar="ugh">The programmer screamed "Help!"</foo>"
it could get very confusing.

Alien strings, and long strings, are not new problems.  I would draw
the attention of readers to ECMA Eiffel, where we read

As the following syntax indicates, there are two ways to write a
manifest string:
* A Basic_manifest_string, the most common case, is a sequence of
   characters in double quotes, as in "This text". Some of the
   characters may be special character codes, such as %N representing
   a new line. This variant is useful for such frequent applications
   as object names, texts of simple messages to be displayed, labels
   of buttons and other user interface elements, generally using
   fairly short and simple sequences of characters.  You may write the
   string over several lines by ending an interrupted line with a
   percent character % and starting the next one, after possible
   blanks and tabs, by the same character.
* A Verbatim_string is a sequence of lines to be taken exactly as they
   are (hence the name), bracketed by "{ at the end of the line that
   precedes the sequence and }" at the beginning of the line that
   follows it (or "[ and ]" to left-align the lines). No special
   character codes apply. This
   is useful for embedding multi-line texts; applications include
   description entries of Notes clauses, inline C code, SQL or XML
   queries to be passed to some external program.

"{...}" keeps the enclosed text exactly as is; "[...]" strips leading
white space so that stuff can be indented nicely.  You can actually
have stuff between " and { or [, and between ] or } and ".
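
To make the difference between the two forms concrete, here is a minimal
sketch in Erlang of what the "[...]" form does to the lines it encloses.
The module and function names are hypothetical, and the rule used (drop
as many leading blanks or tabs as the least-indented line has) is an
assumption that simplifies whatever the Eiffel standard actually
specifies; the "{...}" form would simply return the lines unchanged.

```erlang
-module(verbatim_sketch).
-export([left_align/1]).

%% left_align(Lines) takes a list of strings, one per source line, and
%% removes the same number of leading blank/tab characters from each:
%% as many as the least-indented line has.  (Assumed rule, see above.)
left_align([]) -> [];
left_align(Lines) ->
    N = lists:min([indent(L) || L <- Lines]),
    [lists:nthtail(N, L) || L <- Lines].

%% indent(Line) counts the leading blanks and tabs of one line.
indent([C | T]) when C =:= $\s; C =:= $\t -> 1 + indent(T);
indent(_) -> 0.
```

No printing character is touched, which is the point of the notation:
only the indentation changes.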

> An easier solution would be to introduce an alternate string form  
> which has absolutely NO quoting at all. We use another delimiter,  
> ~. So
>
> X = ~[.?!][]\"')}]*\\($\\| $\\|\t\\|  \\)[ \t\n]*~
>
> would work fine. To include ~ you double it ~~.

In Eiffel, just

	X := "[
	[.?!][]\"')}]*\\($\\| $\\|\t\\|  \\)[ \t\n]*
	]";

I'm not saying I particularly LIKE this notation, only that it IS an
existing notation used in a programming language with an ECMA standard,
so it could be worth a good look.  And of course any IDE which can
handle it in Eiffel should surely be able to handle it (or a slight
variation) in Erlang.  It's not particularly hard to lex, either.
There are two great advantages of this:
(a) Compared with Joe Armstrong's approach, it is ONE quoting
     convention to be used with ANY other embedded language.
(b) Compared with Robert Virding's suggestion, there need be NO change
     to the quoted text, not even doubling tildes.  With "{...}", there
     is no alteration at all; with "[...]" there is only as much
     alteration as your text editor's indentation command will do and
     even then no printing characters are magic.
(c) [So I'm bad at counting.]  No new characters are magic in Eiffel
     at all (nor would they be in Erlang).  The starting character for
     a multi-line verbatim string is the SAME character used for an
     ordinary string; it is just when a string ends in mid-air at the
     end of a line with a { or [ there that something different happens.
     So "foo", "{foo}", "bar[ugh]" and so on are no trouble.  If you can
     type Erlang strings, tuples, and lists on your keyboard, you can
     type Eiffel multi-line verbatim strings.
>

> 2) Using a more functional syntax to specify regexps would work  
> well. I assume that this form would return the internal parsed form  
> and not a string. My only question is how do we declare these  
> functions to be special regexp functions?

Whyever would we want to do THAT?  The whole point of using functional
form for regular expressions is that it ISN'T a special syntax and
DOESN'T require or get special treatment.

Let me give an example.  I'm currently writing a compiler front end for
a small C-like language in Haskell, using parser combinators.  One of
the things that makes life particularly EASY for me doing this is that
such combinators AREN'T special syntax in Haskell, so I can easily
roll my own.  For example, I can define

     pFold0 :: Parser t y -> (x -> y -> x) -> x -> Parser t x

     pFold0 p f x = p <*> pFold0 p f . f x <|> pSucceed x

"Given a parser p that returns interpretations of type y,
  an initial value x of type x, and a combining function
  f that combines an x and a y to give an x, match a sequence
  of things that p matches, and combine their results using f
  starting from x as initial value."

pSucceed matches an empty token list and returns its argument.
<|> is alternation.
p <*> q matches whatever p matches, and passes its result to q,
which returns a parser and matches whatever that matches.
. is function composition.

Having done that, I can then use pFold0 as a parser combinator on
a level with other parser combinators.  And this is one of the reasons
why the parsing part of the front end is tiny.

Now apply this to Erlang and regular expressions.  In Perl, case
handling is done by special weird backslash sequences.  But we can
do things like

	cilit(Cs) -> seq([any([tolower(C),toupper(C)]) || C <- Cs]).

and then use cilit(...) for case-independent literal matching anywhere,
INCLUDING WITH RUN-TIME DATA.
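
A self-contained version of that one-liner might look like the sketch
below.  Everything in it is an assumption made for illustration: the
module name, the choice of tagged tuples as the AST representation, and
the ASCII-only tolower/toupper helpers are not part of any existing
Erlang library.

```erlang
-module(re_ast_sketch).
-export([any/1, seq/1, cilit/1]).

%% Hypothetical AST constructors: a regular expression is just a term.
any(Cs) when is_list(Cs) -> {any, Cs}.   % match any one character in Cs
seq(Rs) when is_list(Rs) -> {seq, Rs}.   % match each of Rs in order

%% ASCII-only case mapping, to keep the sketch free of library calls.
tolower(C) when C >= $A, C =< $Z -> C + 32;
tolower(C) -> C.

toupper(C) when C >= $a, C =< $z -> C - 32;
toupper(C) -> C.

%% Case-independent literal: one two-way choice per character.
cilit(Cs) -> seq([any([tolower(C), toupper(C)]) || C <- Cs]).
```

In the shell, re_ast_sketch:cilit("ok") evaluates to
{seq,[{any,"oO"},{any,"kK"}]}, and the argument could just as well be a
string that was read from a file or a socket.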

> I suppose you could write:
>
> X = regex (seq(any(".?!"), span("\"')]}"),
>                alt("\n", "\t", seq(" ",any(" \t\n"))),
>                span(" \t\n")))

No, you would write

     X = regexp:compile(AST)

where AST is a regular expression abstract syntax tree made at run
time from run time data using any combination of library regular
expression functions and user-defined combinators.  In cases where
all the operands were known at compile time and no user-defined
combinators were involved, I suppose the compiler might pre-evaluate
such a thing, as the compiler may pre-evaluate anything it takes a
fancy to, but regular expressions are MUCH more useful if they are not
limited to what you know at compile time.  In particular, suppose you
decide that you want to match Vim regular expressions (which are not
the same as Perl or Java or AWK or ... ones, though they are similar).
You can write your own parser for them (not a very hard task) calling
the regular expression AST construction functions, but this ONLY works
if they can be called JUST LIKE ANY OTHER NORMAL FUNCTION.

What I am saying is that we get the most benefit from regular expressions
if (1) they are NOT built into the compiler, and
    (2) what IS provided in a library is a regexp AST kit (and compiler
        from ASTs to anything else, I care not what) which can be used
        by any number of special-purpose parsers and freely intermixed
        with user-defined functions.
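
As a rough illustration of how small such a kit can be, here is a sketch
of a few AST constructors together with a direct interpreter for them.
All names are hypothetical, the tuple representation is an assumption,
and a backtracking interpreter is only one possible "anything else" that
an AST could be compiled to; a real library would presumably build
something faster.

```erlang
-module(re_kit_sketch).
-export([any/1, seq/1, alt/1, star/1, matches/2]).

%% AST constructors: ordinary functions returning ordinary terms.
any(Cs) -> {any, Cs}.     % any one character from the list Cs
seq(Rs) -> {seq, Rs}.     % each of Rs, one after another
alt(Rs) -> {alt, Rs}.     % any one of Rs
star(R) -> {star, R}.     % zero or more repetitions of R

%% matches(AST, String) is true if AST matches the whole of String.
matches(R, S) -> m(R, S, fun(Rest) -> Rest =:= [] end).

%% m(AST, Input, K): match a prefix of Input, then call the
%% continuation K on what is left; backtrack when K returns false.
m({any, Cs}, [C | T], K) -> lists:member(C, Cs) andalso K(T);
m({any, _}, [], _K) -> false;
m({seq, []}, S, K) -> K(S);
m({seq, [R | Rs]}, S, K) ->
    m(R, S, fun(Rest) -> m({seq, Rs}, Rest, K) end);
m({alt, Rs}, S, K) -> lists:any(fun(R) -> m(R, S, K) end, Rs);
m({star, R}, S, K) ->
    %% Try one more repetition (only if it consumes input), else stop.
    m(R, S, fun(Rest) -> Rest =/= S andalso m({star, R}, Rest, K) end)
        orelse K(S).
```

A user-defined combinator is then nothing special: given
Plus = fun(R) -> re_kit_sketch:seq([R, re_kit_sketch:star(R)]) end,
the call re_kit_sketch:matches(Plus(re_kit_sketch:any("ab")), "abba")
is handled exactly like a call built from the library functions alone.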
>
> The major problem with this solution is that people won't be able  
> to snip regexs directly out of Friedl's book but might actually be  
> forced to understand them. :-)

They can't do that *NOW* because there is no such animal as "THE
regular expression syntax."  Right now, you cannot snip regexps
directly out of Friedl's book and expect to use them in Vim, or to
snip regexps directly out of the Vim book and expect to use them in
Java, or ... you get the picture.  Presumably Friedl's book explains
this.

>
> I have absolutely no problems with doing something about this, but  
> we need to decide which problem we are solving. How much are  
> regular expressions ACTUALLY used in Erlang code? How much effort  
> is it worth putting in to solve this problem?

The Erlang regular expression library is sufficiently limited (in its
coverage, in its features, and in its insistence that "regular
expression" is a kind of string, not a kind of expression, so that
layering another syntax on top of it is painfully hard) that its
present level of use is no indication of how much or what kind of use
regular expressions might have in Erlang if it were improved.

The great thing about the approach that I'm suggesting is that it
requires no changes to the Erlang language or compiler whatever;
regular expressions remain entirely a library issue.