[erlang-questions] some language changes

ok ok@REDACTED
Wed May 23 01:17:06 CEST 2007


When it comes to discussing Erlang language changes,
it seems impertinent to disagree with Joe Armstrong,
but I must protest.

On 21 May 2007, at 8:10 pm, Joe Armstrong wrote:

> 1. What you type in the shell and what you type in a module/escript
> should be the same

This one I don't care about so much, but it is worth examining.

This has never been the case in Smalltalk, and nobody seems to
notice any problem.  The cases are precisely analogous:  if you
want to define a class (think "module") or method (think "named
function") in Smalltalk you have to do it in a Browser (think
"editor"), not a Workspace (think "shell").   (There _is_ a way to
do it, but it's about 100 times harder.)

It has never been the case in Prolog.  If you want to define a
module or predicate in Prolog, consult it from a file.  Again,
there is an alternative.  In Prolog's case, it's quite an easy
one.  One of the files you can consult is 'user' (think "/dev/tty")
so instead of typing
     ?- p(X) :- X = 1 ; X = 2.
     ?-
at the prompt ("?- " is the prompt), you do this:
     ?- [user]. % load code from /dev/tty
     p(X) :- X = 1 ; X = 2.
     end_of_file.
     ?-
Experienced Prolog programmers know that it is almost always a bad
idea to do this, because stuff you enter that way disappears from
sight.  You can call it, but not edit it.

And that's a really important point.  Joe is suggesting that
the Erlang shell should be changed to make it easier for beginners
to shoot themselves in the foot, instead of learning practices that
will help to keep them out of trouble.  I cannot think that a good idea.

While you can indeed do this in Lisp and Scheme, bear in mind that
Lisp and Scheme originally had no modules.  Scheme still has none,
and Lisp modules are, um, unusual in that they are never closed, so
it is always possible to add new stuff inside a module.  Nor have
Lisp's modules ever been as closely associated with files as Erlang's.

I would also instance Haskell as an Erlang-related language in which
the language used in modules and the language used in a shell (as in
hugs, ghci, hbi, ...) are quite different.  For example,
     f x = x + 1
is a perfectly good function definition in a file, but in a shell
it is a syntax error (double-checked in hbi just now).  (Yes, I know
that hbi does allow function definitions in the shell.  My point is
that the syntax used for that purpose is drastically different from
normal Haskell syntax.)

>
> 2. Hashmaps (aka associative arrays)

It's not clear quite what is being proposed here.

I am very keen to have psi-terms where the "keys" can be atoms
and atoms ONLY, because that will permit an EFFICIENT replacement
for records.  (Not as efficient as records, but a heck of a lot
more efficient than a-lists or general hash tables.)
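
To make that cost spectrum concrete, here is a minimal sketch of the
two existing end points, records and a-lists; psi-terms would sit in
between.  The module and function names are mine, purely for
illustration:

     -module(keyed_demo).
     -export([x_of_record/1, x_of_alist/1]).

     %% Records: constant-time field access (compiles to element/2),
     %% but the field set is fixed at compile time.
     -record(point, {x, y}).

     x_of_record(#point{x = X}) -> X.

     %% A-lists: atom keys without compile-time coupling,
     %% but every lookup is a linear search.
     x_of_alist(Ps) -> proplists:get_value(x, Ps).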

Hash tables where the keys can be just about anything are another
matter, not least because Erlang/OTP already has them.
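For instance, ets already gives you tables keyed by arbitrary terms,
no language change required.  A minimal sketch, with table and key
names of my choosing:

     T = ets:new(demo, [set]),
     true = ets:insert(T, {{any, "term"}, [1, 2, 3]}),
     [{{any, "term"}, [1, 2, 3]}] = ets:lookup(T, {any, "term"}).
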
>
> 3. Extended string syntax: idea - put an atom *before* the string quote
>     to say what the string means and to *change* the syntax rules
>     that apply to the string content.
>
>      X = C "......"
>           C = a control atom
>
>        X = regexp " ... "
>
>           = html " .... "
>
>          Then we could write regexps and LaTeX inside strings without
>          all the horrible additional quotes

This is the one that stung me into writing.  Can Joe really be
serious here?  HOW are the syntax rules to be changed?  Is there
any limit at all?

Amongst other things, consider this:  there is not just ONE regexp
syntax, there are MANY.
   - Classic sh wildcards (? = ., * = [^/]*, [...] = [...])
   - Csh wildcards (as above but including {...,...,...})
   - ed(1) regular expressions
   - grep(1) regular expressions
   - egrep(1) regular expressions
   - awk(1) regular expressions
   - POSIX regular expressions (at least two kinds)
   ...
and then of course there are Perl and Java regular expressions, whose
designers missed the point of regular expressions, namely EFFICIENT
matching.  So either you build at least a dozen regular expression
syntaxes into the Erlang lexical analyser, or you define one single
regular expression syntax (possibly a new one) and find that the
feature is useless to practically everyone, because each user needs
one of the other syntaxes.

I'm old enough to have used (and loved, for all its quirkiness) SNOBOL.
I'm crazy enough to still have and occasionally use a copy of SNOBOL
on my Solaris box.  And if I am not going to get linear time matching
(thank you Perl, NOT!) then I don't see why I should have to put up
with the limitations of regexp syntax.  Give me SNOBOL!

More seriously, Erlang HAS leex and yecc.  So Erlang HAS means of
expressing pattern matches.
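Purely as a hedged illustration of what already works, here is a
minimal leex specification (the token shape is my own choice, not
anything from Joe's proposal), which leex turns into an ordinary
scanner module:

     Definitions.
     D = [0-9]

     Rules.
     {D}+      : {token, {integer, TokenLine, list_to_integer(TokenChars)}}.
     [\s\t\n]+ : skip_token.

     Erlang code.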

Again, if I want HTML, I want HTML, *not* strings.  Strings are
just about the worst possible way to represent HTML, with or without
funny rules.  More precisely, I want one of two things.
(a) I want something that looks just like HTML, with the ability
     to plug stuff into it.

     And yes, it would be entirely possible to extend Erlang syntax
     to include an XML-like form.

      <primary>  ::= <tag open> <tag body>
      <tag open> ::= '<' <name> {<name> '=' <value>}...
      <name>     ::= <variable> | <atom> | '(' <expression> ')'
      <value>    ::= <non-XML primary>
      <tag body> ::= '/>'
                   | '>' [<expression> {',' <expression>}...
                          ['||' <generators> | '|' <expression>]] <end tag>
      <end tag>  ::= '</' [<name>] '>'

     So instead of
        X = html "<table><tr><td>$(X)<td>$(Y)</table>"
     we would have
        X = <table><tr><td>X</><td>Y</></></table>

     This is not idle speculation.  I have a preprocessor for C that
     lets me do this kind of stuff in C.  Here is an extract from an
     actual C program:

     .   <table summary=""> <tr> <td width="160" valign="top">
         for_each_named_child(e1, i2, "PERSONA", j2, e2)
     .       ^e2 <br/>
         end_each_named_child
     .   </td> <td width="20"> "\240" </td>
     .   <td valign="bottom">
     .       <i> ^first_child_named(e1, "GRPDESCR") </i> <br/> </td>
     .   </tr> </table>

     The preprocessor is amazingly simple too.  I was an idiot; I should
     have written the preprocessor in AWK, not C.  Here's what that
     could look like in Erlang:

     T = <table summary="">
           <tr>
             <td width="160" valign="top">
               E2, <br/> || E2 <- named_children(E1, 'PERSONA')
             </td>,
             <td width="20"> "\240" </td>,
             <td valign="bottom">
               <i> first_named_child(E1, 'GRPDESCR') </i>, <br/>
             </td>
           </tr>
         </table>,

     Why would you want to make this look like a string?

     I add here an extremely important note.  The easy way to try out
     an extension like this is to write a preprocessor.  Erlang would
     not necessarily be the best language to do that in.  Haskell is
     far more compact (about a factor of 2).  However, every extra
     complication added to the language (like 'html "..."') makes it
     harder to write a preprocessor.  Yes, this is a paradox, but like
     all good paradoxes it's true:  the simpler the language is, the
     easier it is to experiment with extensions to it.

(b) I want to process (X(HT)?|HT)ML in a declarative language using
     a natural data structure in that language.  In this case, I need
     to know what the data structure actually *is*, and since with
     the 'html "..."' notation I *don't* know that, it is only another
     banana skin in my path.

     Now the notation I outlined above can be adapted to patterns as
     well.  Once the language is extended with psi-terms we can treat
     the attributes of an XML element as a psi-term and the children
     as a list.  So we could have stuff like

      rows(<table>|T</table>) -> reverse(rows_aux(T, [])).

      rows_aux([], R) -> R;
      rows_aux([X|Xs], R) -> rows_aux(Xs, rows_one(X, R)).

      rows_one(<caption>|_</>,  R) -> R;
      rows_one(<col>|_</>,      R) -> R;
      rows_one(<colgroup>|_</>, R) -> R;
      rows_one(<thead>|T</>,    R) -> rows_aux(T, R);
      rows_one(<tfoot>|T</>,    R) -> rows_aux(T, R);
      rows_one(<tbody>|T</>,    R) -> rows_aux(T, R);
      rows_one(X = <tr>|_</>,   R) -> [X|R].

     This is enough to compensate for not knowing what the data structure
     is; with this notation (mapped onto some Erlang data) I would be
     able to write Erlang code *as if* XML were a native data type.
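
     For comparison, here is a hedged sketch of those same functions
     over one plausible concrete representation, {Tag, Attributes,
     Children} tuples as in xmerl's "simple form"; this mapping is my
     assumption, not part of the proposal:

          rows({table, _Attrs, T}) -> lists:reverse(rows_aux(T, [])).

          rows_aux([], R) -> R;
          rows_aux([X|Xs], R) -> rows_aux(Xs, rows_one(X, R)).

          rows_one({caption, _, _},  R) -> R;
          rows_one({col, _, _},      R) -> R;
          rows_one({colgroup, _, _}, R) -> R;
          rows_one({thead, _, T},    R) -> rows_aux(T, R);
          rows_one({tfoot, _, T},    R) -> rows_aux(T, R);
          rows_one({tbody, _, T},    R) -> rows_aux(T, R);
          rows_one(X = {tr, _, _},   R) -> [X|R].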

     Given that we could have something like this, what would be the
     point of saddling ourselves with anything less capable?
>

> 4. Simple string substitutions
>
>      X = subst "aaa ${P} bbb ${Q} ccc"
>
>       *means*    X = [<"aaa">, P, <"bbb">, Q, ...]
>
>      This would be very useful.

Not really.  There is no string substitution in AWK.  So how do AWK
string expressions compare with Perl's?  They are SHORTER!

Perl:
	"aaa ${P} bbb ${Q} ccc"
AWK:
	"aaa "P" bbb "Q" ccc"
Haskell:
	"aaa "++p++" bbb "++q++" ccc"

But wait!  Joe says that the translation of
     'subst "aaa ${P} bbb ${Q} ccc"'
is
     [<"aaa ">, P, <" bbb ">, Q, <" ccc">]
instead of the expected
     ("aaa " ++ P ++ " bbb " ++ Q ++ " ccc")

I must say that to anyone used to string interpolation in UNIX shells,
in Perl, in TCL, or practically anything I can think of that has it,
this is extremely surprising.  In all such languages, string
interpolation gives you a string, the same kind of string you would
normally get without any $() magic.

So let's amend this proposal:

     subst "aaa ${P} bbb ${Q} ccc"
		=> ["aaa ", P, " bbb ", Q, " ccc"]
     subst <"aaa ${P} bbb ${Q} ccc">
		=> [<"aaa ">, P, <" bbb ">, Q, <" ccc">]
     subst 'aaa ${P} bbb ${Q} ccc'
		=> ['aaa ', P, ' bbb ', Q, ' ccc']

Ask for a string and you GET an I/O list of strings.
Ask for a binary and you GET an I/O list of binaries.
Ask for an atom  and you get a      list of atoms.
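
And an I/O list is no hardship, because existing functions consume
one directly.  A minimal sketch, with variable bindings of my own
invention:

     P = "one",
     Q = "two",
     IoList = ["aaa ", P, " bbb ", Q, " ccc"],
     ok = io:put_chars(IoList),              %% prints: aaa one bbb two ccc
     <<"aaa one bbb two ccc">> = list_to_binary(IoList).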

There is another way in which this is confusing.  In other languages
with string interpolation, what is interpolated must be a string.
In this proposal it can apparently be any data structure whatever,
not excluding process IDs and funs.  If you are going to work with
I/O lists, it makes sense to interpolate I/O lists, but just how
much generality do we actually want?

Is string interpolation to be confined to the values of variables,
or will it allow arbitrary expressions between '${' and '}'?  If
only variables are allowed, why?  If expressions are allowed, I have
learned from the XML world that embedding languages inside each other
makes something very unpleasant to use, even when (as in the case of
XPath embedded in XSLT) one of them was *designed* to be embedded.

Let's look at the example again:
     subst "aaa ${P} bbb ${Q} ccc"		29 characters
     ["aaa ",P," bbb ",Q," ccc"]			27 characters
     subst <"aaa ${P} bbb ${Q} ccc">		31 characters
     [<"aaa ">,P,<" bbb ">,Q,<" ccc">]		33 characters

I'm not seeing any big savings here.

(I've been using <"xxx"> even though the actual Erlang syntax is
<<"xxx">> on the assumption that Joe is proposing that change as
well.)

There are two different issues that one might want to consider.
  - How can we make it easier to key programs in?
    The proposed change will sometimes do this.
  - How can we make it easier to write working programs?
    That often requires good tool support.  Anything that makes it
    harder for an editor to find a variable (my editor knows how to
    find all the variables in a Prolog clause; I keep meaning to
    adapt that to Erlang) is something that will make it HARDER to
    write working programs.
>

> None of these are large changes - but they would make Erlang a nicer
> language to program in. The (regexp "....") would be extremely useful
> to the compiler and allow generating efficient regexp matching code :-)

I'm teaching a 4th year functional programming paper.
(Haskell, sorry.  Not Erlang.)
There is a programming language for Lego robots called Not Quite C.
The third assignment for the functional programming students is to
write a program that reads NQC programs and reports on which global
variables are used and set by which tasks.  In order to make this a
reasonable assignment, I have provided the students with
  - a lexical analyser
  - an abstract syntax tree
  - a parser which constructs ASTs
  - some functions for enumerating things in ASTs

The lexical structure of NQC is very close to C.
The lexical analyser in Haskell was 190 SLOC, for 69 token types.
That's 2.75 SLOC per token type.  One of those lines is the declaration
of the token kind (equivalent to a %token declaration in Yacc/Lex) and
I could have packed more on a line if I wanted.  Another line is the
one that reports what kind of token has been bound.  Given my layout
style, the minimum imaginable cost is 2.00 SLOC per token type.
Some of the overhead is due to writing for clarity.  With a bit of
effort, I just got the SLOC count down to 107, or 1.55 SLOC per token
type.  (Fair enough to pack the token declarations together because in
Erlang, Lisp, or Prolog there wouldn't be any.)

All in all, as I pointed out to the students, if you want to tokenise
a language similar to C and have a decent functional language, who the
heck NEEDS lex, or regular expressions?  Oddly enough, if you want a
lexical analyser for Haskell, it's easy enough to do in Haskell, but
you CAN'T do it with regular expressions!
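
To make that concrete, here is a hedged sketch of direct functional
tokenising in Erlang, with no lex and no regular expressions; the
token shapes and names are mine, purely for illustration:

     -module(toy_lex).
     -export([tokens/1]).

     %% Whitespace is skipped; digit runs become integer tokens;
     %% lower-case runs become name tokens; two sample punctuators.
     tokens([]) -> [];
     tokens([C|Cs]) when C =:= $\s; C =:= $\t; C =:= $\n ->
         tokens(Cs);
     tokens([C|_] = Chars) when C >= $0, C =< $9 ->
         {Ds, Rest} = lists:splitwith(fun is_digit/1, Chars),
         [{integer, list_to_integer(Ds)} | tokens(Rest)];
     tokens([C|_] = Chars) when C >= $a, C =< $z ->
         {Name, Rest} = lists:splitwith(fun is_lower/1, Chars),
         [{name, list_to_atom(Name)} | tokens(Rest)];
     tokens([$+|Cs]) -> ['+' | tokens(Cs)];
     tokens([$=|Cs]) -> ['=' | tokens(Cs)].

     is_digit(C) -> C >= $0 andalso C =< $9.
     is_lower(C) -> C >= $a andalso C =< $z.

So tokens("x = y + 12") gives [{name,x},'=',{name,y},'+',{integer,12}],
and each extra token type costs a clause or two, much as in the
Haskell figures above.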

So I simply do not believe that some vaguely described
'regexp "..."' syntax would make compiler writing easier.
One thing we can be sure of is that it would make the writing of
Erlang processing tools HARDER.



