[erlang-questions] Adoption of perl/javascript-style regexp syntax

Thu Jun 4 04:09:47 CEST 2009

On 3 Jun 2009, at 9:35 pm, Vlad Dumitrescu wrote:
>> There is of course a much better way to deal with regular
>> expressions in a language like Lisp or Erlang.  One of my pet
>> slogans is "STRINGS ARE WRONG".
>
> I suppose that you mean something like "embedded strings in a language
> are wrong when representing anything else than plain text". And I
> couldn't agree more, they are evil - strings that represent for
> example a regexp should be a different data type than a text message
> string.

If we agree about that, everything else is less important.
>
>
>> The way to represent something
>> like "^[[:alpha:]_][[:alnum:]_]*:[[:space:]]" is
>>       rex:seq([rex:bol(),rex:id(),rex:space()])
>> where regular expression syntax is replaced by Erlang syntax.
>> This is so much more powerful than fancy quoting schemes for
>> strings that it just isn't funny: you can compute any subexpression
>> at any time you find useful _without_ new syntax, and without any
>> run-time parsing.
>
> [I am sure you already know all of the following, Richard, but from
> your answer above you might have forgot it in the spur of the moment]
>
> The same could be said about writing Erlang or C or Java parse trees
> directly instead of letting the parser build them for us from a
> string.

If you want to build them dynamically, or in another language,
yes.  Absolutely.

> Yet we don't do that because the textual representation has
> some advantages: it's easier to read, it is higher level, it's easier
> to modify and we're not bound to a specific internal representation.

It may be easier to READ, but it is far harder to WRITE correctly.
As for modifying, no, it is NOT easy to modify.  And strings *are*
a specific internal representation.

> Regexps are (as you say) a structured datatype. Nobody disagrees. But
> we have a widespread, standard and compact way to represent them.

Wrong.  We have *many* ways to represent them.  We have shell
syntax, understood by fnmatch() and glob().  We have two POSIX
syntaxes.  We have AWK syntax, which though POSIX, isn't quite
identical to either of the others.  Oh, and lex/flex/jflex et al.,
which are somewhat different again.  We have HyTime syntax.  We
have Perl.  We have PCRE where the "C" is pretty good but not
perfect.  We have Java regexp syntax, which is subtly different
again.  It simply is not even close to true that we have *A*
standard way to do it.

And this is another reason why trees are better.
Because we can express a regular expression in a way that is
independent of the target linear notation.  (Not independent
of the capabilities of the target _engine_ -- few 'regular
expression' engines support recursion, as misbegotten Perl does --
but independent of the fine details of the _notation_.)

To take just one example, given the pattern a\10b, what character
does the \10 represent?  Is it backspace, or newline?  If we
generate linear notation only when needed to communicate with
some other system, it is no longer *our* problem.
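A minimal sketch of that idea, assuming a made-up tree shape
({seq,List}, {star,Rex}, {char,Code}) and two made-up target names
(pcre, awk); none of this is an existing library.  The tree says
"character code 8" unambiguously, and each renderer picks an escape
that its own target reads the same way:

```erlang
-module(rex_render).
-export([render/2, example/0]).

%% Hypothetical tree shape, for illustration only:
%%   {seq, [Rex]} | {star, Rex} | {char, Code}   (Code < 256)

%% render(Target, Rex) -> the pattern in Target's linear notation.
render(T, {seq, Rs}) ->
    lists:append([render(T, R) || R <- Rs]);
render(T, {star, R}) ->
    "(" ++ render(T, R) ++ ")*";
%% Lower-case letters stand for themselves in both notations.
render(_, {char, C}) when C >= $a, C =< $z ->
    [C];
%% Two-digit hex escape, e.g. \x08.
render(pcre, {char, C}) when C < 256 ->
    lists:flatten(io_lib:format("\\x~2.16.0B", [C]));
%% Three-digit octal escape, e.g. \010.
render(awk, {char, C}) when C < 256 ->
    lists:flatten(io_lib:format("\\~3.8.0B", [C])).

%% The same tree, a-backspace-b, written out for each target.
example() ->
    Rex = {seq, [{char, $a}, {char, 8}, {char, $b}]},
    {render(pcre, Rex), render(awk, Rex)}.
```

Here example() yields the two texts a\x08b and a\010b from one tree;
the question "what does \10 mean?" never arises inside the program.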

> Why
> wouldn't we want to use that instead of Erlang terms?

Because there simply is no one "that" for us to use.

> Given a compiler
> that understands this, the following examples will generate exactly
> the same code:
>    identifier() -> {seq,{cset,letters()},{star,{cset,continuers()}}}.
>    identifier() -> "{letters}{continuers}*".
> I know which one I find easier to read and understand.

Me too:  the first one.  Because the second one is a literal string.
It contains the _text_ l,e,t,t,e,r,s, but not in any reasonable sense
the _identifier_ letters.  I can create the first one AT RUN TIME.
When does "{letters}{continuers}*" work, and when does
"{le"++"tter"++"s}{c"++"ontinuer"++"s}*" not work?  The second
approach creates such monstrous problems.  The first one eliminates
them.
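To make "at run time" concrete, here is a small sketch that reuses
the term shapes from the quoted example ({seq,A,B}, {star,R},
{cset,Chars}).  The module name and the particular character sets are
assumptions chosen for illustration:

```erlang
-module(rex_build).
-export([identifier/2, example/0]).

%% Build the identifier pattern from two character sets that are
%% ordinary run-time values, not text inside a string literal.
identifier(Letters, Continuers) ->
    {seq, {cset, Letters}, {star, {cset, Continuers}}}.

example() ->
    Letters = lists:seq($a, $z) ++ "_",
    %% The continuer set is computed from the letter set.
    Continuers = Letters ++ lists:seq($0, $9),
    identifier(Letters, Continuers).
```

Letters and Continuers could just as well come from a configuration
file or a message; no quoting or re-parsing is involved at any point.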

It is also simpler to write and test a compiler that deals correctly
with the first than one that deals with the second.

> Regarding your security concerns about cross-scripting, I don't think
> they are 100% relevant in this discussion. Those problems appear when
> one takes a string from the external world and "pastes" it mindlessly
> inside a program that is then executed.

Yup.  Exactly what we are talking about here.

Remember, I'm _also_ talking about receiving a string at run time
and including it in a regular expression which is then included
in something else.  I don't understand why anyone is satisfied
with compile-time-only semi-solutions.
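A sketch of the run-time case, assuming a hypothetical {lit,String}
node and a renderer that targets the pattern syntax of Erlang's re
module.  The string received at run time is data in the tree;
escaping happens once, in the renderer, so the string cannot change
the shape of the pattern.  Input is assumed to be ASCII or Latin-1:

```erlang
-module(rex_lit).
-export([render/1, matches/2]).

%% Hypothetical node shapes: {seq,[Rex]}, {star,Rex}, bol, eol,
%% and {lit,String} for text that must match literally.
render({seq, Rs})  -> lists:append([render(R) || R <- Rs]);
render({star, R})  -> "(?:" ++ render(R) ++ ")*";
render(bol)        -> "^";
render(eol)        -> "$";
render({lit, Str}) -> lists:append([escape(C) || C <- Str]).

%% ASCII letters and digits stand for themselves; every other ASCII
%% character gets a backslash; anything above 127 is left alone.
escape(C) when C >= $a, C =< $z;
               C >= $A, C =< $Z;
               C >= $0, C =< $9 -> [C];
escape(C) when C < 128 -> [$\\, C];
escape(C) -> [C].

%% true if Subject matches the rendered pattern.
matches(Subject, Rex) ->
    re:run(Subject, render(Rex), [{capture, none}]) =:= match.
```

With UserText = "a.c" arriving from outside,
matches("a.c", {seq,[bol,{lit,UserText},eol]}) is true and the same
call with "abc" is false, because the dot was never a metacharacter.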

> We are talking here about
> being able to let a string (the erlang source file) be tokenized and
> parsed by several scanners and parsers. There is no part in this
> string that is injected from the outside so that the programmer's
> intentions can be abused.

Oh?  And who said that all Erlang source files were constructed
by hand?
>
>
> All in all, regular expressions are just a particular case of embedded
> language.

Yes.  And as my string/JavaScript/XML/string example points out,
a particularly simple case.

> If there is to be any change to the Erlang syntax, I
> wouldn't want it tailored to a specific language.

And as the same thing points out, a technique that deals with
just ONE level of language embedding doesn't solve the problem
generally enough.

We need one conceptually simple approach that can be used to nest
and dynamically create instances of any number of languages.
Trees are much much better at that job than strings.
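A sketch of nesting more than one level, assuming made-up node names
and deliberately incomplete escaping rules (only a few characters are
handled).  Plain text sits inside a JavaScript string literal, which
sits inside a JavaScript call, which sits inside an XML attribute;
each level applies its own quoting when, and only when, it is
rendered:

```erlang
-module(nest_demo).
-export([render/1, example/0]).

%% Hypothetical node shapes, for illustration only:
%%   {text, String}
%%   {js_string, Node}        Node as a JavaScript string literal
%%   {js_call, Name, [Node]}  a JavaScript call expression
%%   {xml_attr, Name, Node}   Node as an XML attribute value
render({text, S}) ->
    S;
render({js_string, N}) ->
    "'" ++ lists:append([js_esc(C) || C <- render(N)]) ++ "'";
render({js_call, F, Args}) ->
    Rendered = [render(A) || A <- Args],
    F ++ "(" ++ lists:append(lists:join(",", Rendered)) ++ ")";
render({xml_attr, Name, N}) ->
    Name ++ "=\"" ++ lists:append([xml_esc(C) || C <- render(N)]) ++ "\"".

%% Partial JavaScript string escaping: quote and backslash only.
js_esc($\') -> "\\'";
js_esc($\\) -> "\\\\";
js_esc(C)   -> [C].

%% Partial XML attribute escaping: ampersand, less-than, double quote.
xml_esc($&)  -> "&amp;";
xml_esc($<)  -> "&lt;";
xml_esc($\") -> "&quot;";
xml_esc(C)   -> [C].

example() ->
    render({xml_attr, "onclick",
            {js_call, "greet", [{js_string, {text, "it's <ok>"}}]}}).
```

example() produces onclick="greet('it\'s &lt;ok>')".  Adding another
level of embedding means adding another node type and its escape
function, not inventing another quoting scheme.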

> For example, I want
> to be able to embed Erlang code inside Erlang, which would allow
> macros like LFE has and other goodies.

I am familiar with 'cc and xoc and I've seen something similar for
Java, not to mention Template Haskell.   But when I generate C
code from inside C, I use trees and love them.