[erlang-questions] Adoption of perl/javascript-style regexp syntax

Wed Jun 3 11:35:32 CEST 2009

Hi,

On Tue, Jun 2, 2009 at 23:54, Richard O'Keefe <ok@REDACTED> wrote:
> On 2 Jun 2009, at 7:08 pm, Vlad Dumitrescu wrote:
>> As a programmer I like this way of handling this kind of issues
>> because it works now and it's easy.
>> As developer of a source handling tool I can't help but cringe at the
>> prospect of getting requests to support all kinds of homegrown
>> syntaxes...
>
> You mean like regular expression syntaxes?
> I've lost count of the number of different variations of
> regular expression syntax I've seen in UNIX.
> The point of the wee tool I mentioned of course was to provide
> *non*-syntax.

No, what I mean is syntaxes that let people mark some strings as
regular expressions so that a tool can process them and add
backslashes or whatever. A source file containing such a marker would
no longer be an Erlang source file, and could no longer be handled by
Erlang tools.

>> Another problem with external processing of the source files is that
>> it is at the same level as the preprocessor,
> Well, no, it understands far less of Erlang syntax than the
> Erlang preprocessor does, and operates way before it.

Even worse, then. I was being nice.

> But *any* program that computes source code by *any* means can
> be called a "preprocessor".  I have a Smalltalk-to-C compiler.
> You could call that a preprocessor if you like.  I don't think
> the word itself helps our understanding very much.

It can be called that, but nobody did so and I'm not sure what that
has to do with the current issue.

> We have Lisp-Flavoured Erlang.  If you want preprocessing that
> can "intelligently" deal with Erlang source code, LFE is _it_.

LFE can intelligently preprocess LFE source code, which is quite
different from Erlang source code. How does it help me handle a
vanilla Erlang module in erlide or emacs?

> There is of course a much better way to deal with regular
> expressions in a language like Lisp or Erlang.  One of my pet
> slogans is "STRINGS ARE WRONG".

I suppose that you mean something like "embedded strings in a language
are wrong when representing anything other than plain text". And I
couldn't agree more, they are evil: a string that represents, for
example, a regexp should be a different data type from a text message
string.
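
To illustrate what a separate data type could look like, here is a
minimal sketch. The module name regexp_type and the functions new/1
and match/2 are made up for this example; only re:compile/1 and
re:run/3 come from the standard re module.

```erlang
-module(regexp_type).
-export([new/1, match/2]).
-export_type([regexp/0]).

%% Hypothetical wrapper type: a compiled pattern tagged so that it
%% cannot be confused with an ordinary text string.
-opaque regexp() :: {regexp, re:mp()}.

%% Compile the source string once and wrap the result.
-spec new(string()) -> regexp().
new(Source) ->
    {ok, MP} = re:compile(Source),
    {regexp, MP}.

%% Only values built by new/1 are accepted here; passing a plain
%% string fails with a function_clause error.
-spec match(regexp(), string()) -> boolean().
match({regexp, MP}, Subject) ->
    re:run(Subject, MP, [{capture, none}]) =:= match.
```

With this, regexp_type:match("a+", "aaa") fails, while
regexp_type:match(regexp_type:new("a+"), "aaa") succeeds.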

> The way to represent something
> like "^[[:alpha:]_][[:alnum:]_]*:[[:space:]]" is
>        rex:seq([rex:bol(),rex:id(),rex:space()])
> where regular expression syntax is replaced by Erlang syntax.
> This is so much more powerful than fancy quoting schemes for
> strings that it just isn't funny: you can compute any subexpression
> at any time you find useful _without_ new syntax, and without any
> run-time parsing.

[I am sure you already know all of the following, Richard, but from
your answer above you might have forgotten it on the spur of the moment]

The same could be said about writing Erlang or C or Java parse trees
directly instead of letting the parser build them for us from a
string. Yet we don't do that because the textual representation has
some advantages: it's easier to read, it is higher level, it's easier
to modify and we're not bound to a specific internal representation.

The whole point of a parser is that the resulting AST is equivalent
to the input string. If the textual representation has restrictions on
what it can express, then it is so because the designer deemed it best
so (or it's a bug, but we can ignore that here). Bypassing that and
going directly to the parse tree might open a whole new can of worms.
For embedded languages that are more complicated than regexps or XML,
it might also be practically impossible to get it right manually.

Regexps are (as you say) a structured datatype. Nobody disagrees. But
we have a widespread, standard and compact way to represent them. Why
wouldn't we want to use that instead of Erlang terms? Given a compiler
that understands this, the following examples will generate exactly
the same code:
    identifier() -> {seq,{cset,letters()},{star,{cset,continuers()}}}.
    identifier() -> "{letters}{continuers}*".
I know which one I find easier to read and understand.
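
As a rough illustration of the equivalence (not of the compiler
support itself), the term form can be flattened into a pattern string
at run time. This is only a sketch: the module name rex_sketch, the
functions to_string/1 and matches/2, and the definitions of letters()
and continuers() are assumptions made for the example, and the output
is PCRE syntax for the re module rather than the "{letters}" notation
used above.

```erlang
-module(rex_sketch).
-export([identifier/0, to_string/1, matches/2]).

%% Hypothetical character classes; the examples above leave
%% letters() and continuers() undefined.
letters() -> "a-zA-Z_".
continuers() -> "a-zA-Z0-9_".

%% Term form of the regexp, as in the first example.
identifier() ->
    {seq, {cset, letters()}, {star, {cset, continuers()}}}.

%% Flatten the term into a pattern string for the re module.
to_string({seq, A, B}) -> to_string(A) ++ to_string(B);
to_string({star, A})   -> "(?:" ++ to_string(A) ++ ")*";
to_string({cset, Cs})  -> "[" ++ Cs ++ "]".

%% True if the whole subject string matches the term-form regexp.
matches(Rex, Subject) ->
    Pattern = "^(?:" ++ to_string(Rex) ++ ")$",
    re:run(Subject, Pattern, [dollar_endonly, {capture, none}]) =:= match.
```

Here rex_sketch:matches(rex_sketch:identifier(), "foo_1") is true and
the same call with "1foo" is false. The string is still parsed at run
time in this sketch; a compiler that understood the syntax would do
that work at compile time.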

Regarding your security concerns about cross-site scripting, I don't
think they are relevant to this discussion. Those problems appear when
one takes a string from the external world and "pastes" it mindlessly
inside a program that is then executed. We are talking here about
letting a string (the Erlang source file) be tokenized and parsed by
several scanners and parsers. No part of this string is injected from
the outside, so there is nothing through which the programmer's
intentions can be abused.

All in all, regular expressions are just a particular case of an
embedded language. If there is to be any change to the Erlang syntax,
I wouldn't want it tailored to one specific language. For example, I
want to be able to embed Erlang code inside Erlang, which would allow
macros like the ones LFE has, and other goodies.

best regards,
Vlad