embedding other languages in source code (was: regexp escapes)

Fri Jun 5 08:04:15 CEST 2009

On 5 Jun 2009, at 1:33 am, Vlad Dumitrescu wrote:

> Hi Richard (and all other interested),  (to the others, sorry for
> taking so much bandwidth, please let me know if this is waaay off
> topic)
>
> On Thu, Jun 4, 2009 at 04:09, Richard O'Keefe <ok@REDACTED>  
> wrote:
>> We need one conceptually simple approach that can be used to nest
>> and dynamically create instances of any number of languages.
>> Trees are much much better at that job than strings.
>
> Earlier you gave this example:
>
>> Imagine that you have
>> - string
>> - inside a JavaScript expression
>> - inside an XML attribute
>> - where the XML has to appear as data in an Erlang program.
>
> I don't think that such a non-trivial beast built manually as a data
> structure would be maintainable.

Except for the "Erlang" bit, people are trying to do this kind of
stuff in large numbers.  I have a fourth year student trying to
build a web site for some friends.  It wasn't my idea.  He'd agreed
to work on a compiler project, but was spending all his time on PHP.
I said I'd prefer him to pass on his project than fail on mine, and
I've also said that when you look at the business case for what he's
doing his friends might be better off if he _didn't_ build them a
web site.  So I'm not recommending any of this stuff.  But this kind
of horrible mish-mash of languages *is* the kind of stuff that people
*are* trying to do.  And they very often do it wrong.

Since a string is a data structure of sorts, it is difficult to see
how a manually built data structure would be *less* maintainable.

> In this particular example, 99% of
> the structure of the JavaScript and XML is irrelevant -- one only
> wants to configure the JS expression with a computed value. Even
> if someone would want to write this as a tree, I don't think many
> people would manage it without writing the XML+js snippet in a text
> editor and parsing it by hand. The xml might work because it's a very
> simple structure, but who can write js parse trees directly?

Every Lisp programmer on the planet?

The thing is, I am not advocating a single solution.  That is in
fact what I'm opposing.  In this case, SQL can lead the way.
In between "everything is a string literal" and "everything is
an explicit Lisp-like AST" there are intermediate positions
where you have templates with typed parameters, so that
	sql:generate("insert into [T: name] values (
		        [S: string], [Q: expr])",
	    [{'T',"fred"},
	     {'S',"select nothing from nowhere"},
	     {'Q',sql:generate("select nothing from nowhere")}])
knows to generate quotes around S and not around Q, and may do other
checks as well.
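
To make the typed-parameter idea concrete, here is a minimal sketch,
not a real library.  The module name typed_template, the {sql, Text}
result tag and the three hole types are all assumptions made for
illustration; a real version would need a proper SQL lexer and a way
to write a literal "[" in the template.

```erlang
-module(typed_template).
-export([generate/1, generate/2]).

%% Hypothetical sketch: holes look like [Key: Type], and the Type written
%% in the template decides how the bound value is rendered.
%% The result is tagged {sql, Text} so that an 'expr' hole can insist on
%% text that was itself produced by generate/1 or generate/2.

generate(Template) ->
    generate(Template, []).

generate(Template, Bindings) ->
    {sql, lists:flatten(expand(Template, Bindings))}.

expand([], _Bindings) ->
    [];
expand([$[ | Rest], Bindings) ->
    %% Split "Key: Type] more text" at the closing bracket.
    {Hole, [$] | After]} = lists:splitwith(fun(C) -> C =/= $] end, Rest),
    [Key, Type] = string:tokens(Hole, ": "),
    {_, Value} = lists:keyfind(list_to_atom(Key), 1, Bindings),
    [render(list_to_atom(Type), Value) | expand(After, Bindings)];
expand([C | Rest], Bindings) ->
    [C | expand(Rest, Bindings)].

%% name: an identifier, inserted bare after a (crude) validity check.
render(name, Value) ->
    true = (Value =/= []) andalso lists:all(fun is_name_char/1, Value),
    Value;
%% string: a data value, so quote it and double any embedded quotes.
render(string, Value) ->
    [$', lists:flatmap(fun($') -> "''"; (C) -> [C] end, Value), $'];
%% expr: only previously generated SQL is accepted, and it is not quoted.
render(expr, {sql, Text}) ->
    Text.

is_name_char(C) ->
    (C >= $a andalso C =< $z) orelse (C >= $A andalso C =< $Z)
        orelse (C >= $0 andalso C =< $9) orelse C =:= $_.
```

With this sketch, binding the same text to a string hole and to an
expr hole gives different output, and handing a plain string to an
expr hole crashes instead of silently pasting it in.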

I've written a lot of XML as S-expressions and find that it is,
if anything, rather easier than writing it as XML.  Certainly
less error-prone.
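
For anyone who has not tried it, a minimal sketch of XML written as
a tree might look like this.  The module name sxml and the
{Tag, Attributes, Children} shape are assumptions chosen for the
example; namespaces, empty-element tags and character encodings are
ignored.

```erlang
-module(sxml).
-export([to_string/1]).

%% Hypothetical sketch: an element is {Tag, Attributes, Children},
%% where Attributes is a list of {Name, String} pairs and each child
%% is either another element or a string of character data.

to_string(Node) ->
    lists:flatten(emit(Node)).

emit({Tag, Attrs, Children}) when is_atom(Tag) ->
    Name = atom_to_list(Tag),
    ["<", Name, [attr(A) || A <- Attrs], ">",
     [emit(C) || C <- Children],
     "</", Name, ">"];
emit(Text) when is_list(Text) ->
    escape(Text).

attr({Key, Value}) when is_atom(Key) ->
    [" ", atom_to_list(Key), "=\"", escape(Value), "\""].

%% Escaping happens in exactly one place, so the writer never has to
%% remember it, and end tags cannot be mismatched.
escape(Text) ->
    lists:flatmap(fun esc/1, Text).

esc($<) -> "&lt;";
esc($>) -> "&gt;";
esc($&) -> "&amp;";
esc($") -> "&quot;";
esc(C)  -> [C].
```

The point of the sketch is that whatever string sits in an attribute,
a JavaScript expression included, is escaped by the XML layer without
the author of that string having to think about XML at all.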

An earlier message of mine in this thread did mention 'cc and xoc.

As it happens, there's kit in my editor to do
	- quote region for language X
	- unquote region for language X
So I can take a chunk of text, select it, and say "turn this into
a C string."  I can then take a bigger chunk of text, and say
"turn this into a Haskell string".  It's table-driven.  I don't
have tables to say "turn this literal text into something that's
safe in a regular expression" because there is no one set of rules
I could use, sadly.

> And IMHO this is the crux of our difference of opinions: there are
> embedded languages that are structurally simple but have quirks that
> make streamlined representations error-prone. For example, regexps
> where it's easy to get confused by all the syntaxes in circulation. Or
> xml/html where you have to match end tags. These are good candidates
> for a direct tree representation, where it matters. If I have to
> generate a 200K xml file where only 20 attributes are computed, it
> would be much easier to use a template, parse it and process it.

And we can mix and match approaches.

The thing is that with a tree-based design it is much easier to
write tools to check that the JavaScript generated is _certainly_
well-formed, rather than (as with templates) _probably_ not as
badly formed as it might have been.
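
A minimal sketch of the difference, assuming a tiny made-up subset
of JavaScript expressions (the module name js_tree and the node
shapes are hypothetical): the only way to get text out is through
emit/1, so a term that is not one of the known shapes crashes at
generation time instead of producing broken JavaScript.

```erlang
-module(js_tree).
-export([to_string/1]).

%% Hypothetical sketch covering string and integer literals, variables,
%% calls, and three binary operators.

to_string(Expr) ->
    lists:flatten(emit(Expr)).

emit({str, S}) when is_list(S) ->
    [$", lists:flatmap(fun esc/1, S), $"];
emit({num, N}) when is_integer(N) ->
    integer_to_list(N);
emit({var, V}) when is_atom(V) ->
    Name = atom_to_list(V),
    %% Crude check: letters, "_" and "$" only; reserved words are
    %% not rejected.
    true = (Name =/= []) andalso lists:all(fun is_ident_char/1, Name),
    Name;
emit({call, F, Args}) when is_list(Args) ->
    [emit(F), "(", join([emit(A) || A <- Args]), ")"];
emit({op, Op, L, R}) when Op =:= '+'; Op =:= '-'; Op =:= '*' ->
    %% Always parenthesise, so precedence cannot go wrong.
    ["(", emit(L), " ", atom_to_list(Op), " ", emit(R), ")"].

join([])       -> [];
join([X])      -> [X];
join([X | Xs]) -> [X, ", " | join(Xs)].

%% JavaScript string-literal quoting lives here and nowhere else.
esc($")  -> "\\\"";
esc($\\) -> "\\\\";
esc($\n) -> "\\n";
esc(C)   -> [C].

is_ident_char(C) ->
    (C >= $a andalso C =< $z) orelse (C >= $A andalso C =< $Z)
        orelse C =:= $_ orelse C =:= $$.
```

A checker for a fuller language would be longer, but it would have
the same shape: one function that walks the tree.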

There is, for example, a Haskell package for manipulating HTML
so that it is a _Haskell_ type error if you try to generate an
HTML element inside something that cannot hold it.  I don't think
the Dialyzer type system is quite that capable, but it's a pointer.

> Context
> -----------
> It is sometimes needed to embed a piece of code in language Y inside
> source code in language X (where X might be the same as Y). Examples
> are regular expressions, io format strings, html, code generation. The
> Y code might recursively contain code in other languages.
>
> If the Y code has to be able to refer to entities in X, the Y
> representation must be extended with some meta-items. For example, if
> X="world", then "hello, {X}" (or the corresponding parse tree {seq,
> [{text, "hello, "}, {var, 'X'}]}) could be interpreted as equivalent
> to io_lib:format("hello, ~s", [X]).
>
> Code that originates from external sources is a different issue: it
> already is a data structure that we can handle as we see fit. It is
> the embedded code that is a problem, because we don't have as much
> control over the parser and compiler.
>

This is where we part company.  Stuff that comes from external sources
is NOT a different issue as I see it, or rather, it shouldn't be.  And
it DOESN'T come as a data structure.  Someone in this thread has said
something like "we are GIVEN regular expressions as strings".

Templates are a fine tool, but they need _typed_ substitution,
because interpolation-as-a-string-literal and interpolation-as-
verbatim-text and other kinds of interpolation are different.

There are at least two issues.
One of them is "How do we get the data into Erlang in the first place"
and the other is "what do we do with it when it's got there."
"Strong quotes", whether they are XML <![CDATA[...]]> sections or
Python triple-quoted strings or Lua [===[...]===] or Eiffel's
"%<newline>...<newline>%" or whatever, address the "how do we get
some verbatim text into Erlang without worrying about Erlang
quoting rules" problem.  Or nearly so.  What happens next?
And then you find you STILL need some sort of Erlang-specific
quoting or unquoting.

Suppose I have a regular expression into which I want to interpolate
an Erlang string as literal text.  Whatever method we use to indicate
this, we _also_ have to provide a way to say no, no, no, I don't mean
that, I actually mean what those characters would normally mean in
a regular expression.
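
A minimal sketch of that distinction, assuming PCRE-style syntax as
used by the re module (the module name re_build and the {lit, _} and
{re, _} tags are hypothetical): the pattern is a list of tagged
parts, and the tag, not the reader's guess, says which parts are
literal text.  It relies on the PCRE rule that a backslash followed
by a non-alphanumeric character stands for that character.

```erlang
-module(re_build).
-export([pattern/1]).

%% Hypothetical sketch: {re, Source} is regular expression source and is
%% passed through untouched; {lit, Text} is literal text and is quoted.

pattern(Parts) ->
    lists:flatten([part(P) || P <- Parts]).

part({re, Source}) when is_list(Source) ->
    Source;
part({lit, Text}) when is_list(Text) ->
    lists:flatmap(fun quote/1, Text).

%% Letters, digits, "_" and non-ASCII characters pass through;
%% everything else gets a backslash in front of it.
quote(C) when C >= $a, C =< $z;
              C >= $A, C =< $Z;
              C >= $0, C =< $9;
              C =:= $_; C > 127 ->
    [C];
quote(C) ->
    [$\\, C].
```

For example, pattern([{re,"^"}, {lit,"1+1"}, {re,"=\\d+$"}]) quotes
the "+" in the literal part, so the result matches "1+1=2" but not
"11=2", which the unquoted text would have matched.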

As I noted above, I can already type a regular expression into my
editor, put point at one end and mark at the other, and go
Ctrl-X " e (double-quote region for Erlang).  As far as I'm concerned,
that problem was solved so long ago that it's hard for me to remember
it being an issue, certainly not an issue that needs compiler changes.

What remains a problem is the layering and mixing, and especially,
in the context of web services, mixing in dynamically acquired stuff
from other sources, which is NOT already a non-trivial data
structure, just a string.

Vlad's analysis was detailed and thoughtful, and probably the most
helpful way for the thread to continue.

Oh, I should have mentioned that I'm one of the people who worked on the
'format' commands in Quintus Prolog, and I always found it
problematic that the formats were _always_ decoded at run time.
I did actually restructure the implementation at one point so that
it _could_ be used as a preprocessor, but that was shelved while we
tried to figure out how macros could co-exist with the debugger.
I was inspired in that by the work of Dr Peter Fenwick of the
University of Auckland, whose "STUFOR" (STUdent FORtran) compiler
for the B6700 fully processed Fortran formats at compile time and
tried very hard to give precise error messages for format errors
at both compile and run time.  That's always been the gold standard
for me of format/template processing.