embedding other languages in source code (was: regexp escapes)

Thu Jun 4 15:33:53 CEST 2009

Hi Richard (and all other interested),  (to the others, sorry for
taking so much bandwidth, please let me know if this is waaay off
topic)

On Thu, Jun 4, 2009 at 04:09, Richard O'Keefe <ok@REDACTED> wrote:
> We need one conceptually simple approach that can be used to nest
> and dynamically create instances of any number of languages.
> Trees are much much better at that job than strings.

Earlier you gave this example:

> Imagine that you have
>  - string
>  - inside a JavaScript expression
>  - inside an XML attribute
>  - where the XML has to appear as data in an Erlang program.

I don't think that such a non-trivial beast, built manually as a data
structure, would be maintainable. In this particular example, 99% of
the structure of the JavaScript and XML is irrelevant -- one only
wants to parametrize the JS expression with a computed value. Even
if someone wanted to write this as a tree, I don't think many
people would manage it without first writing the XML+JS snippet in a
text editor and then parsing it by hand. The XML might work because
it has a very simple structure, but who can write JS parse trees
directly?

And IMHO this is the crux of our difference of opinion: there are
embedded languages that are structurally simple but have quirks that
make streamlined representations error-prone. For example, regexps,
where it's easy to get confused by all the syntaxes in circulation, or
XML/HTML, where you have to match end tags. These are good candidates
for a direct tree representation, where it matters. Even so, if I have
to generate a 200K XML file where only 20 attributes are computed, it
would be much easier to use a template, parse it and process it.

There are also languages that are complex and whose parse trees aren't
trivial. In simple cases, it's still possible to use trees, but if I
understood correctly, we are looking for a general solution that works
even when I need to generate a C program that generates an Erlang
program that generates HTML.

Please allow me to try to summarize, so that we can eliminate
misunderstandings. If I'm describing something different from what you
have in mind, I apologize for the noise (but maybe we can start a new
discussion :-).

Context
-----------
It is sometimes necessary to embed a piece of code in language Y inside
source code in language X (where X might be the same as Y). Examples
are regular expressions, io format strings, HTML, code generation. The
Y code might recursively contain code in other languages.

If the Y code has to be able to refer to entities in X, the Y
representation must be extended with some meta-items. For example, if
X="world", then "hello, {X}" (or the corresponding parse tree {seq,
[{text, "hello, "}, {var, 'X'}]}) could be interpreted as equivalent
to io_lib:format("hello, ~s", [X]).
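
To make the meta-item idea concrete, here is a minimal sketch of an
interpreter for the parse tree above. The module name tmpl_sketch, the
function names and the binding-list format are my own assumptions for
illustration, not an existing library. Note that io_lib:format/2 uses
~s as its string directive.

```erlang
-module(tmpl_sketch).
-export([render/2, demo/0]).

%% Render a template tree against a list of {Name, Value} bindings.
%% Node shapes (taken from the example in the text):
%%   {seq, [Node]}  - a sequence of nodes
%%   {text, String} - literal text
%%   {var, Name}    - meta-item, replaced by the bound value
render({seq, Nodes}, Bindings) ->
    [render(N, Bindings) || N <- Nodes];
render({text, Text}, _Bindings) ->
    Text;
render({var, Name}, Bindings) ->
    case lists:keyfind(Name, 1, Bindings) of
        {Name, Value} -> Value;
        false -> erlang:error({unbound_variable, Name})
    end.

%% Check that the tree and the format string give the same text.
demo() ->
    Tree = {seq, [{text, "hello, "}, {var, 'X'}]},
    FromTree = lists:flatten(render(Tree, [{'X', "world"}])),
    FromFormat = lists:flatten(io_lib:format("hello, ~s", ["world"])),
    FromTree = FromFormat,   % crashes with badmatch if they differ
    FromTree.
```

An unbound variable is reported with its name, which is more than a
plain search-and-replace on the string form would give.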

Code that originates from external sources is a different issue: it
already is a data structure that we can handle as we see fit. It is
the embedded code that is a problem, because we don't have as much
control over the parser and compiler.

Problem
------------
Which is the best way to represent the Y code inside X?

Alternatives
----------------
There are three ways to represent Y code in X (details below) that I
can think of:
    A) as a string, using Y's concrete syntax. This is either
       processed as a string, using search and replace techniques,
       or is parsed before being processed.
    B) as a parse tree, a regular data structure of the X language.
       The parsing is done by the programmer, by hand.
    C) as code, a special source code construct. The Y concrete
       syntax is used but the parser and compiler are aware of the
       nature of this construct.
If anyone has another idea, please let me know.

(A)
This is the most commonly used method, especially when the Y snippet
is short/compact (regexps are an example).

+ the concrete syntax is familiar to users.
+ usually, this representation is compact and easy to read (as much as
Y is easy to read).
+ meta-items can be avoided by building the string dynamically (but see below)
- inside strings, some characters need to be escaped. This may be
confusing and makes things harder to read, especially if X and Y use
the same escape sequence.
- it is possible to ensure at compile time that the content of the
string is a well-formed Y program only if the string is a constant and
if language X supports running custom code at compile-time. Even then,
it doesn't work for multi-level embedding.
- building the string dynamically is prone to errors that show up
only at run time and are difficult to find
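
The two main drawbacks of (A) can be illustrated with the re module.
This is only a sketch; the module name string_embed_sketch and the
Build fun are hypothetical names picked for the example.

```erlang
-module(string_embed_sketch).
-export([demo/0]).

demo() ->
    %% The regexp \d+\.\d+ needs every backslash doubled, because
    %% "\\" is how an Erlang string literal spells one backslash.
    {match, [Number]} =
        re:run("pi is 3.14", "\\d+\\.\\d+", [{capture, first, list}]),

    %% Building the pattern dynamically: the compiler accepts any
    %% argument, so a malformed regexp is only detected when
    %% re:compile/1 runs.
    Build = fun(Word) -> "^" ++ Word ++ "$" end,
    {ok, _} = re:compile(Build("abc")),
    {error, _Reason} = re:compile(Build("a(bc")),
    Number.
```

Nothing in the source code marks "a(bc" as suspicious; the unbalanced
parenthesis is found only by the regexp compiler at run time.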

(B)
This is usually used when Y has a regular and relatively simple
structure and/or a concrete syntax with high redundancy, like XML.

+ easy to parametrize the build method for code elements and reuse sub-snippets
+ the tree might be more compact than the concrete syntax
+ it is more difficult to build an invalid tree
+ a library can hide the data structure, if desired
- the tree might be less compact than the concrete syntax
- unless the tree data structure is explicit (not hidden behind a
library), it can only be used as an expression (i.e. not, for
example, as a pattern to match on)
- (arguably) it may be harder to read than the concrete syntax
- if there are several embedding levels, it becomes even harder to
read
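
As a sketch of (B) for an XML-like Y: the node shape {Tag, Attrs,
Children}, the module name tree_embed_sketch and the helper anchor/2
are assumptions made for this example, not an existing API. End tags
are generated, so they cannot be mismatched, and escaping happens in
one place.

```erlang
-module(tree_embed_sketch).
-export([to_xml/1, anchor/2, demo/0]).

%% Serialize a tree to an iolist. A node is either a string (text)
%% or {Tag, Attrs, Children}, with Tag an atom and Attrs a list of
%% {Name, Value} pairs (Name an atom, Value a string).
to_xml({Tag, Attrs, Children}) ->
    Name = atom_to_list(Tag),
    ["<", Name, [attr(A) || A <- Attrs], ">",
     [to_xml(C) || C <- Children],
     "</", Name, ">"];
to_xml(Text) when is_list(Text) ->
    escape(Text).

attr({Name, Value}) ->
    [" ", atom_to_list(Name), "=\"", escape(Value), "\""].

%% Escape the characters that are special in XML text and attributes.
escape(Text) -> [escape_char(C) || C <- Text].

escape_char($<) -> "&lt;";
escape_char($>) -> "&gt;";
escape_char($&) -> "&amp;";
escape_char($") -> "&quot;";
escape_char(C)  -> C.

%% A reusable, parametrized sub-snippet.
anchor(Href, Label) ->
    {a, [{href, Href}], [Label]}.

demo() ->
    Tree = {p, [], ["see ", anchor("/x?a=1&b=2", "R&D")]},
    lists:flatten(to_xml(Tree)).
```

The tree in demo/0 is arguably less readable than the markup it stands
for, which is the trade-off listed above.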

(C)
This requires support built into the parser for X (and for Y, if some
code is to be embedded in it, and so on). One example of X that
supports it is MetaML.

+ the concrete syntax is familiar to users.
+ usually, this representation is compact and easy to read (as much as
Y is easy to read).
+ we don't need to escape the content as much as for A
+ syntax of Y is checked at compile-time, malformed snippets can't be created
+ code expressions could be used everywhere it may make sense, for
example in patterns
+ easy to parametrize and reuse sub-snippets
+ the parser is (and the compiler can be made) fully aware of the
involved languages and can make sure there are no errors and also do
magic things like macro definitions à la lisp
- adding support for this in the X parser is not the easiest thing
- a compatible parser for each Y language must be available from X
- there still remain some escaping problems: the delimiters for the
code snippet and the way meta-elements are represented. They could be
made configurable, similar to the `!.....! suggestion for regexps.

I'm sure I missed some points somewhere, hopefully nothing important.

As always, the best solution depends on the amount of generality one
wants, on the amount of work one can do upfront and on the input data
(the combination of X and Y).

best regards,
Vlad