[erlang-questions] {eep} Metaprogramming in Erlang

Tue Oct 23 13:47:59 CEST 2007

Hello everybody,

I have been working on some ideas for quite a while and it didn't go
as fast as I would have liked, but now at least I have a draft that I
am happy with.

What I have in mind is (as the title says) adding metaprogramming
capabilities to Erlang.

The attached draft EEP is addressing some of the issues involved.
Since this offers a solution to some problems ventilated more or less
recently (last in row being Joel's XML rant :-), my belief is that it
might be an interesting topic to discuss.

The draft eep is inlined in this message (instead of being attached)
so that it will be easier to make comments next to the relevant part
of the document.

I welcome all input, so please don't feel shy!

best regards,
Vlad

---------------------------------------------------------------------------

EEP: XXXX
Title: Metaprogramming
Version: $Revision: $
Last-Modified: $Date: $
Author: Vlad Dumitrescu <vladdu55@REDACTED>
Status: Draft
Type: Standards Track
Content-Type: text/x-rst
Created: 23-Oct-2007
Erlang-Version: R12
Post-History:

Abstract
========

This EEP describes an extension to add elements of metaprogramming to
the language.  The basic ideas are those presented by Walid Taha et
al. in their work on multi-stage programming [1]_.  Code is given data type
status and operators for handling code are defined.

A prototype implementation is being worked on.

Background
==========

Metaprogramming is a powerful computing paradigm where programs and
data are interchangeable.  One can write programs that create and
manipulate other programs, that can then be run in order to obtain the
answer to one's problem.

This isn't the place to go into details about metaprogramming, there
is a lot to read in the literature and the Internet.  The source of
inspiration is the work of Walid Taha and team [1]_, please find more
information there.  Here we will focus on how to apply these
techniques in Erlang.

Motivation
==========

Being able to handle code as data has many use cases.  Some examples
with immediate benefits for current Erlang applications are:

Optimization by partial application
-----------------------------------

Computing a function of several parameters can be optimized by fixing
some of the parameters and generating a tailored function of the rest
of them. Since this generation is done offline, it may speed up
considerably the original call.

For example, the function ``pow(X, N)`` that raises x to the power of
n may be specialized for a given n. The generated function could look
like ::

  pow_27(X) ->
    P3 = X * X * X,
    P3 * P3 * P3.

which is faster than a generic implementation could be.

Embedding foreign languages
---------------------------

Code expressions may be written in any language, thus giving the
programmer instant access to the expressivity of that language.

For example, one may write ::

  <|sql| select id, name, address from users where age>=@MinimumAge |>

instead of concatenating strings or fiddling with a representation of
the above query using Erlang terms.

By extension, the same mechanism allows to embed any structured
textual data in its native format, not only code, for example XML::

  <xml|
    <person id="funnyFace">
      <name>
        <firstname>Bo</firstname>
        <lastname>Brady</lastname>
      </name>
    </person>
  |>

and use this notation even for matching.

Easier to write parse transforms
--------------------------------

If the "foreign" embedded language is Erlang itself, then writing code
that processes code becomes just as simple as writing regular code.
Compare ::

    [{match,
     1,
     {var,1,'Z'},
     {'fun',
         1,
         {clauses,
             [{clause,
                  1,
                  [{var,1,'X'}],
                  [],
                  [{tuple,
                       1,
                       [{var,1,'X'},
                        {call,
                            1,
                            {remote,1,{atom,1,erlang},{atom,1,now}},
                            []}]}]}]}}}]

with ::

    <| Z = fun(X)->{X, erlang:now()} end |>

True macros
-----------

Taking the next step with Erlang as an embedded language, one can
implement a macro system where the macros are not only simple textual
substitution mechanisms, but have the full power of the language at
their disposal.  This allows high level ideas to be expressed very
concisely and thus free the programmer from having to deal with
boilerplate code.

Specification
=============

Operators and Syntax
--------------------

The operators needed to implement metaprogramming facilities according
to [1]_ (where the description is adapted from) are described below.
Please note that in order to simplify the presentation, the values
shown for code expressions differ from actual values that will be
returned by an implementation.

Bracket
'''''''

Brackets can be inserted around any text to mark it as a code
fragment, represented at runtime as a parse tree.  A bracketed
expression can also be looked at as having a delayed execution time
(if the respective language is executable).

**Syntax**::

  <|Language:Func| Code |>
  <|Language| Code |>
  <|:Func| Code |>
  <| Code |>

``Language`` and ``Func`` are unquoted atoms.  There may be no
whitespace betwee the leftmost "``|``" characters.  ``Code`` may be
any text, with the only requirement that any "``<|``" and "``|>``"
inside it are balanced.  Otherwise, these tokens must be escaped like
"``\\<|``" or "``\\|>``" and the lexer will remove the escape
characters.  It is of course assumed that ``Code`` is a piece of valid
code in the language ``Language``.

In the shorter versions, the implied value for ``Language`` is ``erl``
and refers of course to Erlang itself.  The default value for ``Func``
is language-dependent, see below.

**Example**::

  > V = <| 5 + 3 |>.
  {op, '+', [{integer, 5}, {integer, 3}]}

Note that there is no separate type for code expressions, at run-time
they are indistinguishable from manually constructed parse trees.  It
would be useful to be able to distingush code expressions, not the
least so that the values could be printed to the console in a prettier
way::

  > V = <| 5 + 3 |>.
  <| 5 + 3 |>

That is however something left for future improvements.

Splice
''''''

Splice allows the combination of smaller code fragments to construct
larger ones.  This combination is achieved by "splicing-in" the
argument of the Splice in the context of the surrounding Brackets.

Splice combines snippets efficiently in the sense that the combination
of the subcomponents of the new computation is performed while the new
computation is being constructed, rather than while it is being
executed.

**Syntax**::

  ~Expression

The operator is interpreted slightly differently if it is encountered
inside a pattern or a regular expression.  Inside a regular
expression, it evaluates ``Expression`` and splices the result into
the body of a surrounding bracketed expression. Inside a pattern,
Expression must be a variable and the splice operator protects it from
the effect of bracketing.  More details later.

The value of ``Expression`` must be code based on the same language as
the surrounding brackets. Otherwise, the result is not defined.

The exact syntax for the splice operator is language dependent, there
isn't a single notation that would fit all possible languages.  In
case the default notation described here can't be used for a specific
language, the language support must provide an implementation of a
function named ``get_splice_syntax/0`` that returns a string
describing the syntax to look for. The argument of the splice operator
is specified as ``$$`` and any matching bracker-style characters
``{[((]}`` will be matched in the input too. A textual ``$$`` text has
to be escaped.

**Example**::

  > V = <| 5 + 3 |>.
  {op, '+', [{integer, 5}, {integer, 3}]}
  > <| {~V, ~V} |>.
  {tuple, [{op, '+', [{integer, 5}, {integer, 3}]},
           {op, '+', [{integer, 5}, {integer, 3}]}]}
  > foo_lang:get_splice_syntax().
  "bar{$$}"

Run
'''

Run allows the execution of a code fragment.  Having Run in the
language is important if we want to use code constructed using the
other Erlang constructs, without going outside the language.

Run is only meaningful for languages that represent code. The language
support must include appropriate mechanisms in the form of a ``run/1``
function that will get called with the respective argument.

**Syntax**::

  run(Expression)

**Example**::

  > V = <| 5 + 3 |>.
  {op, '+', [{integer, 5}, {integer, 3}]}
  > run(V).
  8

Lift
''''

Lift allows us to inject values of ground type into a value of type
code.  Ground types are types containing no arrows (function types);
that is because given a functional value we can't reconstruct its
source code representation.  However, functions can still be delayed
using Brackets::

  > F = fun(X) -> X+1 end.
  > V = <| F(5) |>.
  {call, F, [5]}
  > run(V).
  6

Lift is only meaningful for executable code. [[TODO: define language
support defining the syntax of lift]]

While both Brackets and Lift construct code, Lift does not delay its
argument.  Lift first evaluates its argument, then constructs a
representation for this value.

**Syntax**::

  lift(Expression)

**Example**::

  > V = lift(5 + 3).
  {integer, 8}

Variables and Levels
--------------------

If bracketed expressions are embedded inside other bracketed
expressions, the resulting code becomes *multi-level*.  Escaped
expressions have to be interpreted at the appropriate level and there
are two big issues to discuss.

Cross-Stage Persistence
'''''''''''''''''''''''

It feels natural to be able to write code like the following (where we
will use the "pretty" representation of code to increase readability)
::

  > A = 42, Code = <| fun(X) -> X + A end |>.
  <| fun(X) -> X + 42 |>

where A (defined at level 0) is treated as a free variable and thus
appears as a constant inside the code (at level 1).

Cross-Stage Safety
''''''''''''''''''

A variable violates cross-stage safety when it is bound at one level
and is used at a lower level. For example ::

  fun(X) -> <| fun(Y) -> ~(X+Y) |>

The annotations in this expression dictate computing X+Y in the first
stage, when the value of Y will be available only in the second stage!

In this implementation, the result will be a "variable not bound"
compilation error.

Applying Staging (an Example)
-----------------------------

Using these annotations, a programmer can modify the default strict
order evaluation of programs and modify a "normal" program to a staged
one. Let's take for example a function that searches a list for a value::

  %% member(term(), list()) -> boolean()
  member(_, []) -> false;
  member(V, [H|T]) ->
    if V==H -> true;
       true -> member(V, T)
    end.

We suppose the second argument, the list to be searched, is available
in the first stage and the other values (the first argument and the
return value) are delayed. This information can be used to apply
annotations so that the types of the values are kept consistent. The
result would be ::

  %% member(code(), list()) -> code()
  member2(_, []) -> <| false |>;
  member2(V, [H|T]) ->
    <| if ~V==~(lift(H)) -> true;
          true -> ~member2(V, T)
       end|>.

The result of calling ``<| fun(X) -> ~member2(X, [1, 2, 3]) end |>``
will be ::

  <| fun(X) ->
       if X==1 -> true;
          true -> if X==2 -> true;
                     true -> if X==3 -> true;
                                true -> false
                             end
                  end
       end
     end |>

Unfortunately, if using pattern matching in the function's clauses, I
couldn't find a way to perform staging without having to rewrite the
compiler. Work continues in this direction.

----

Parsing
-------

The construct described above will be parsed according to the
following rules:

- The code construct will be parsed as an expression *(TBD: define
  precedence)*.

- The handling of the code is delegated to a parser module, whose name
  is ``Language ++ "_parser"``.  If this module isn't found, an error
  is returned.

- The parser module must implement the following interface:

  * ``parse(Tokens) -> {ok, Tree} | {error, Error}``

    parses the given list of tokens and return a parse tree or an error.

  * ``get_default_func() -> string()`` [optional]

    returns the default value of Func.  If the function doesn't exist,
    an empty string is used.  This allows, for example, to specify if
    the Erlang parser will use ``parse_forms`` or ``parse_exprs`` as
    default.

  * ``parse_<Func>(Tokens) -> {ok, Tree} | {error, Error}`` [optional]

    If the code expression is declared with ``<|Lang:Func|``, then
    ``parse_Func/1`` will be invoked instead of ``parse/1``.

- The parse trees are composed of nodes represented as tuples.  The
  layout must be similar to the Erlang abstract format, with the node
  name as first element and a position indicator as the second.

- The MetaErlang parser takes the resulting parse tree and use it as
  the value of the code expression (after abstracting its
  representation).

- If the code expression is a pattern, it is necessary to replace the
  line numbers (or equivalent) with ``'_'``, so that different
  formatting styles of the same code do match.

Code Variables
--------------

When matching code expressions, one usually wants to extract
subexpressions.  This means that there must be a way to specify that
some entity in the code is just a placeholder.  The syntax for doing
this is language dependent, for Erlang the notation

    ``@VariableName``

is chosen. The notation ``@_`` will correspond to an unnamed variable.

If the language parser doesn't recognize code variables, then they
aren't supported for that language and their usage will yield an
exception.

When encountering a code variable, the language parser will return a
tree node ``{<code_variable_node_name>, Line, VariableName}``, where
``<code_variable_node_name>`` is an atom as returned by the language
support's ``get_code_variable_node/0`` API.  The MetaErlang parser
will detect these nodes and when abstracting the parse tree it will
replace them with a variable definition.

Taking the Erlang example above, writing ::

    <| @_ = fun(X) -> @Exprs end |> =
        <| Z = fun(X)->{X, erlang:now()} end |>``

will be equivalent to (rewritten as two matches for brevity) ::

    _ = {var, 1, 'Z'},
    Exprs = [{tuple,
              1,
              [{var,1,'X'},
               {call,
                1,
                {remote,1,{atom,1,erlang},{atom,1,now}},
                []}]}]

Of course, getting the results one wants requires deep knowledge of
the way the parse trees look like behind the simple-looking
syntax. Some things are even be impossible without further extensions,
so they will still require good ol' hands-on work (^^)/

Language support API summary
============================

The language support for language ``mylang`` is accessible from a
module named ``mylang_parser``. The API exposed by this module is as
follows:

  * ``get_splice_syntax() -> string()`` [optional]

    returns a string illustrating the syntax of the splice operator,
    with the argument represented as ``$$``.  Any ``{[()]}``
    characters must be balanced and will be matched accordingly in the
    input.  If not implemented, the default value of ``"~$$"`` will be
    assumed.

  * ``parse(Tokens) -> {ok, Tree} | {error, Error}``

    parses the given list of tokens and return a parse tree or an error.

  * ``get_default_func() -> string()`` [optional]

    returns the default value of Func.  If the function doesn't exist,
    an empty string is used.  This allows, for example, to specify if
    the Erlang parser will use ``parse_forms`` or ``parse_exprs`` as
    default.

  * ``parse_<Func>(Tokens) -> {ok, Tree} | {error, Error}`` [optional]

    If the code expression is declared with ``<|Lang:Func|``, then
    ``parse_Func/1`` will be invoked instead of ``parse/1``.

  * ``get_code_variable_node() -> atom()`` [optional]

    If the language supports variable bindings, the parser will return
    a "fake" node with this atom as name, which will be handled
    accordingly by the MetaErlang parser.  The default value is
    ``'@code_variable'``.

  * ``run(Expression) -> term()`` [optional]

    If the language is executable, this function evaluates the
    Expression (which must represent code in thet language) and
    returns the result as an Erlang term.

Open Issues
===========

- If the embedded language is Erlang, one might want to be able to use
  the code construct recursively.  This is not currently supported,
  but there is ongoing work in that direction.

- Some parsers may need extra parameters.  The above construct could
  be extended to allow for that, but the notation shouldn't be clumsy.

Reference Implementation
========================

A reference implementation exists, but it is incomplete. It will be
made available after processing the first feedback on this EEP, as I
suspect there will be substantial changes.

References
==========

.. [1] Multi-stage programming, Walid Taha and others
       http://www.cs.rice.edu/~taha/MSP/

Credits
=======

A lot of input comes from discussions with Richard O'Keefe, Ulf Wiger,
Yariv Sadan, Joe Armstrong and others.

Copyright
=========

This document has been placed in the public domain.

..
    Local Variables:
    mode: indented-text
    indent-tabs-mode: nil
    sentence-end-double-space: t
    fill-column: 70
    coding: utf-8
    End: