EEP: XXXX Title: Metaprogramming Version: $Revision: $ Last-Modified: $Date: $ Author: Vlad Dumitrescu Status: Draft Type: Standards Track Content-Type: text/x-rst Created: 23-Oct-2007 Erlang-Version: R12 Post-History: Abstract ======== This EEP describes an extension to add elements of metaprogramming to the language. The basic ideas are those presented by Walid Taha et al. in their work on multi-stage programming [1]_. Code is given data type status and operators for handling code are defined. A prototype implementation is being worked on. Background ========== Metaprogramming is a powerful computing paradigm where programs and data are interchangeable. One can write programs that create and manipulate other programs, that can then be run in order to obtain the answer to one's problem. This isn't the place to go into details about metaprogramming, there is a lot to read in the literature and the Internet. The source of inspiration is the work of Walid Taha and team [1]_, please find more information there. Here we will focus on how to apply these techniques in Erlang. Motivation ========== Being able to handle code as data has many use cases. Some examples with immediate benefits for current Erlang applications are: Optimization by partial application ----------------------------------- Computing a function of several parameters can be optimized by fixing some of the parameters and generating a tailored function of the rest of them. Since this generation is done offline, it may speed up considerably the original call. For example, the function ``pow(X, N)`` that raises x to the power of n may be specialized for a given n. The generated function could look like :: pow_27(X) -> P3 = X * X * X, P3 * P3 * P3. which is faster than a generic implementation could be. Embedding foreign languages --------------------------- Code expressions may be written in any language, thus giving the programmer instant access to the expressivity of that language. For example, one may write :: <|sql| select id, name, address from users where age>=@MinimumAge |> instead of concatenating strings or fiddling with a representation of the above query using Erlang terms. By extension, the same mechanism allows to embed any structured textual data in its native format, not only code, for example XML:: Bo Brady |> and use this notation even for matching. Easier to write parse transforms -------------------------------- If the "foreign" embedded language is Erlang itself, then writing code that processes code becomes just as simple as writing regular code. Compare :: [{match, 1, {var,1,'Z'}, {'fun', 1, {clauses, [{clause, 1, [{var,1,'X'}], [], [{tuple, 1, [{var,1,'X'}, {call, 1, {remote,1,{atom,1,erlang},{atom,1,now}}, []}]}]}]}}}] with :: <| Z = fun(X)->{X, erlang:now()} end |> True macros ----------- Taking the next step with Erlang as an embedded language, one can implement a macro system where the macros are not only simple textual substitution mechanisms, but have the full power of the language at their disposal. This allows high level ideas to be expressed very concisely and thus free the programmer from having to deal with boilerplate code. Specification ============= Operators and Syntax -------------------- The operators needed to implement metaprogramming facilities according to [1]_ (where the description is adapted from) are described below. Please note that in order to simplify the presentation, the values shown for code expressions differ from actual values that will be returned by an implementation. Bracket ''''''' Brackets can be inserted around any text to mark it as a code fragment, represented at runtime as a parse tree. A bracketed expression can also be looked at as having a delayed execution time (if the respective language is executable). **Syntax**:: <|Language:Func| Code |> <|Language| Code |> <|:Func| Code |> <| Code |> ``Language`` and ``Func`` are unquoted atoms. There may be no whitespace betwee the leftmost "``|``" characters. ``Code`` may be any text, with the only requirement that any "``<|``" and "``|>``" inside it are balanced. Otherwise, these tokens must be escaped like "``\\<|``" or "``\\|>``" and the lexer will remove the escape characters. It is of course assumed that ``Code`` is a piece of valid code in the language ``Language``. In the shorter versions, the implied value for ``Language`` is ``erl`` and refers of course to Erlang itself. The default value for ``Func`` is language-dependent, see below. **Example**:: > V = <| 5 + 3 |>. {op, '+', [{integer, 5}, {integer, 3}]} Note that there is no separate type for code expressions, at run-time they are indistinguishable from manually constructed parse trees. It would be useful to be able to distingush code expressions, not the least so that the values could be printed to the console in a prettier way:: > V = <| 5 + 3 |>. <| 5 + 3 |> That is however something left for future improvements. Splice '''''' Splice allows the combination of smaller code fragments to construct larger ones. This combination is achieved by "splicing-in" the argument of the Splice in the context of the surrounding Brackets. Splice combines snippets efficiently in the sense that the combination of the subcomponents of the new computation is performed while the new computation is being constructed, rather than while it is being executed. **Syntax**:: ~Expression The operator is interpreted slightly differently if it is encountered inside a pattern or a regular expression. Inside a regular expression, it evaluates ``Expression`` and splices the result into the body of a surrounding bracketed expression. Inside a pattern, Expression must be a variable and the splice operator protects it from the effect of bracketing. More details later. The value of ``Expression`` must be code based on the same language as the surrounding brackets. Otherwise, the result is not defined. The exact syntax for the splice operator is language dependent, there isn't a single notation that would fit all possible languages. In case the default notation described here can't be used for a specific language, the language support must provide an implementation of a function named ``get_splice_syntax/0`` that returns a string describing the syntax to look for. The argument of the splice operator is specified as ``$$`` and any matching bracker-style characters ``{[((]}`` will be matched in the input too. A textual ``$$`` text has to be escaped. **Example**:: > V = <| 5 + 3 |>. {op, '+', [{integer, 5}, {integer, 3}]} > <| {~V, ~V} |>. {tuple, [{op, '+', [{integer, 5}, {integer, 3}]}, {op, '+', [{integer, 5}, {integer, 3}]}]} > foo_lang:get_splice_syntax(). "bar{$$}" Run ''' Run allows the execution of a code fragment. Having Run in the language is important if we want to use code constructed using the other Erlang constructs, without going outside the language. Run is only meaningful for languages that represent code. The language support must include appropriate mechanisms in the form of a ``run/1`` function that will get called with the respective argument. **Syntax**:: run(Expression) **Example**:: > V = <| 5 + 3 |>. {op, '+', [{integer, 5}, {integer, 3}]} > run(V). 8 Lift '''' Lift allows us to inject values of ground type into a value of type code. Ground types are types containing no arrows (function types); that is because given a functional value we can't reconstruct its source code representation. However, functions can still be delayed using Brackets:: > F = fun(X) -> X+1 end. > V = <| F(5) |>. {call, F, [5]} > run(V). 6 Lift is only meaningful for executable code. [[TODO: define language support defining the syntax of lift]] While both Brackets and Lift construct code, Lift does not delay its argument. Lift first evaluates its argument, then constructs a representation for this value. **Syntax**:: lift(Expression) **Example**:: > V = lift(5 + 3). {integer, 8} Variables and Levels -------------------- If bracketed expressions are embedded inside other bracketed expressions, the resulting code becomes *multi-level*. Escaped expressions have to be interpreted at the appropriate level and there are two big issues to discuss. Cross-Stage Persistence ''''''''''''''''''''''' It feels natural to be able to write code like the following (where we will use the "pretty" representation of code to increase readability) :: > A = 42, Code = <| fun(X) -> X + A end |>. <| fun(X) -> X + 42 |> where A (defined at level 0) is treated as a free variable and thus appears as a constant inside the code (at level 1). Cross-Stage Safety '''''''''''''''''' A variable violates cross-stage safety when it is bound at one level and is used at a lower level. For example :: fun(X) -> <| fun(Y) -> ~(X+Y) |> The annotations in this expression dictate computing X+Y in the first stage, when the value of Y will be available only in the second stage! In this implementation, the result will be a "variable not bound" compilation error. Applying Staging (an Example) ----------------------------- Using these annotations, a programmer can modify the default strict order evaluation of programs and modify a "normal" program to a staged one. Let's take for example a function that searches a list for a value:: %% member(term(), list()) -> boolean() member(_, []) -> false; member(V, [H|T]) -> if V==H -> true; true -> member(V, T) end. We suppose the second argument, the list to be searched, is available in the first stage and the other values (the first argument and the return value) are delayed. This information can be used to apply annotations so that the types of the values are kept consistent. The result would be :: %% member(code(), list()) -> code() member2(_, []) -> <| false |>; member2(V, [H|T]) -> <| if ~V==~(lift(H)) -> true; true -> ~member2(V, T) end|>. The result of calling ``<| fun(X) -> ~member2(X, [1, 2, 3]) end |>`` will be :: <| fun(X) -> if X==1 -> true; true -> if X==2 -> true; true -> if X==3 -> true; true -> false end end end end |> Unfortunately, if using pattern matching in the function's clauses, I couldn't find a way to perform staging without having to rewrite the compiler. Work continues in this direction. ---- Parsing ------- The construct described above will be parsed according to the following rules: - The code construct will be parsed as an expression *(TBD: define precedence)*. - The handling of the code is delegated to a parser module, whose name is ``Language ++ "_parser"``. If this module isn't found, an error is returned. - The parser module must implement the following interface: * ``parse(Tokens) -> {ok, Tree} | {error, Error}`` parses the given list of tokens and return a parse tree or an error. * ``get_default_func() -> string()`` [optional] returns the default value of Func. If the function doesn't exist, an empty string is used. This allows, for example, to specify if the Erlang parser will use ``parse_forms`` or ``parse_exprs`` as default. * ``parse_(Tokens) -> {ok, Tree} | {error, Error}`` [optional] If the code expression is declared with ``<|Lang:Func|``, then ``parse_Func/1`` will be invoked instead of ``parse/1``. - The parse trees are composed of nodes represented as tuples. The layout must be similar to the Erlang abstract format, with the node name as first element and a position indicator as the second. - The MetaErlang parser takes the resulting parse tree and use it as the value of the code expression (after abstracting its representation). - If the code expression is a pattern, it is necessary to replace the line numbers (or equivalent) with ``'_'``, so that different formatting styles of the same code do match. Code Variables -------------- When matching code expressions, one usually wants to extract subexpressions. This means that there must be a way to specify that some entity in the code is just a placeholder. The syntax for doing this is language dependent, for Erlang the notation ``@VariableName`` is chosen. The notation ``@_`` will correspond to an unnamed variable. If the language parser doesn't recognize code variables, then they aren't supported for that language and their usage will yield an exception. When encountering a code variable, the language parser will return a tree node ``{, Line, VariableName}``, where ```` is an atom as returned by the language support's ``get_code_variable_node/0`` API. The MetaErlang parser will detect these nodes and when abstracting the parse tree it will replace them with a variable definition. Taking the Erlang example above, writing :: <| @_ = fun(X) -> @Exprs end |> = <| Z = fun(X)->{X, erlang:now()} end |>`` will be equivalent to (rewritten as two matches for brevity) :: _ = {var, 1, 'Z'}, Exprs = [{tuple, 1, [{var,1,'X'}, {call, 1, {remote,1,{atom,1,erlang},{atom,1,now}}, []}]}] Of course, getting the results one wants requires deep knowledge of the way the parse trees look like behind the simple-looking syntax. Some things are even be impossible without further extensions, so they will still require good ol' hands-on work (^^)/ Language support API summary ============================ The language support for language ``mylang`` is accessible from a module named ``mylang_parser``. The API exposed by this module is as follows: * ``get_splice_syntax() -> string()`` [optional] returns a string illustrating the syntax of the splice operator, with the argument represented as ``$$``. Any ``{[()]}`` characters must be balanced and will be matched accordingly in the input. If not implemented, the default value of ``"~$$"`` will be assumed. * ``parse(Tokens) -> {ok, Tree} | {error, Error}`` parses the given list of tokens and return a parse tree or an error. * ``get_default_func() -> string()`` [optional] returns the default value of Func. If the function doesn't exist, an empty string is used. This allows, for example, to specify if the Erlang parser will use ``parse_forms`` or ``parse_exprs`` as default. * ``parse_(Tokens) -> {ok, Tree} | {error, Error}`` [optional] If the code expression is declared with ``<|Lang:Func|``, then ``parse_Func/1`` will be invoked instead of ``parse/1``. * ``get_code_variable_node() -> atom()`` [optional] If the language supports variable bindings, the parser will return a "fake" node with this atom as name, which will be handled accordingly by the MetaErlang parser. The default value is ``'@code_variable'``. * ``run(Expression) -> term()`` [optional] If the language is executable, this function evaluates the Expression (which must represent code in thet language) and returns the result as an Erlang term. Open Issues =========== - If the embedded language is Erlang, one might want to be able to use the code construct recursively. This is not currently supported, but there is ongoing work in that direction. - Some parsers may need extra parameters. The above construct could be extended to allow for that, but the notation shouldn't be clumsy. Reference Implementation ======================== A reference implementation exists, but it is incomplete. It will be made available after processing the first feedback on this EEP, as I suspect there will be substantial changes. References ========== .. [1] Multi-stage programming, Walid Taha and others http://www.cs.rice.edu/~taha/MSP/ Credits ======= A lot of input comes from discussions with Richard O'Keefe, Ulf Wiger, Yariv Sadan, Joe Armstrong and others. Copyright ========= This document has been placed in the public domain. .. Local Variables: mode: indented-text indent-tabs-mode: nil sentence-end-double-space: t fill-column: 70 coding: utf-8 End: