[erlang-questions] Reified environments (was: Re: Package Support/Use)

Tue Nov 7 06:28:43 CET 2006

I've been asked what I meant when I talked about reifying module
environments and having more than one of them.  You can think of
this message as a *draft* description.  To my surprise, it turns
out that this approach not only does most of what I want without
too much hassle, it can even provide something uncommonly like
packages!

One reason that this should be considered a draft is that there's
a programmatic interface here, but no declarative interface, and
the "application" support in OTP should be extended to provide a
declarative interface.

In the description I use variable names with a leading % sign and
unquoted function names with a leading % sign.  These are not to
be taken literally; they are just a way of talking.

To start with, I propose to distinguish between three different
representations of a module.  Two of them are file based, and one
isn't.

		+---------------+
		| Source form	|			.erl, .yrl, &c
		+---------------+
			|
			|	compile
			v
		+---------------+
		| Object form   |			.jam, .beam, &c
		+---------------+			but also a data type
			|
			|	bind
			v
		+---------------+
		| Bound form    |			only a data type
		+---------------+

Compiling the source form of a module gives you the module in object form.
These are normally thought of as file based, although I have noted before
that you could use compressed abstract syntax trees as an object form and
that with a good compressor, the whole of the Erlang sources for Erlang/OTP
would fit neatly into an amount of memory that the average Java program
would sneer at as too small for Hello World.  The point of Object form is
to be something machine-independent in which all the work you'd expect to
be done at compile time, other than native code generation, has been done.

However, the idea of modules with parameters has already been introduced
into Erlang.  And there is one more parameter that we don't normally
think about, because at the moment there is only one of it.

While most of Erlang is based on pure functions, there are some notable
exceptions.  And some of those exceptions are pretty bad:

 - each atom names a global variable whose value is a module (or nothing).
 - each atom names a global variable whose value is a process (or nothing).

We have a *single* global name space for modules.  (Packages do not really
change this.)  Let's call that %ME (for Module Environment).

-import(M, [F/N]).

acts as if

    F(X1, ..., XN) -> M:F(X1, ..., XN).

had been written.
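
As a concrete instance in present-day Erlang (the module name import_demo
is made up for illustration), the two functions below behave identically:

```erlang
-module(import_demo).
-export([with_import/1, without_import/1]).
-import(lists, [reverse/1]).

%% The imported name is used as if it were a local function ...
with_import(Xs) -> reverse(Xs).

%% ... which acts just like the explicit remote call.
without_import(Xs) -> lists:reverse(Xs).
```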

    M:F(X1, ..., XN)

as a function call acts like

    (%deref(%flookup(%mlookup(%ME, M, KM), F, N, KF))
    )(X1, ..., XN)

where

    KM is true if M is an atom, false if it is a variable,
    KF is true if F is an atom, false if it is a variable.

    %mlookup(Module_Environment, Module, Create)
	when module(Module) -> Module;
    %mlookup(Module_Environment, Module, Create)
	when atom(Module) ->
	look the Module name up in the Module_Environment,
	if it was found ->
	    the result is that module
	 ; true = Create ->
	   create a module object with no exports,
	   install it in the module environment,
	   the result is that new module
	 ; true ->
	   raise an error
	end;
    %mlookup(Module_Environment, Module, Create) ->
    	raise an error because Module is the wrong type of thing.

and

    %flookup(Module, Name, Arity, Create)
	when function(Name) ->
	if arity(Name) == Arity ->
	    the result is the address of the function
	 ; true ->
	    raise an error
 	end;
    %flookup(Module, Name, Arity, Create)
	when module(Module), atom(Name), integer(Arity), Arity >= 0 ->
	look up Name/Arity in Module,
	if a {Function,Name,Arity} block was found ->
	    the result is the address of the function
	 ; true = Create ->
	    create a function which will raise an error,
	    install it in the Module,
	    the result is that new function
	 ; true ->
	    raise an error
	end;
    %flookup(Module, Name, Arity, Create) ->
	raise an error.
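
The two lookups can be modelled in ordinary Erlang.  What follows is only
a sketch under assumptions of my own choosing, not a description of any
real implementation: an environment is a map from atoms to modules, a
module is {module, Exports}, a dummy function is a {stub, Name, Arity}
tuple, and because Erlang data is immutable each lookup hands back the
possibly updated structure instead of installing anything in place.  The
% prefix is dropped because % starts a comment, and the name lookup_model
is invented.

```erlang
-module(lookup_model).
-export([mlookup/3, flookup/4, call/4]).

%% Assumed representation (not from the post):
%%   environment :: #{atom() => {module, Exports}}
%%   Exports     :: #{{Name, Arity} => fun() | {stub, Name, Arity}}

%% Returns {Module, Env1}; Env1 differs from Env only when a dummy
%% module with no exports had to be created.
mlookup(Env, {module, _} = Module, _Create) ->
    {Module, Env};
mlookup(Env, Name, Create) when is_atom(Name) ->
    case maps:find(Name, Env) of
        {ok, Module} ->
            {Module, Env};
        error when Create ->
            Dummy = {module, #{}},
            {Dummy, Env#{Name => Dummy}};
        error ->
            error({unknown_module, Name})
    end;
mlookup(_Env, Other, _Create) ->
    error({bad_module, Other}).

%% Returns {Entry, Module1}, where Entry is a fun or a stub.
flookup(Module, Fun, Arity, _Create) when is_function(Fun) ->
    case is_function(Fun, Arity) of
        true  -> {Fun, Module};
        false -> error({bad_arity, Arity})
    end;
flookup({module, Exports}, Name, Arity, Create)
  when is_atom(Name), is_integer(Arity), Arity >= 0 ->
    case maps:find({Name, Arity}, Exports) of
        {ok, Entry} ->
            {Entry, {module, Exports}};
        error when Create ->
            Stub = {stub, Name, Arity},
            {Stub, {module, Exports#{{Name, Arity} => Stub}}};
        error ->
            error({undefined_function, Name, Arity})
    end;
flookup(_Module, Name, Arity, _Create) ->
    error({bad_function, Name, Arity}).

%% Models M:F(Args...) with M and F both written as atoms (KM = KF = true).
%% A real system would keep the dummy entries so that a later install
%% could fill them in; this pure model simply drops them.
call(Env, M, F, Args) ->
    {Module, _Env1} = mlookup(Env, M, true),
    {Entry, _Module1} = flookup(Module, F, length(Args), true),
    deref(Entry, Args).

deref(Fun, Args) when is_function(Fun) -> apply(Fun, Args);
deref({stub, Name, Arity}, _Args) -> error({undef, Name, Arity}).
```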

At this point I am not concerned about {Module,Name} pairs.  I take it
that they are going to disappear.

The reason that I made the %deref part explicit is that we expect
    m:f(X, Y)
to turn into something like
    ld 'm:f/2', %l7
    ld [X], %o0
    call %l7
    ld [Y], %o1
which is just one load instruction longer than a normal call.
Part of the normal process of loading an object module into memory
and preparing it for execution is doing the m:f/2 lookup at load
time.

Now that modules have parameters, we expect something more to happen:
when we supply parameters to a module, we expect "preparing for execution"
to involve substituting parameter values for parameters and doing constant
propagation and any other optimisations enabled by that.  I call this
step BINDING: filling in parameters (including the addresses of the cells
where the addresses of imported functions are or will be stored), constant
propagation, other optimisation, and generation of native code.

In the present setup, there is only one module environment, so every module
has the same value for %ME.  But there is no reason why that HAS to be so.
And supplying a module with an environment to use for %ME is no different
in principle from supplying a binding for any other parameter.

The result of binding a module (to an environment, any other parameters,
and of course a particular node) is a data structure in the shared heap
of a node.  While there is a name in the -module declaration, the module
is not yet known by that name.  A bound module is *just* a data structure.
It refers to *other* modules by name, but that doesn't mean that they
can refer to it.

So imagine we have

    {ok,Object} = %compile(Source),
    {ok,Module} = %bind(Object, %ME, [Arguments]),

Now we have something that we can pass around and invoke functions in:

    Module:module_info()

for example.  There's one more step:  INSTALLING.  A module may be
installed under some name.

    ok = %install(%ME, Name, Module)

A module may be installed under any number of names.  If I want to talk
about colour:yellow() and you want to talk about color:yellow(), there's
no reason why we can't be talking about the same thing.  (Currently there
is an assumption that two different module names refer to two different
modules, but since nothing stops the bodies of the source forms being
identical, I doubt whether anyone relies on that.)

You can install a module under a name that is not used, or that has a
dummy definition created by %mlookup().  It might be advisable to have
a separate

    Old_Module = %reinstall(%ME, Name, Module)

for when you expect the module name to be in use, and

    Old_Module = %uninstall(%ME, Name)

for when you want to undefine a module.  All these names have % signs;
just as the low level code management functions in Erlang shouldn't be
used in ordinary user code, so these things need to have higher level
wrappers.  That's another reason this is a draft.
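
To make the three operations concrete, here is a small sketch that models
an environment as an immutable map.  Everything in it is an assumption
made for illustration: the module name install_model, the error terms,
the rule that a module with no exports counts as a dummy, and the fact
that each function returns the new environment (a real environment would
be updated in place, so %install could return plain ok).

```erlang
-module(install_model).
-export([install/3, reinstall/3, uninstall/2]).

%% Assumed representation: environment :: #{atom() => {module, Exports}}.
%% A dummy created by a lookup is modelled as {module, #{}}, so this
%% sketch cannot tell a dummy from a real module with no exports.

%% Install under a name that is unused or holds only a dummy.
install(Env, Name, Module) when is_atom(Name) ->
    case maps:find(Name, Env) of
        error ->
            {ok, Env#{Name => Module}};
        {ok, {module, Exports}} when map_size(Exports) =:= 0 ->
            {ok, Env#{Name => Module}};
        {ok, _InUse} ->
            error({already_installed, Name})
    end.

%% Replace a module that is expected to be there; returns the old one.
reinstall(Env, Name, Module) when is_atom(Name) ->
    case maps:find(Name, Env) of
        {ok, Old} -> {Old, Env#{Name => Module}};
        error     -> error({not_installed, Name})
    end.

%% Remove a name; returns the module that was installed under it.
uninstall(Env, Name) when is_atom(Name) ->
    case maps:take(Name, Env) of
        {Old, Env1} -> {Old, Env1};
        error       -> error({not_installed, Name})
    end.
```

Installing one bound module under both colour and color is then just two
install calls with the same Module value.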

All of this would just be a pointless variation on what we have now
except for one thing.  Who said there could only be one module environment?

Suppose we want a flotilla of three modules 'a', 'b', 'c', with 'a' as the
facade of the whole thing, called 'm'.

    E = %module:new_environment(),
    {ok,AO} = %module:load(a),
    {ok,BO} = %module:load(b),
    {ok,CO} = %module:load(c),
    {ok,AB} = %module:bind(AO, E, []),
    {ok,BB} = %module:bind(BO, E, []),
    {ok,CB} = %module:bind(CO, E, []),
    ok = %module:install(E, %module:name(AO), AB),
    ok = %module:install(E, %module:name(BO), BB),
    ok = %module:install(E, %module:name(CO), CB),
    ok = %module:install(%ME, m, AB)

In environment E, the only modules are a, b, c.
In %ME, the module a is accessible under the name m,
and b and c cannot be named at all.
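
The visibility rule can be tried out with a toy model.  This is a sketch
under heavy assumptions, none of which come from the proposal itself:
environments are ETS tables (so that they can be updated in place), an
object module is a map of funs, binding merely pairs the object with its
home environment, and every function receives that home environment as an
explicit first argument standing in for %ME.  The name flotilla_model is
invented.

```erlang
-module(flotilla_model).
-export([demo/0]).

%% A mutable environment: an ETS table of {Name, BoundModule} rows.
new_environment() -> ets:new(module_env, [set, public]).

%% Binding records which environment the module resolves names in.
bind(Object, Env) -> {bound, Env, Object}.

install(Env, Name, Bound) ->
    true = ets:insert_new(Env, {Name, Bound}),
    ok.

%% Name:F(Args...) as seen from Env.  The callee runs with its own
%% home environment, not the caller's.
call(Env, Name, F, Args) ->
    case ets:lookup(Env, Name) of
        [{Name, {bound, Home, Object}}] ->
            Fun = maps:get({F, length(Args)}, Object),
            apply(Fun, [Home | Args]);
        [] ->
            error({unknown_module, Name})
    end.

demo() ->
    Root = new_environment(),            % stands in for the global %ME
    E = new_environment(),
    AO = #{{hello, 0} => fun(ME) -> {a, call(ME, b, hello, [])} end},
    BO = #{{hello, 0} => fun(ME) -> {b, call(ME, c, hello, [])} end},
    CO = #{{hello, 0} => fun(_ME) -> c end},
    AB = bind(AO, E),
    BB = bind(BO, E),
    CB = bind(CO, E),
    ok = install(E, a, AB),
    ok = install(E, b, BB),
    ok = install(E, c, CB),
    ok = install(Root, m, AB),
    Visible = call(Root, m, hello, []),
    Hidden = try call(Root, b, hello, [])
             catch error:{unknown_module, b} -> not_visible
             end,
    ets:delete(E),
    ets:delete(Root),
    {Visible, Hidden}.
```

Calling m from the root environment reaches b and c through E, while
naming b directly from the root environment fails.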

This fixes one weakness in the package system, which is that
the package system does not provide any way of protecting the internal
modules of an application.  If one module can access module X under
the package system, then EVERY module can access module X.  Under the
"multiple environment" scheme, we get encapsulation we can rely on.

Of course, it's too bad that E doesn't contain any of the built in
Erlang modules.  So we really need

    E = %module:new_environment(),
    ok = %module:install_erlang_modules(E),
    ...

Now we do have a problem, and it's this.  Under the current scheme,
if a module m1 wants to tell module m2 to call back to m1 for some
things, it can do

    -module(m1).
    ...
    f(Init) ->
	m2:start(Init, m1).

That doesn't work too well in the reified environments scheme,
where we expect m1 to be usable when it doesn't *have* a name, not yet
being registered anywhere, or when it is installed under some other name,
and where we would like m2 to be able to call back to m1 despite m1 and
m2 having different name spaces.  The answer is to use

    -module(m1).
    ...
    f(Init) ->
	m2:start(Init, ?MODULE).

and just change the expansion of ?MODULE to something that returns the
actual module, not its name.
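
The point is that the source text does not change.  As a sketch, here is
the same pattern in present-day Erlang, where ?MODULE still expands to
the module's name; the module callback_demo and its functions are made up
for illustration, and start/2 plays the part of m2:start/2.

```erlang
-module(callback_demo).
-export([f/1, start/2, handle/1]).

%% The caller never spells out its own name.
f(Init) ->
    start(Init, ?MODULE).

%% Stand-in for m2:start/2: it calls back into whatever it was given.
start(Init, Callback) ->
    Callback:handle(Init).

handle(X) ->
    {handled, X}.
```

Under the proposal only the expansion of ?MODULE would differ; the call
Callback:handle(Init) would receive a module instead of an atom.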

Passing module names around between modules that were *bound* in the
same environment *does* work, without any rewriting.

Unlike packages, this doesn't actually involve any language changes, or
any source-to-source transformation done by the compiler, so there are no
corner cases where the transformations don't quite work.

Now we add one more wrinkle.  Instead of a module environment being
a map from module names to modules, it becomes a PAIR of maps, one from
module names to modules, and one from environment names to environments.
I have to go home in a few minutes, so I'm going to just outline the idea.

    M1:M2:M3:F(X1, ..., Xn)

=>  (%deref(%flookup(%mlookup(%elookup(%elookup(%ME, M1), M2), M3, KM),
                     F, N, KF)))(X1, ..., Xn)

where %elookup looks up an environment in the "environment fork" of an
environment, while

    :M1:M2:M3:F(X1, ..., Xn)

=> (%deref(%flookup(%mlookup(%elookup(%elookup(%EE, M1), M2), M3, KM),
                     F, N, KF)))(X1, ..., Xn)

where %EE is an Erlang "root environment".  And thus we get packages.
You can use names like :erlang:lists:reverse(Xs) or you can just
ensure that lists:reverse(Xs) works by installing :erlang:lists in your
environment.
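
A sketch of the pair-of-maps idea, again with invented names and an
assumed representation ({env, Modules, Environments}), and ignoring the
KM/KF create flags entirely:

```erlang
-module(nested_env_model).
-export([new/0, add_module/3, add_env/3, elookup/2, resolve/2]).

%% Assumed representation: {env, Modules, Environments}, two maps
%% keyed by atom.

new() -> {env, #{}, #{}}.

add_module({env, Ms, Es}, Name, Module) -> {env, Ms#{Name => Module}, Es}.

add_env({env, Ms, Es}, Name, Child) -> {env, Ms, Es#{Name => Child}}.

%% Look a name up in the "environment fork" only.
elookup({env, _Ms, Es}, Name) when is_atom(Name) ->
    case maps:find(Name, Es) of
        {ok, Child} -> Child;
        error       -> error({unknown_environment, Name})
    end.

%% resolve(Env, [M1, M2, M3]): every element but the last names an
%% environment, the last names a module.
resolve({env, Ms, _Es}, [ModName]) ->
    case maps:find(ModName, Ms) of
        {ok, Module} -> Module;
        error        -> error({unknown_module, ModName})
    end;
resolve(Env, [EnvName | Rest]) ->
    resolve(elookup(Env, EnvName), Rest).
```

With a root environment holding an erlang environment that holds lists,
resolve(Root, [erlang, lists]) corresponds to :erlang:lists, and adding
that same module to your own environment makes resolve(Mine, [lists])
find it too.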

The thing I have to skip now is what something like m1:m2:m3 or
:m1:m2:m3 means on its own, and yes, that can be done in a context-
independent way.