[erlang-questions] RFC: On `inet_tcp_dist` and `erl_epmd` interaction

Mon Oct 24 15:27:20 CEST 2011

== Summary ==

    I've found that it is "theoretically" possible to override the
behavior of the default `erl_epmd` module with a custom but
"compatible" module, without touching the `kernel` application (using
only configuration directives). I call this method "theoretical"
because the way in which the modules `erl_epmd` and `inet_tcp_dist`
(or any of the `inet_*_dist` family) interact makes them inseparable
in practice.

    I'm writing this email because I want to help make the default
`erl_epmd` module overridable in a correct, simple, and minimally
intrusive way. (By "I want to help" I mean that I am offering to
discuss, write, document and test the code.)


== Problem description ==

    There is a function `net_kernel:epmd_module` which, according to
the comment in the source code, should (quote): "return module_name
of erl_epmd or similar gen_server_module".
        https://github.com/erlang/otp/blob/OTP_R14B04/lib/kernel/src/net_kernel.erl#L1283

    Unfortunately, its only use is in `erl_distribution.erl`, to start
the `gen_server` process.
        https://github.com/erlang/otp/blob/OTP_R14B04/lib/kernel/src/erl_distribution.erl#L39

    All the other important modules (`inet_*_dist`, `net_adm`) use
the `erl_epmd` module directly, without the `net_kernel` indirection.
        https://github.com/erlang/otp/blob/OTP_R14B04/lib/kernel/src/inet_tcp_dist.erl#L70
        https://github.com/erlang/otp/blob/OTP_R14B04/lib/kernel/src/inet_tcp_dist.erl#L254

    As a result it is impossible to actually replace the way in which
`inet_*_dist` modules resolve the transport layer address (more
exactly the port) of the other nodes.
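
    To make the missing indirection concrete, here is a minimal
sketch of what a lookup routed through `net_kernel:epmd_module` could
look like on the caller's side. The module name
`epmd_indirection_sketch` and the helper `resolve_port/2` are
hypothetical (they exist only for illustration); the only existing
functions assumed are `net_kernel:epmd_module/0` and a `port_please/2`
exported by whatever module it returns.

```erlang
-module(epmd_indirection_sketch).
-export([resolve_port/2]).

%% Hypothetical helper: resolve the port of node Name on Host through
%% the module reported by net_kernel:epmd_module/0, instead of
%% hard-coding erl_epmd as the inet_*_dist modules currently do.
resolve_port(Name, Host) ->
    Mod = net_kernel:epmd_module(),
    Mod:port_please(Name, Host).
```

    With no override configured, `Mod` is simply `erl_epmd`, so the
behavior would be the same as today.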


== Problem analysis ==

    I think there are two possible purposes of `net_kernel:epmd_module`:
    a) to give the name of a module which should export a `start_link`
function, which in turn spawns a process, registering under the name
`erl_epmd` and responding to `erl_epmd` messages in a proper manner
(thus implementing the "internal" `erl_epmd` protocol and, as a
backend, maybe the TCP-based EPMD protocol);
    b) or to give the name of a module which should export the
`register_node/2`, `port_please/2`, `names/0`, and `names/1` functions
which should act according to the specs in `erl_epmd` (thus
implementing the `erl_epmd` "interface" / behavior);

    As such there is a decision between "implementing a message
protocol" or "implementing an interface". I.e.:
    * in the first case (implementing the `erl_epmd` internal
protocol) the overriding module receives messages, and responds to
them in a proper manner; but the "clients" still use the `erl_epmd`
module as a frontend (which in turn sends messages to the named
`erl_epmd` process);
    * in the second case *all* clients should use the overriding
module (via `net_kernel:epmd_module`), and this one in its turn is
free to implement the "interface" functions as it sees fit as long as
it doesn't break the spec;
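
    As an illustration of the second case, a module implementing the
"interface" could be as small as the sketch below. Everything in it is
an assumption made for the example: the module name
`static_epmd_sketch`, the fixed name-to-port table, the creation value
`1`, and the version number `5` in the `port_please/2` reply. The
return shapes (`{ok, Creation}`, `{port, Port, Version}`, `noport`,
`{ok, [{Name, Port}]}`) are meant to mirror what `erl_epmd` returns.

```erlang
-module(static_epmd_sketch).
-export([register_node/2, port_please/2, names/0, names/1]).

%% Hypothetical static table mapping node names to ports; a real
%% module might read this from configuration instead.
table() ->
    [{"alpha", 4370}, {"beta", 4371}].

%% There is no daemon to register with, so just report success with
%% an arbitrary creation value.
register_node(_Name, _Port) ->
    {ok, 1}.

%% Look the node up in the static table, ignoring the host.
port_please(Name, _Host) ->
    case lists:keyfind(to_string(Name), 1, table()) of
        {_, Port} -> {port, Port, 5};
        false -> noport
    end.

names() ->
    names(localhost).

names(_Host) ->
    {ok, table()}.

%% Accept node names given either as atoms or as strings.
to_string(Name) when is_atom(Name) -> atom_to_list(Name);
to_string(Name) when is_list(Name) -> Name.
```

    Such a module is only useful if *all* clients reach it through
`net_kernel:epmd_module`, which is exactly what is missing today.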

    Now, the way in which `net_kernel:epmd_module` is used (only once,
to start the server) and the fact that all `inet_*_dist` modules use
the `erl_epmd` module directly suggest that the initial plan was to
go with solution a) -- i.e. the overriding module should register a
process under the well-established name, and it should respond to
messages. (This is also suggested by the documentation quote: "or
similar gen_server_module".)

    Unfortunately, the way in which the `erl_epmd` module is
implemented suggests method b). Actually, it is even worse, as the
two approaches are mixed:
    * half of the functionality is implemented by delegating work to a
`gen_server` process, see `register_node` function:
        https://github.com/erlang/otp/blob/OTP_R14B04/lib/kernel/src/erl_epmd.erl#L108
    * and half is implemented by directly executing the code in the
"client" process, see `port_please` and `names` functions, which in
turn call `get_port` and `get_names`:
        https://github.com/erlang/otp/blob/OTP_R14B04/lib/kernel/src/erl_epmd.erl#L292
        https://github.com/erlang/otp/blob/OTP_R14B04/lib/kernel/src/erl_epmd.erl#L418


== Solution ==

    In my opinion, method a) (as presented above, i.e. implementing
the internal `erl_epmd` protocol by a named process) is the one most
"in-line" with OTP principles. (But even b) could work.)

    Thus, in order to touch the existing code as little as possible, I
would propose to:
    * update the `erl_epmd` module, so that all the "public" functions
(i.e. `port_please`, `names`, etc.) in fact send a message through
`gen_server:call` to the process registered under the `erl_epmd` name
(as `register_node` does);
    * have the default implementation of `handle_call` in `erl_epmd`
spawn a new process, which calls the internal `get_port` or
`get_names` and replies to the original call via `gen_server:reply`
(to keep the concurrency model as it is now, without serializing
requests);
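
    The two points above can be sketched roughly as follows. This is
not a patch against `erl_epmd`, only a self-contained illustration of
the pattern: the module name `epmd_server_sketch` is hypothetical, the
process registers under its own module name (not `erl_epmd`) so that
it cannot clash with the real server, and `get_port/2` is a stub
standing in for the real internal function.

```erlang
-module(epmd_server_sketch).
-behaviour(gen_server).

-export([start_link/0, port_please/2]).
-export([init/1, handle_call/3, handle_cast/2, handle_info/2,
         terminate/2, code_change/3]).

start_link() ->
    gen_server:start_link({local, ?MODULE}, ?MODULE, [], []).

%% "Public" function: a thin wrapper that only sends a message to the
%% registered process, as register_node already does in erl_epmd.
port_please(Name, Host) ->
    gen_server:call(?MODULE, {port_please, Name, Host}).

init([]) ->
    {ok, nostate}.

%% Run the lookup in a separate process and reply from there, so that
%% a slow lookup does not serialize the other requests.
handle_call({port_please, Name, Host}, From, State) ->
    spawn(fun() -> gen_server:reply(From, get_port(Name, Host)) end),
    {noreply, State};
handle_call(_Other, _From, State) ->
    {reply, {error, unsupported}, State}.

handle_cast(_Msg, State) -> {noreply, State}.
handle_info(_Info, State) -> {noreply, State}.
terminate(_Reason, _State) -> ok.
code_change(_OldVsn, State, _Extra) -> {ok, State}.

%% Stub standing in for the real get_port; the name, port and version
%% used here are made up for the example.
get_port("alpha", _Host) -> {port, 4370, 5};
get_port(_Name, _Host) -> noport.
```

    An overriding module would then only need to register a process
under the expected name and answer the same calls, without the
"clients" having to change at all.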


== Conclusion ==

    For me -- and the project I'm involved in -- it is really
imperative to be able to replace the way in which ports are resolved.
I could do this by branching OTP, and maintaining a set of patches.
But I would prefer (and I think it could benefit others too) to "fix"
the current situation.

    As stated in the summary, I'm offering to write the patch and test
it. But before I come up with a patch, I want to ask for feedback as
maybe I've missed something. Therefore any feedback is very important
to me.

    Thanks for your time (as the email is quite long) :)
    Ciprian.


    P.S.: I can describe the reasons why I want to replace the current
`erl_epmd` module in a different thread. (There are actually two
different but related reasons, one of them not directly tied to this
problem, but both related to the `-no_epmd` option, which I've tried
to discuss in a previous thread.)