EEP 53: Process aliases preventing late replies reaching clients

Thu Oct 29 16:31:38 CET 2020

EEP 53 has been updated with the addition of a new tag option to the
monitor/3 BIF. This is in order to address an issue raised in the PR
<https://github.com/erlang/otp/pull/2735> containing the prototype
implementation. The updated EEP follows.

Regards,
Rickard Green

    Author: Rickard Green <rickard(at)erlang(dot)org>
    Status: Draft
    Type: Standards Track
    Erlang-Version: 24.0
    Created: 01-Sept-2019
    Post-History:
****
EEP 53: Process aliases preventing late replies reaching clients
----

Abstract
========

Currently there exists no lightweight mechanism for preventing late replies
from a server to a client after a timeout or connection loss has occurred.
The only way to prevent late replies today is to make the request via
a proxy process.

The proposed process alias feature is a lightweight mechanism that solves
the above problem. A process alias is similar to a registered name that
is used temporarily while a request is outstanding. If the request times
out or the connection to the server is lost, the alias is deactivated which
prevents a late reply from reaching the client.

Copyright
=========

This document has been placed in the public domain.

Specification
=============

An alias is of the Erlang type `reference()` and can be used as destination
when sending using the `!` operator, or when sending using the
`erlang:send()` and `erlang:send_nosuspend()` BIFs. An alias can be used
both on the local node and on remote nodes in a distributed system. The
alias identifies a process that exists, or has existed, on the node with
the node name returned by `node(Alias)`.

All references will, as of this, be accepted as destination in the message
send operations listed above. If the reference is not an alias, or is a
previous alias that has been deactivated, the message will silently be
dropped.

These new BIFs are introduced:

* `alias/0`, `alias/1`. The `alias()` BIF creates and returns an alias which
  can be used when sending messages to the process that called the `alias()`
  BIF.

* `unalias/1`. The `unalias(Alias)` BIF deactivates an alias that identifies
  the calling process. The BIF returns `true` if the alias `Alias`
  identified the calling process and thus was deactivated; otherwise, no
  change of the alias state was made and `false` is returned.

* `monitor/3`. The `monitor/3` BIF is an extension of the `monitor/2` BIF
  where the third argument is an option list. As of its introduction it
  accepts two options:
    * `{alias, unalias | demonitor | reply_demonitor}`. The first element of
      the two-tuple indicates that we want the returned monitor reference to
      also work as an alias. The second element determines how the alias
      should be deactivated:
        * `unalias` - The alias will remain until it has been deactivated
          by the `unalias/1` BIF.
        * `demonitor` - The alias will be deactivated when the monitor is
          deactivated. That is, either when the `demonitor()` BIF is called
          on the monitor, or when the monitor is automatically deactivated
          by the reception of a `'DOWN'` message. The alias can still be
          deactivated before this happens by calling the `unalias/1` BIF.
        * `reply_demonitor` - The alias will be deactivated when either the
          monitor is deactivated or a message that has been passed using the
          alias is received. If the alias is deactivated due to a message
          passed using the alias, the monitor is also deactivated as if the
          `demonitor()` BIF had been called.

    * `{tag, UserDefinedTag}`. This will replace the default `Tag` with
      `UserDefinedTag` in the monitor message delivered when the monitor is
      triggered. For example, when monitoring a process, the `'DOWN'` tag in
      the down message will be replaced by `UserDefinedTag`.
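The basic life cycle of an alias described above can be sketched as follows. This is only an illustration, assuming an Erlang/OTP release that includes the alias BIFs (OTP 24 or later); the module name, the function name, and the message shapes are made up for the example and are not part of the proposal.

```erlang
-module(alias_demo).
-export([late_reply_dropped/0]).

%% Returns {FirstReply, UnaliasResult, LateReply}.
late_reply_dropped() ->
    Alias = alias(),
    %% Hypothetical server: replies once at once, and once more when
    %% told to. Both replies are sent via the alias, not via a pid.
    Server = spawn(fun() ->
                       Alias ! {Alias, first},
                       receive send_late -> Alias ! {Alias, late} end
                   end),
    First = receive {Alias, R1} -> R1 after 1000 -> timeout end,
    %% Deactivate the alias. unalias/1 returns true since the alias
    %% was active and identified the calling process.
    Deactivated = unalias(Alias),
    MRef = monitor(process, Server),
    Server ! send_late,
    %% Wait for the server to terminate, so we know that the late
    %% send has already been made when we look for it.
    receive {'DOWN', MRef, process, Server, _} -> ok end,
    Late = receive {Alias, R2} -> R2 after 100 -> dropped end,
    {First, Deactivated, Late}.
```

The first reply is delivered since the alias is active when it is sent. The second one is sent to a deactivated alias and should therefore never show up in the message queue of the caller.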

The `spawn_opt()` and `spawn_request()` BIFs have also been extended to
accept an option `{monitor, MonitorOpts}` where `MonitorOpts` corresponds to
the option list of the `monitor/3` BIF.
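As a rough illustration of the `{monitor, MonitorOpts}` option, the sketch below spawns a worker with `spawn_opt/2` and lets the returned monitor reference double as an alias. It assumes Erlang/OTP 24 or later. The module name, the `worker_down` tag, and the `{request, ReplyTo}` protocol are hypothetical names chosen for the example.

```erlang
-module(spawn_monitor_demo).
-export([run/0]).

run() ->
    {Pid, MRef} =
        spawn_opt(fun() ->
                      %% The worker replies via the reference it is given.
                      receive
                          {request, ReplyTo} -> ReplyTo ! {ReplyTo, pong}
                      end
                  end,
                  [{monitor, [{alias, reply_demonitor},
                              {tag, worker_down}]}]),
    %% The monitor reference is also an alias for this process.
    Pid ! {request, MRef},
    receive
        {MRef, Reply} ->
            %% The reply deactivated both the alias and the monitor.
            {ok, Reply};
        {worker_down, MRef, process, Pid, Reason} ->
            {error, Reason}
    after 1000 ->
        timeout
    end.
```

Since the parent does not know the monitor reference until `spawn_opt/2` has returned, it has to send it to the child in a message, which is the delay that the `tag` rationale further down addresses for `spawn_request()`.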

Full documentation of these BIFs and options can be found via
[pull request #2735](https://github.com/erlang/otp/pull/2735)
containing the reference implementation.

It is not possible to retrieve the process identifier of the process
identified by an alias, and it is not possible to test if a reference is an
alias or not.

Motivation
==========

As previously stated, it is possible to prevent late replies by using a
proxy process that forwards the reply to the client. By spawning the proxy
process and sending its process identifier to the server instead of the
client's own process identifier, the proxy can be terminated when the
operation times out or the connection is lost. Since the proxy process
is then no longer alive, a late reply will be silently dropped and no stray
message will reach the previous client of the request. This, however, makes
the code both more complicated and less efficient than it needs to be. The
inefficiency comes both from the need to create, schedule, execute, and
terminate the proxy process and from the extra copying of data via the
proxy process.

When the author of the client code has full control over the client process
such late replies can be handled without a proxy since the code can be
aware of these potential stray messages and drop them when received. This
is, however, not possible when implementing library code. You then either
need to use a proxy process, as done by the `gen_statem` behavior, or
accept that the client process may get stray messages after a call, as
done by the `gen_server` behavior.

Process aliases solve these issues with a very small overhead.
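A minimal sketch of a library style call without a proxy process could look like this. It assumes Erlang/OTP 24 or later and is not the actual `gen_server` code; the module name and the `{request, ReqId, Request}` and `{reply, ReqId, Reply}` message formats are assumptions made for the example.

```erlang
-module(alias_call).
-export([call/3, echo_server/0]).

%% Client side: the monitor reference is also used as reply address.
call(Server, Request, Timeout) ->
    ReqId = monitor(process, Server, [{alias, reply_demonitor}]),
    Server ! {request, ReqId, Request},
    receive
        {reply, ReqId, Reply} ->
            {ok, Reply};
        {'DOWN', ReqId, process, _Pid, Reason} ->
            {error, Reason}
    after Timeout ->
        %% Deactivates the monitor and, with it, the alias. Replies
        %% sent after this point are dropped.
        demonitor(ReqId, [flush]),
        %% A reply may have arrived just before the deactivation.
        receive
            {reply, ReqId, Reply} -> {ok, Reply}
        after 0 ->
            {error, timeout}
        end
    end.

%% Hypothetical server used to try out call/3.
echo_server() ->
    receive
        {request, ReplyTo, {echo_after, Ms, Term}} ->
            timer:sleep(Ms),
            ReplyTo ! {reply, ReplyTo, Term},
            echo_server()
    end.
```

With a server started as `spawn(fun alias_call:echo_server/0)`, a call such as `alias_call:call(Server, {echo_after, 200, late}, 50)` should time out without a stray `{reply, _, late}` message turning up in the caller later on.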

Rationale
=========

Why use the reference data type for alias?
------------------------------------------

This is more or less what the reference data type is there for: a data type
that can identify a huge number of different entities. References are unique
and contain a node identifier identifying the node they originate from.
This makes it easy to identify a specific process on a specific node while
also identifying different aliases created by the same process. The embedded
node identifier makes it easy to provide distribution transparency.

Why not make alias an opaque data type?
---------------------------------------

The expected most common use case is a client-server request, such as
`gen_server:call()`. Client-server requests in Erlang are typically made
while monitoring the server from the client. In order to minimize the data
produced and sent in the request we want to reuse the reference created for
identification of the monitor to also function as an alias. Since the
monitor identifier is documented as a reference and is not opaque (which one
can argue was a design mistake when introducing monitors), it becomes hard
not to document the type of an alias as a reference as well.

Why not allow references as registered names in the already existing API?
-------------------------------------------------------------------------

There are two reasons: distribution transparency and scalability.

Distribution transparency is really desirable since the user can use the
functionality the same way regardless of whether it is a node local
operation or a node remote operation. The name registration API is not
distribution transparent.

Regarding scalability: due to how the name registration API has been
designed we need some sort of table in order to implement the API. This
table will be written to and read from by processes that are executing in
parallel. In the use case we are focusing on, names (aliases) are expected
to be temporary and created in huge amounts. That is, there will be large
amounts of modifications of this table from processes executing on different
processors. This will make it challenging to implement such a table that
scales well.

In the proposed solution the information needed to route the message to the
correct place is saved in the alias itself, the reference. The information
needed to determine if the message passed via the alias should be dropped or
passed along is saved in the process identified by the alias. That is, all
information needed is distributed to where it is needed instead of being
centralized in a node global table. This approach of distributed information
introduces no new synchronization points at all when it has been fully
implemented (more on that below), which will scale extremely well. An
implementation based on a node global table can *never* compete
scalability-wise with that.

The already existing functionality for registered names cannot be
implemented using this distributed information approach, but needs this
centralized storage of names. That is, the already existing API cannot be
used.

Besides the node identifier, a reference today contains three 32-bit words
of data, or in other words 96 bits of data. Of these 96 bits only 82 bits
are allowed to be passed over the distribution to another node. This is for
historical reasons. While a reference resides locally it can, however,
contain a more or less unlimited amount of data. 82 bits are not enough to
make a reference unique on the local node and at the same time uniquely
identify a node local process. In order to be able to store all information
needed in an alias, the reference data type needs to be extended.

In the proposed solution references used as aliases are extended to use
five 32-bit words on 64-bit architectures and four 32-bit words on 32-bit
architectures. Since that much data in a reference cannot be passed over
the distribution today, the reference implementation saves aliases that
are alive in a node global table. When a node local alias enters the local
node over the distribution one needs to look it up in this table in order to
be able to restore it to its actual value. While aliases are passed around
locally there is no need for look-ups in this table.

The reference implementation also modifies the distribution protocol to
allow references with up to five 32-bit values. For backwards compatibility
reasons this modification of the distribution protocol cannot be used at
once when aliases are introduced, since we need to be able to communicate
with older nodes from previous releases for a while. When this has been
living in the system for enough time (expected to be OTP 26) we can begin
sending references with up to five 32-bit words and remove the usage of the
table mapping references over the distribution to aliases. That is, it is
not until this happens that the alias implementation is fully complete.

Why is it not possible to get the PID of the process that an alias refers to?
-----------------------------------------------------------------------------

Most importantly there is no need to know the PID of the process that an
alias refers to in order to solve the problems that aliases are intended
to solve. The user is expected to utilize aliases in a protocol where one
knows whether a reference is an alias or not and should not need to know
the PID of the process that it refers to.

Besides the above there are also other issues with such functionality. The
content of a reference is just a large integer. In order to keep
distribution transparency one would either have to specify how this integer
should be interpreted or require synchronous signaling with the node where
the identified process resides. The synchronous signaling will be very
expensive. By specifying how the reference integer should be interpreted we
would prevent future changes to how the integer of the reference should be
interpreted, which might prevent future optimizations, improvements and new
features. Up until the time when large references with five 32-bit words can
be passed over the distribution, synchronous communication is also the only
option on how to implement such functionality.

If we should mimic the `whereis()` function of the registered name API where
you also can see if a name is currently registered, no other option than
synchronous signaling with the process identified by the alias is possible.

Why is it not possible to test if a reference is an alias?
----------------------------------------------------------

The same reason as to why it is not possible to get the PID of the
process that is referred to by an alias.

Why not allow registration of arbitrary Erlang terms instead?
-------------------------------------------------------------

Such a feature could solve the same issue that aliases are intended to
solve, but there are problems with such an approach.

Terms other than pids, ports, and references do not have a node identifier
embedded into the data type. For such data types you need some other way
to identify the node of where the name is registered. In the current case
of atoms as registered names, this is done by wrapping the name in a
two-tuple that contains the node name. Something like this is needed for
all other terms than just plain pids, ports, and references. This also
introduces a problem: is a two-tuple just a name or a name plus a node
identifier?

Should it be possible to register a PID as a name for another process?
This would force all send operations to first lookup the PID in the
table of registered names before performing the operation. This will
cost performance in all send operations. The same is true for ports.

We don't think registration of arbitrary terms should be implemented
due to the problems that arise. The current registration feature, which
only allows atoms, can however be a bit too limiting when you need to register
a number of processes for the same service. An option could be to allow
registration of two-tuples containing an atom and an integer. Perhaps
other terms such as strings should also be allowed, but arbitrary terms
should not be allowed.

Allowing references as registered names implies scalability bottlenecks
not present in the alias API. That is, this would be an inferior solution
to the problem we set out to solve.

One probably wants to extend name registration with more allowed terms
than just atoms, but this is for solving other problems than what aliases
are intended to solve. The name registration API does not fit aliases,
so we don't see that aliases should be combined with such an extension
of the registration API. The alias solution solves the problem we set out
to solve, so this EEP is limited to that.

Why is the tag option of monitor/3 introduced?
----------------------------------------------

When using the monitor option `alias` in a `spawn_request()` call you
get unnecessary delays since you cannot share the alias with the
child process until you have gotten the spawn reply with the process
identifier of the child process. You instead typically want to
explicitly create the alias before the `spawn_request()` call and pass
it as an argument to the child process.

In a typical scenario you want to receive a response or an error
of the operation. However, if you explicitly create an alias before
the `spawn_request()` operation, the monitor reference and the alias
will be different references. This will prevent the compiler from
optimizing the receive (to skip messages present in the message queue
when the reference was created) since not all receive clauses will
match on the same reference.

We solve this by using the `tag` monitor option as well as the
`reply_tag` spawn request option. The following is a fully functional rpc
implementation using this method on a system with the prototype
implementation of aliases:

    rpc(Node, M, F, A) ->
        Alias = alias([once]),
        ReqId = spawn_request(Node,
                              fun () ->
                                      Result = apply(M, F, A),
                                      Alias ! {{result, Alias}, Result}
                              end,
                              [{monitor, [{tag, {'DOWN', Alias}}]},
                               {reply_tag, {spawn_reply, Alias}},
                               {reply, error_only}]),
        receive
            {{result, Alias}, Result} ->
                demonitor(ReqId, [flush]),
                Result;
            {{'DOWN', Alias}, ReqId, process, _, Error} ->
                rpc_error_cleanup(Alias, Error);
            {{spawn_reply, Alias}, ReqId, error, Error} ->
                rpc_error_cleanup(Alias, Error)
        end.

    rpc_error_cleanup(Alias, Error) ->
        case unalias(Alias) of
            true ->
                %% No flush needed since we used the once option
                %% to alias(), and the alias was still active...
                error({rpc_error, Error});
            false ->
                %% Flush a possible result message...
                receive {{result, Alias}, Result} -> Result
                after 0 -> error({rpc_error, Error})
                end
        end.

The `tag` monitor option can be used in other situations as
well in order to get a single reference that is present in
all types of responses from a group of processes. The processes
may be pre-existing or not. This reference can then be utilized
to determine if a message corresponds to a specific operation
made to a specific group of processes.
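One way such a group operation might look is sketched below, assuming Erlang/OTP 24 or later. A single alias is created up front and is then present both in the replies and, through the `tag` option, in the monitor messages. The module name and the worker protocol are hypothetical, and the sketch deliberately leaves out timeout handling.

```erlang
-module(tag_group_demo).
-export([multi_call/2, worker/0]).

%% Sends Request to every pid and collects one outcome per pid. All
%% messages that we wait for carry the same reference, Alias.
multi_call(Pids, Request) ->
    Alias = alias(),
    Monitored = [{P, monitor(process, P, [{tag, {'DOWN', Alias}}])}
                 || P <- Pids],
    lists:foreach(fun({P, _}) -> P ! {request, Alias, Request} end,
                  Monitored),
    Results = [collect(Alias, P, MRef) || {P, MRef} <- Monitored],
    %% No more replies are expected; drop anything sent from now on.
    unalias(Alias),
    Results.

collect(Alias, Pid, MRef) ->
    receive
        {{reply, Alias}, Pid, Reply} ->
            demonitor(MRef, [flush]),
            {Pid, {ok, Reply}};
        {{'DOWN', Alias}, MRef, process, Pid, Reason} ->
            {Pid, {error, Reason}}
    end.

%% Hypothetical worker protocol, used only in this example.
worker() ->
    receive
        {request, _ReplyTo, crash} ->
            exit(crashed);
        {request, ReplyTo, Req} ->
            ReplyTo ! {{reply, ReplyTo}, self(), {done, Req}},
            worker()
    end.
```

Each element of the returned list is either `{Pid, {ok, Reply}}` or `{Pid, {error, Reason}}`, in the same order as the pids were given.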

There are plans to extend the receive optimization so that multiple
receives matching on the same reference in all clauses can utilize
the optimization. This will also improve performance for such
implementations receiving multiple messages matching on the same
reference.

The tag to use in the monitor message is stored locally in the
process that sets up the monitor and does not have to be
communicated between processes. Most importantly it does not
have to be sent over the wire in the distributed case. This also
means that it can be used when monitoring processes on older
nodes which do not support this functionality.

Backwards Compatibility
=======================

The alias feature is a pure extension, so there are no real backwards
compatibility issues.

In order to be able to communicate aliases over Erlang nodes from
previous releases we cannot pass large references over the distribution
and therefore need to keep information about aliases in a node global
table. The implementation benefits from being able to pass larger
references over the distribution, but will not do so until we can make
it mandatory to be able to handle such large references. Both OTP 24
and OTP 25 will be able to handle large references over the distribution
and since we only guarantee distribution compatibility with the two
closest releases backwards and forwards we can then make large
references mandatory in OTP 26.

This node global table for aliases introduces an overhead when utilizing
aliases compared to sending using the PID of the process. This is due
to allocation and manipulation of table structures. Compared to the
existing solution of utilizing a proxy process in order to
prevent stray messages, the overhead of this node global table for
aliases is small. Fortunately this node global table also only needs to
be present temporarily and can be removed in OTP 26.

Reference Implementation
========================

The reference implementation is provided by
[pull request #2735](https://github.com/erlang/otp/pull/2735).

Besides the implementation of the alias feature, the pull request also
contains usage of aliases in the gen behaviors such as `gen_server`. Due to
this it is now also possible to implement `receive_response()` functionality
similar to `erpc:receive_response()`, which has also been implemented:

* `gen_server:receive_response/2`
* `gen_statem:receive_response/2`
* `gen_event:receive_response/2`
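As an illustration, `gen_server:receive_response/2` could be used roughly as follows, assuming the API as it is available in Erlang/OTP 24 and later. The `slow_server` module, its `{sleep, Ms}` request, and the `demo/0` function are made up for the example.

```erlang
-module(slow_server).
-behaviour(gen_server).
-export([demo/0]).
-export([init/1, handle_call/3, handle_cast/2]).

%% Returns {Result, Stray}.
demo() ->
    {ok, Pid} = gen_server:start(?MODULE, [], []),
    ReqId = gen_server:send_request(Pid, {sleep, 200}),
    %% Give up after 50 ms. The request is abandoned at the timeout,
    %% so the reply sent after about 200 ms is not delivered to us.
    Result = gen_server:receive_response(ReqId, 50),
    timer:sleep(300),
    %% Check whether anything ended up in our message queue anyway.
    Stray = receive Msg -> {stray, Msg} after 0 -> none end,
    gen_server:stop(Pid),
    {Result, Stray}.

init([]) ->
    {ok, nostate}.

handle_call({sleep, Ms}, _From, State) ->
    timer:sleep(Ms),
    {reply, slept, State}.

handle_cast(_Msg, State) ->
    {noreply, State}.
```

When called from a process with an empty message queue, `slow_server:demo()` is expected to return `{timeout, none}`.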

[EmacsVar]: <> "Local Variables:"
[EmacsVar]: <> "mode: indented-text"
[EmacsVar]: <> "indent-tabs-mode: nil"
[EmacsVar]: <> "sentence-end-double-space: t"
[EmacsVar]: <> "fill-column: 70"
[EmacsVar]: <> "coding: utf-8"
[EmacsVar]: <> "End:"
[VimVar]: <> "vim: set fileencoding=utf-8 expandtab shiftwidth=4 softtabstop=4:"

--
Rickard Green, Erlang/OTP, Ericsson AB