[erlang-questions] RFC: expose dist protocol's term_to_binary/2 atom cache

Tue Jul 10 22:49:48 CEST 2018

Right now, the distribution protocol gets to play with toys that the rest
of us don’t: it can make sequential term_to_binary/1 calls more
space-efficient by lifting the atoms out of each term, converting them into
cheap EXT_ATOM_CACHE_REFs. It then emits the atom-cache updates as headers
in each data message.

I propose moving this logic to a more Erlang-userland accessible level.
Rather than atom caches being something that happen internally in dist.c,
atom caches could be exposed as a general feature of the External Term
Format in external.c.

Users could create new atom caches through a BIF (e.g. make_atom_cache/0)
and then fold the terms they wish to encode/decode over the atom cache,
which would either return a new atom cache (if the atom cache is an exposed
term) or would mutate the underlying data-structure of the atom cache (if
the atom cache is an opaque handle.) Users could then use another BIF to
ask the atom cache to create a binary description of an update.

The distribution protocol could then be reimplemented backwards-compatibly
in terms of these new BIFs, where the atom cache update in each data
message's header would simply be the return value of the "describe atom
cache update" BIF.

I've outlined below a few potential API designs for this idea. I'd like to
get feedback what people like/dislike about them. If there's clear
community consensus that one of these ideas is good, I'll move forward with
either a patch or an EEP (I personally feel this isn't the kind of big-deal
change that warrants an EEP, but correct me if I'm wrong.)
1 - Erlang-exposed atom cache

Using the hypothetical BIF make_atom_cache/0, a hypothetical atom_cache
option for term_to_binary/2, and the hypothetical BIF binary_to_term/2
which also takes an atom_cache option. Both term_to_binary/2 and
binary_to_term/2, when passed the atom_cache option, would return a {Result,
UpdatedAtomCache} tuple.

terms_to_binaries(Terms) ->
    terms_to_binaries(Terms, erlang:make_atom_cache()).
terms_to_binaries(Terms, ACache0) ->
    terms_to_binaries(Terms, ACache0, ACache0, []).
terms_to_binaries([], ACache0, ACache0, Bins) ->
    lists:reverse(Bins).
terms_to_binaries([], ACache0, ACacheFinal, Bins) ->
    ACacheDiff = erlang:diff_atom_cache(ACache0, ACacheFinal),
    ACacheDiffBin = erlang:term_to_binary(ACacheDiff),
    [ACacheDiffBin | lists:reverse(Bins)].
terms_to_binaries([Term | Terms], ACache0, ACachePrev, Bins) ->
    {Bin, ACacheCurr} = erlang:term_to_binary(Term, {atom_cache, ACachePrev}),
    terms_to_binaries(Terms, ACache0, ACacheCurr, [Bin | Bins]).
binaries_to_terms(Bins) ->
    binaries_to_terms(Terms, erlang:make_atom_cache()).
binaries_to_terms(Bins, ACache0) ->
    binaries_to_terms(Bins, ACache0, []).
binaries_to_terms([], ACache, Terms) ->
    {lists:reverse(Terms), ACache}.
binaries_to_terms([Bin | Bins], ACache0, Terms) ->
    {Term, ACache1} = erlang:binary_to_term(Bin, [{atom_cache, ACache0}]),
    binaries_to_terms(Bins, ACache1, [Term | Terms]).

2 - Exposed atom cache, but opaque transaction handles

Design #1 assumes you want functional-data-structure semantics for atom
caches the whole way through the encoding/decoding process, which can be
slow, and which really makes little sense given that there is no real use
for the intermediate atom-cache representations.

An alternate design would make the intermediate atom-cache representations
opaque—you’d have a handle into an atom-cache “transaction” (a piece of
opaque ERTS-mutable data in the owner process’s heap, like a NIF handle),
and committing the transaction would destroy this handle and return a
regular Erlang-visible representation of the updated atom-cache and a diff
from the start of the transaction:

terms_to_binaries(Terms) ->
    terms_to_binaries(Terms, erlang:make_atom_cache()).
terms_to_binaries(Terms, ACache) ->
    terms_to_binaries(Terms, atom_cache:transaction(ACache), []).
terms_to_binaries([], ACacheTx, Bins) ->
    {_UpdatedACache, ACacheDiff} = atom_cache:commit(ACacheTx),
    ACacheDiffBin = erlang:term_to_binary(ACacheDiff),
    [ACacheDiffBin | lists:reverse(Bins)].
terms_to_binaries([Term | Terms], ACacheTx, Bins) ->
    Bin = erlang:term_to_binary(Term, {atom_cache, ACacheTx}),
    terms_to_binaries(Terms, ACacheTx, [Bin | Bins]).
binaries_to_terms(Bins) ->
    binaries_to_terms(Terms, erlang:make_atom_cache()).
binaries_to_terms(Bins, ACache) ->
    binaries_to_terms(Bins, atom_cache:transaction(ACache), []).
binaries_to_terms([], ACacheTx, Terms) ->
    {UpdatedACache, _ACacheDiff} = atom_cache:commit(ACacheTx),
    {lists:reverse(Terms), UpdatedACache}.
binaries_to_terms([Bin | Bins], ACacheTx, Terms) ->
    Term = erlang:binary_to_term(Bin, [{atom_cache, ACacheTx}]),
    binaries_to_terms(Bins, ACacheTx, [Term | Terms]).

This still introduces a binary_to_term/2 BIF, but no longer requires
term_to_binary/2 or binary_to_term/2 to have multiple potential return
types. It also would require the new BIFs atom_cache:transaction/1 and
atom_cache:commit/1, as well as the make_atom_cache/0 BIF.
3 - Opaque atom cache (with or without transactions)

This overhead savings is meaningful, but potentially small if the client is
using term_to_binary/2 for a streaming protocol, where the atom-cache
transactions would have to be limited to small batch sizes in order to emit
chunks in a timely manner.

Further savings could be made if, rather than the atom cache being exposed
as a term between transactions, the atom cache itself was an opaque handle
for its entire lifetime. For example, an atom cache could exist as an ETS
table using the special table-type atom_cache, with its own backing
data-structure optimized for this use-case.

This would handily provide a framework for shared-access concurrency of
atom caches—concurrent decoder processes and even concurrent encoder
processes could be backed by a shared atom cache, allowing terms from the
same port, or destined for the same port, to be encoded/decoded
concurrently, without generating redundant atom-cache-update binaries in
the encoded stream.

Transactions, if still desirable, would still need to exist as separate
opaque state-tracking objects, in order for the transaction-commits to
produce valid atom-cache diffs corresponding to the terms that have been
encoded.

But, if updates to the atom_cache data structure are cheap enough,
transactions may able to be discarded as an approach in favour of each
individual term_to_binary/2 and binary_to_term/2 with the atom_cache option
passed performing an atomic write to the atom cache table, and returning an
individual-operation diff.

If, under an atomic-updates design, term_to_binary/2 simply prepended the
encoded atom-cache diff to its own encoded term, then this would make the
APIs of the two ETF functions entirely backwards-compatible with their
forms today, just now with each having side-effects on an optional passed
atom-cache and potentially different outputs (in content, but not in type)
when the atom_cache option is passed.

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://erlang.org/pipermail/erlang-questions/attachments/20180710/c97b998b/attachment.htm>