[erlang-questions] Extending term external format to support shared substructures

Tue Mar 31 03:19:17 CEST 2009

Does anyone have opinions on extending the external term format to
support shared substructures?

I think a simple way to implement this is to add two new tags:
DICTIONARY and WORD.

DICTIONARY is a special top-level-only tag (like COMPRESSED) that
indicates the term uses shared substructures.  Its encoding is:

    DICTIONARY: <<$D, Arity:32, Elements/bytes>>.

Elements is a sequence of Arity + 1 encoded terms; the first Arity
terms are stored in a dictionary for subsequent terms to reference,
and the entire binary decodes to the same as the last term in the
sequence.

WORD is a new general tag that references a term previously defined in
the dictionary.  Its encoding is:

    WORD: <<$w, Index:32>>.

Index is an offset into the dictionary; when a WORD tag is decoded, it
is replaced by a reference to the appropriate term already built in
the dictionary.  A WORD tag is not allowed to reference a term that
has not been fully built yet (to prevent circular data structures);
i.e., the nth term in the dictionary can be referenced by the (n+1)th
term in the sequence and any after it, but not within the nth term
itself or any early terms.

I'm inclined to also say it's an error to define a term in the
dictionary that is never referenced, but I can imagine reasons to
allow it, and I don't have a strong opinion either way.

I haven't thought of a good way to extend term_to_binary to support
generating such tags yet.  My best idea currently is to add an option
like {dictionary, Escape, ListOfTerms}, and then encode terms in
ListOfTerms in order as the dictionary, and to detect when encoding
{Escape, Index} and create a WORD tag instead of a tuple tag.  (I
imagine Escape would generally be a reference.)

Unless anyone is strongly opposed to the idea, I'll work on a
proof-of-concept patch.