[erlang-questions] Extending term external format to support shared substructures
Matthew Dempsky
matthew@REDACTED
Tue Mar 31 03:19:17 CEST 2009
Does anyone have opinions on extending the external term format to
support shared substructures?
I think a simple way to implement this is to add two new tags:
DICTIONARY and WORD.
DICTIONARY is a special top-level-only tag (like COMPRESSED) that
indicates the term uses shared substructures. Its encoding is:
DICTIONARY: <<$D, Arity:32, Elements/bytes>>.
Elements is a sequence of Arity + 1 encoded terms; the first Arity
terms are stored in a dictionary for subsequent terms to reference,
and the entire binary decodes to the same as the last term in the
sequence.
WORD is a new general tag that references a term previously defined in
the dictionary. Its encoding is:
WORD: <<$w, Index:32>>.
Index is an offset into the dictionary; when a WORD tag is decoded, it
is replaced by a reference to the appropriate term already built in
the dictionary. A WORD tag is not allowed to reference a term that
has not been fully built yet (to prevent circular data structures);
i.e., the nth term in the dictionary can be referenced by the (n+1)th
term in the sequence and any after it, but not within the nth term
itself or any early terms.
I'm inclined to also say it's an error to define a term in the
dictionary that is never referenced, but I can imagine reasons to
allow it, and I don't have a strong opinion either way.
I haven't thought of a good way to extend term_to_binary to support
generating such tags yet. My best idea currently is to add an option
like {dictionary, Escape, ListOfTerms}, and then encode terms in
ListOfTerms in order as the dictionary, and to detect when encoding
{Escape, Index} and create a WORD tag instead of a tuple tag. (I
imagine Escape would generally be a reference.)
Unless anyone is strongly opposed to the idea, I'll work on a
proof-of-concept patch.
More information about the erlang-questions
mailing list