[erlang-questions] Extending term external format to support shared substructures

Jachym Holecek freza@REDACTED
Tue Mar 31 10:43:20 CEST 2009


Hello,

# Matthew Dempsky 2009-03-31:
> Does anyone have opinions on extending the external term format to
> support shared substructures?

I was thinking it would be a good idea to have such option.

> I think a simple way to implement this is to add two new tags:
> DICTIONARY and WORD.
> 
> DICTIONARY is a special top-level-only tag (like COMPRESSED) that
> indicates the term uses shared substructures.  Its encoding is:
> 
>     DICTIONARY: <<$D, Arity:32, Elements/bytes>>.
> 
> Elements is a sequence of Arity + 1 encoded terms; the first Arity
> terms are stored in a dictionary for subsequent terms to reference,
> and the entire binary decodes to the same as the last term in the
> sequence.
> 
> WORD is a new general tag that references a term previously defined in
> the dictionary.  Its encoding is:
> 
>     WORD: <<$w, Index:32>>.

Take a look at Apple binary plist encoding for inspiration -- rather
similar to what you propose, except there's no need for 'Arity' and
that compound objects encoded this way are only allowed to reference
WORD subterms. It is then up to the encoder to decide how much effort
he puts into subterm uniquification. For plists it was a good idea
to uniquify basic types (atoms, strings, numbers, booleans) but not
compound types (array, dictionary) as that would be too costly.

> Index is an offset into the dictionary; when a WORD tag is decoded, it
> is replaced by a reference to the appropriate term already built in
> the dictionary.  A WORD tag is not allowed to reference a term that
> has not been fully built yet (to prevent circular data structures);
> i.e., the nth term in the dictionary can be referenced by the (n+1)th
> term in the sequence and any after it, but not within the nth term
> itself or any early terms.
> 
> I'm inclined to also say it's an error to define a term in the
> dictionary that is never referenced, but I can imagine reasons to
> allow it, and I don't have a strong opinion either way.

What would be the reasons to support it? No strong opinion either,
just curious.

> I haven't thought of a good way to extend term_to_binary to support
> generating such tags yet.  My best idea currently is to add an option
> like {dictionary, Escape, ListOfTerms}, and then encode terms in
> ListOfTerms in order as the dictionary, and to detect when encoding
> {Escape, Index} and create a WORD tag instead of a tuple tag.  (I
> imagine Escape would generally be a reference.)

I must be missing something. What prevents you from making this simply

  term_to_binary(Term, [dictionary])

? Also, the name 'dictionary' is somewhat confusing IMO, something
like 'unique' would perhaps be more clear. Maybe '{unique, Types}'
so that you can control which of basic types you want uniquified.

Regards,
	-- Jachym



More information about the erlang-questions mailing list