[erlang-questions] data sharing is outside the semantics of Erlang, but it sure is useful

Tue Sep 15 00:43:11 CEST 2009

James Hague writes:
 > I've run into several cases where enforcing the sharing of data
 > resulted in a significant memory savings.  I'm talking about a
 > reduction in heap size from 60MB to under half that. By "enforcing the
 > sharing of data" I mean making sure that identical elements in a data
 > structure are actually referencing the same locations in memory.
 > 
 > This is easy to do in Erlang, because the compiler is very literal:
 > 
 >    fix_tuple({H, H}) -> {H, H};
 >    ...
 > 
 > That ensures that identical looking elements in the tuple are sharing
 > memory locations.  But there is absolutely no reason the compiler has
 > to do this.  It would be perfectly valid to optimize away the entire
 > function, just returning the original value.
 > 
 > Would any existing standard library functions make this nicer?  What I
 > really want is to have a gb_trees:insert function that returns
 > {NewTree, InsertedValue} where InsertedValue references existing data
 > (unless it wasn't already in the tree; in that case, InsertedValue is
 > exactly what I passed in).  Then I can happily use InsertedValue,
 > knowing data is being shared.

Sounds like you want "hash consing". A hash table keeps track of all
non-immediate terms seen so far. To "intern" a new term you recurse
down to the leaves and compute hashes, on the way up you check if an
equivalent node (e.g. cons/2 or tuple/N) has been seen, and if so
you use that one otherwise you add the new node to the hash table.
Either way you return the interned node and its hash on the way up.

This may lose sharing with the input term, but interned terms will
share everything's that's sharable.

Erlang's default memory model doesn't allow same-node processes to
share memory(*), so you will lose sharing in message sends.

(*) Except in one binaries-related case which is irrelevant here.

A major downside of hash consing is that it can leak memory:
if an interned term becomes unreferenced in the application, the
hash table will still keep a master copy of it, wasting memory.
VMs with built-in support for hash consing usually also support
"weak references" or "weak hashes" where the referenced data can
be nuked if the GC determines that it should be dead.