[erlang-questions] term_to_binary and record improvements

Thu Aug 28 20:47:25 CEST 2008

I got to thinking about records and structs, and this lead me to
think about the behaviour of term_to_binary ..

term_to_binary has a misfeature that would cause problems in
implementing dynamic records.

term_to_binary does not efficiently encode shared data
structures. This is best illustrated by an example:

Consider this

-module(test3).
-compile(export_all).

test() ->
    Big = lists:duplicate(1000,a),
    X = {Big},
    Y = {Big,Big,Big,Big},
    {sizeOf(Big),sizeOf(X), sizeOf(Y)}.

sizeOf(T) -> size(term_to_binary(T)).

Look what happens when we run this:

1> c(test3).
{ok,test3}
2> test3:test().
{4007,4009,16027}

The third number in this tuple surprises me.  I had expected it to be
12 bytes larger than 4009. Internally Y is a pointer to four words (an
arity tag, with value 4), then 4 identical pointers. But the fact that
sizeOf(Y) is four times sizeOf(X) means that shared sub-structures in
Erlang terms do not become shared in the binary representation of the
term.

Since term_to_binary and binary_to_term are *extremely* useful and I
use them all the time it seems that it would be a great win to change
the internal representation to allow shared data structures.

Why do I want this?

I was thinking about the "record problem" - records are syntactic
sugar for tuples.

    If we say

    -record(person, {name, age}).
    X = #person{name="fred", age=30}

    Then at run-time we create a tuple {person, "fred", 30}

    Unfortunately, we loose the record field information at run-time,
thus we can't say X.name (to access the name field of X), but have to
say X#person.name in our code AND we have to have the record
definition available at compile time.

   An "easy" fix to this would be to *change* the run-time
representation of tuples.

   Suppose the run-time representation of the above record was
{person, [name,age], "fred", 30} - if this were the case then the
fields of the tuple would be self-describing and we could let the
compiler turn X.name into a function call lookup(X, name).

    When we update a tuple

    X1 = X#person{name="sue"} we are really just doing

    X1 = setelement(3, X, "sue").

    So if X was {person, [name,age], "fred", 30}
    then X1 will be {person, [name,age], "sue", 30}

    The additional heap space for X1 is 5 words (for the new tuple) +
the space to store "sue" (perhaps not even this, since I think it's a
shared program literal)

    The point is that the second argument of X1 is just a pointer copy
(internally) so even if the X1 looks long when printed it is space efficient
in heap storage.

    It seems to me that by changing the internal representation of an
instance of a record(foo, {tag1,tag2,tag3})

    From {foo, Val1, Val2, Val3} to

         {foo, [tag1,tag2,tag3], Val1, Val2, Val3}

    Would be a step in the right direction in solving the record
    problem.

     The problem is that long lists of records will have a space inefficent
representation if converted to binaries with term_to_binary

    Comments?

/Joe Armstrong