[erlang-questions] term_to_binary and record improvements

Thu Aug 28 21:29:23 CEST 2008

That term_to_binary doesn't share structure leads to surprising
behaviour in the io subsystem as well (at least for somebody new to
erlang).

I was writing a toy interpreter that passes the environment around.
Function definitions are just closures kept in the environment. It has
superlexical scoping (so redefinition isn't visible to previous
definisions).

Then the io system goes into loop when printing the environment for
bigger programs.

The problem is obvious (in retrospect). The environment of closures
contains closures that close over previous environment of closures. So
even though functions just print to,

#<fun ...>

the message to io system never finishes copying the recursive structure.

This "failure" makes sense within the context of what Erlang stands
for. But it is very strange for the printing of a fairly standard
functional structure to fail to print.

Another use I've been thinking for shared-structure-aware
term-to-binary is a pretty way for having versioned data. Since Erlang
already has a bunch of functional datastructures, it would be very
cool to be able to save versions of some data in a list, then just
save that list in Mnesia as versioning history  w/o complete copying.

On 8/28/08, Joe Armstrong <erlang@REDACTED> wrote:
> I got to thinking about records and structs, and this lead me to
>  think about the behaviour of term_to_binary ..
>
>  term_to_binary has a misfeature that would cause problems in
>  implementing dynamic records.
>
>  term_to_binary does not efficiently encode shared data
>  structures. This is best illustrated by an example:
>
>  Consider this
>
>  -module(test3).
>  -compile(export_all).
>
>  test() ->
>     Big = lists:duplicate(1000,a),
>     X = {Big},
>     Y = {Big,Big,Big,Big},
>     {sizeOf(Big),sizeOf(X), sizeOf(Y)}.
>
>  sizeOf(T) -> size(term_to_binary(T)).
>
>
>  Look what happens when we run this:
>
>  1> c(test3).
>  {ok,test3}
>  2> test3:test().
>  {4007,4009,16027}
>
>  The third number in this tuple surprises me.  I had expected it to be
>  12 bytes larger than 4009. Internally Y is a pointer to four words (an
>  arity tag, with value 4), then 4 identical pointers. But the fact that
>  sizeOf(Y) is four times sizeOf(X) means that shared sub-structures in
>  Erlang terms do not become shared in the binary representation of the
>  term.
>
>  Since term_to_binary and binary_to_term are *extremely* useful and I
>  use them all the time it seems that it would be a great win to change
>  the internal representation to allow shared data structures.
>
>  Why do I want this?
>
>  I was thinking about the "record problem" - records are syntactic
>  sugar for tuples.
>
>     If we say
>
>     -record(person, {name, age}).
>     X = #person{name="fred", age=30}
>
>     Then at run-time we create a tuple {person, "fred", 30}
>
>     Unfortunately, we loose the record field information at run-time,
>  thus we can't say X.name (to access the name field of X), but have to
>  say X#person.name in our code AND we have to have the record
>  definition available at compile time.
>
>    An "easy" fix to this would be to *change* the run-time
>  representation of tuples.
>
>    Suppose the run-time representation of the above record was
>  {person, [name,age], "fred", 30} - if this were the case then the
>  fields of the tuple would be self-describing and we could let the
>  compiler turn X.name into a function call lookup(X, name).
>
>     When we update a tuple
>
>     X1 = X#person{name="sue"} we are really just doing
>
>     X1 = setelement(3, X, "sue").
>
>     So if X was {person, [name,age], "fred", 30}
>     then X1 will be {person, [name,age], "sue", 30}
>
>     The additional heap space for X1 is 5 words (for the new tuple) +
>  the space to store "sue" (perhaps not even this, since I think it's a
>  shared program literal)
>
>     The point is that the second argument of X1 is just a pointer copy
>  (internally) so even if the X1 looks long when printed it is space efficient
>  in heap storage.
>
>     It seems to me that by changing the internal representation of an
>  instance of a record(foo, {tag1,tag2,tag3})
>
>     From {foo, Val1, Val2, Val3} to
>
>          {foo, [tag1,tag2,tag3], Val1, Val2, Val3}
>
>     Would be a step in the right direction in solving the record
>     problem.
>
>      The problem is that long lists of records will have a space inefficent
>  representation if converted to binaries with term_to_binary
>
>     Comments?
>
>
>  /Joe Armstrong
>  _______________________________________________
>  erlang-questions mailing list
>  erlang-questions@REDACTED
>  http://www.erlang.org/mailman/listinfo/erlang-questions
>