[erlang-questions] term_to_binary and record improvements

Thu Aug 28 22:07:30 CEST 2008

It seems to me that your mail contains two proposals:

1) A better way of working with records in Erlang
2) A more efficient way to encode an Erlang term into a binary, which
   might be needed to make 1) work efficiently.

ad 1) 
I am *very* new to Erlang, and the way Erlang approaches records
surprised me a bit, but I can appreciate it as a pragmatic solution to a
real-world problem. Your solution would allow accessing record fields
without explicitly specifying the type of the record at every access, at
the cost of storing (a pointer to) required type information in every
tuple-that-is-actually-a-record. I see three obvious drawbacks to this:

- It opens up the possibility of "hacky" code, with people constructing
  records in novel and dynamic ways. On the other hand, that is what
  dynamically typed languages are for, and real programmers can write
  assembler in any language.

- It is slightly less space efficient, though the overhead is probably
  negligible.

- It breaks backward compatibility, although you can probably implement
  this in the compiler/VM by analyzing the tuple and deciding on whether
  we are dealing with a new or an old-style record-tuple.

This prompts the question of what original problem you are trying to
solve, and whether the change is worth it. One advantage I see is that
it would allow more generic code: referring to X.name, for instance,
would work for every record that has a field called "name", without
having to specify the record type. This could be used to implement some
sort of "inheritance" in records. You might want to extend the record
definition language to make this possible.
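
To make the generic-access idea concrete, here is a minimal sketch,
assuming the {Tag, FieldNames, Val1, ...} layout proposed in the quoted
mail below. The module name dynrec and the helper index/4 are made up
for illustration; lookup/2 is only the name used in the proposal, not an
existing BIF.

```erlang
-module(dynrec).
-export([lookup/2]).

%% Hypothetical helper: fetch Field from a tuple laid out as
%% {Tag, [Field1, ..., FieldN], Val1, ..., ValN}.
%% The values start at position 3, right after the field-name list.
lookup(Rec, Field) when is_tuple(Rec), is_list(element(2, Rec)) ->
    index(Field, element(2, Rec), 3, Rec).

%% Walk the field-name list and the value positions in step.
index(Field, [Field | _], Pos, Rec) ->
    element(Pos, Rec);
index(Field, [_ | Rest], Pos, Rec) ->
    index(Field, Rest, Pos + 1, Rec);
index(Field, [], _Pos, Rec) ->
    erlang:error({no_such_field, Field, element(1, Rec)}).
```

With this, dynrec:lookup(X, name) would work on both
{person, [name,age], "fred", 30} and {company, [name,city], "acme", "Stockholm"}
without the caller knowing which record it holds. Note that the lookup is
a linear scan of the field list, which a real implementation would
presumably want to avoid.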

ad 2)
A more space efficient binary for a term is Good(tm). It's just a matter
of contract: What is the contract between term_to_binary and its user?
If the only contract is that a binary that is unmodified can be exploded
into a term again, then the more space efficient binary is just a better
implementation of the BIF. However, if people expect to be able to pick
a binary apart, change it, and then explode it into a term again, then
the new implementation would break the contract, and a term_to_binary2
with an updated contract is called for.
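
The weaker of the two contracts can be written down as a round-trip
property. This is only a sketch: the module name t2b_contract and the
function roundtrip/1 are hypothetical, while term_to_binary/1 and
binary_to_term/1 are the existing BIFs.

```erlang
-module(t2b_contract).
-export([roundtrip/1]).

%% Hypothetical check of the "unmodified binary" contract:
%% encoding a term and decoding the untouched binary must give
%% back a term that is exactly equal to the original.
roundtrip(Term) ->
    Bin = term_to_binary(Term),
    binary_to_term(Bin) =:= Term.
```

An encoder that preserves sharing would still pass this check, since it
says nothing about the bytes in between. Only code that inspects or
edits the binary itself depends on the stronger contract.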

My 2 eurocent.

++Jos.ch

On Thu, Aug 28, 2008 at 08:47:25PM +0200 it came to pass that Joe Armstrong wrote:
> I got to thinking about records and structs, and this led me to
> think about the behaviour of term_to_binary ..
> 
> term_to_binary has a misfeature that would cause problems in
> implementing dynamic records.
> 
> term_to_binary does not efficiently encode shared data
> structures. This is best illustrated by an example:
> 
> Consider this
> 
> -module(test3).
> -compile(export_all).
> 
> test() ->
>     Big = lists:duplicate(1000,a),
>     X = {Big},
>     Y = {Big,Big,Big,Big},
>     {sizeOf(Big),sizeOf(X), sizeOf(Y)}.
> 
> sizeOf(T) -> size(term_to_binary(T)).
> 
> 
> Look what happens when we run this:
> 
> 1> c(test3).
> {ok,test3}
> 2> test3:test().
> {4007,4009,16027}
> 
> The third number in this tuple surprises me.  I had expected it to be
> 12 bytes larger than 4009. Internally Y is a pointer to four words (an
> arity tag, with value 4), then 4 identical pointers. But the fact that
> sizeOf(Y) is four times sizeOf(X) means that shared sub-structures in
> Erlang terms do not become shared in the binary representation of the
> term.
> 
> Since term_to_binary and binary_to_term are *extremely* useful and I
> use them all the time it seems that it would be a great win to change
> the internal representation to allow shared data structures.
> 
> Why do I want this?
> 
> I was thinking about the "record problem" - records are syntactic
> sugar for tuples.
> 
>     If we say
> 
>     -record(person, {name, age}).
>     X = #person{name="fred", age=30}
> 
>     Then at run-time we create a tuple {person, "fred", 30}
> 
>     Unfortunately, we lose the record field information at run-time,
> thus we can't say X.name (to access the name field of X), but have to
> say X#person.name in our code AND we have to have the record
> definition available at compile time.
> 
>    An "easy" fix to this would be to *change* the run-time
> representation of tuples.
> 
>    Suppose the run-time representation of the above record was
> {person, [name,age], "fred", 30} - if this were the case then the
> fields of the tuple would be self-describing and we could let the
> compiler turn X.name into a function call lookup(X, name).
> 
>     When we update a tuple
> 
>     X1 = X#person{name="sue"} we are really just doing
> 
>     X1 = setelement(3, X, "sue").
> 
>     So if X was {person, [name,age], "fred", 30}
>     then X1 will be {person, [name,age], "sue", 30}
> 
>     The additional heap space for X1 is 5 words (for the new tuple) +
> the space to store "sue" (perhaps not even this, since I think it's a
> shared program literal)
> 
>     The point is that the second argument of X1 is just a pointer copy
> (internally) so even if the X1 looks long when printed it is space efficient
> in heap storage.
> 
>     It seems to me that by changing the internal representation of an
> instance of a record(foo, {tag1,tag2,tag3})
> 
>     From {foo, Val1, Val2, Val3} to
> 
>          {foo, [tag1,tag2,tag3], Val1, Val2, Val3}
> 
>     Would be a step in the right direction in solving the record
>     problem.
> 
>      The problem is that long lists of records will have a space inefficient
> representation if converted to binaries with term_to_binary
> 
>     Comments?
> 
> /Joe Armstrong

-- 
What cannot be shunned must be embraced. That is the Path...