[erlang-questions] data sharing is outside the semantics of Erlang, but it sure is useful

Thu Sep 17 07:52:48 CEST 2009

On Sep 17, 2009, at 1:02 PM, Jayson Vantuyl wrote:
> I've run into this when working with a simple graph algorithm.
> Representing edges as {source,dest} was great for atoms and
> horrible for strings.  All of my tests used atoms, but at runtime,
> the strings were being duplicated (because I was messaging them
> around).  It was noticeable.

This sounds to me like a perfect example of duplication that
should be avoided at the source.  Graphs should be sent as
    {graph,{NodeNames},[{F1,T1},...,{Fn,Tn}]}
where the Fi and Ti are integer indices into the {NodeNames}
tuple, so that each name appears in the message only once.
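
As a rough sketch of the encoding above (the module name graph_pack
and the functions pack/1 and unpack/1 are made up for illustration,
and it assumes a release with maps, which did not exist in 2009):

```erlang
-module(graph_pack).
-export([pack/1, unpack/1]).

%% pack(Edges) -> {graph, NodeNames, IndexedEdges}
%% Edges is a list of {From, To} pairs whose names may be strings.
%% Each distinct name is stored once in the NodeNames tuple, and
%% every edge refers to a name by its 1-based position in that tuple.
pack(Edges) ->
    %% lists:append/1 (not flatten) keeps each string intact.
    Names = lists:usort(lists:append([[F, T] || {F, T} <- Edges])),
    Index = maps:from_list(lists:zip(Names, lists:seq(1, length(Names)))),
    Indexed = [{maps:get(F, Index), maps:get(T, Index)} || {F, T} <- Edges],
    {graph, list_to_tuple(Names), Indexed}.

%% unpack/1 is the inverse: it turns the indices back into names.
unpack({graph, NodeNames, Indexed}) ->
    [{element(F, NodeNames), element(T, NodeNames)} || {F, T} <- Indexed].
```

The receiver can keep working with the small integer indices and
call element/2 only when it actually needs a name, so the strings
are copied once per message instead of once per edge.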
>
> Another problem I had was with a backend for the Linux Network Block  
> Device.  I was tossing around disk blocks (4k binaries) and had  
> pathological memory usage really quickly.
>
> Real development has real problems with unnecessary data  
> duplication.  This is not a matter of optimization.  Someone needs  
> to finish one of the alternate heap implementations.  Really.

There seem to be two issues confused here.
One of them is the fact that when you send a message,
all sharing within the message is removed (except that
large binaries are not supposed to be copied).
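
One way to see this loss of sharing from inside Erlang is sketched
below.  The module name sharing_demo is invented for illustration;
erts_debug:size/1 and erts_debug:flat_size/1 are debugging helpers
in ERTS rather than a stable public API, so treat this as an
experiment, not as something to rely on in production code.

```erlang
-module(sharing_demo).
-export([sizes/0]).

%% sizes() -> {Shared, Flat, Received}, all measured in heap words.
sizes() ->
    Big = lists:duplicate(1000, $x),    % built at run time, not a literal
    Term = {Big, Big},                  % both elements refer to one list
    Shared = erts_debug:size(Term),     % counts the shared list once
    Flat = erts_debug:flat_size(Term),  % size if all sharing were expanded
    Parent = self(),
    Pid = spawn(fun() ->
                    receive
                        {measure, T} ->
                            %% Measure the copy that arrived in this process.
                            Parent ! {self(), erts_debug:size(T)}
                    end
                end),
    Pid ! {measure, Term},
    Received = receive {Pid, N} -> N end,
    {Shared, Flat, Received}.
```

Shared is always smaller than Flat here.  If the copy made by the
send has lost its sharing, Received will match Flat rather than
Shared; what you actually observe depends on how the runtime was
built.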

We *agree* that this is a bad thing.  My message was explicit
that Erlang should preserve sharing.

But it didn't sound as though that's what the original poster
was talking about.  I may well have misunderstood; it would
not be the first time.

It's claimed that preserving sharing would make message sending
too expensive.  There's an answer to that.  Set a modest
threshold, say 100 cells or so, and try the existing way of
sending first.  If the copy crosses that threshold, give up and
start over with a method that preserves within-message sharing.
Small messages then cost what they cost today, and only large
ones pay for the extra bookkeeping.
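
The real change would have to be made in the emulator's copying
code, which is written in C.  The following is only a toy model of
the decision in Erlang; the module name send_policy, the function
choose/2 and the cell-counting rule (one cell per cons, one per
tuple) are all assumptions made for illustration.

```erlang
-module(send_policy).
-export([choose/2]).

%% choose(Term, Threshold) -> plain | preserve_sharing
%% Walk the term the way a naive copy would.  If the walk stays
%% within Threshold cells, the existing copy is good enough.
%% Otherwise abandon the walk and pick the sharing-preserving method.
choose(Term, Threshold) ->
    try count(Term, 0, Threshold) of
        _ -> plain
    catch
        throw:over_threshold -> preserve_sharing
    end.

%% Count one cell per cons and one per tuple; other terms count as 0.
count([H | T], N, Max) ->
    count(T, count(H, bump(N, Max), Max), Max);
count(Tuple, N, Max) when is_tuple(Tuple) ->
    lists:foldl(fun(E, Acc) -> count(E, Acc, Max) end,
                bump(N, Max), tuple_to_list(Tuple));
count(_Other, N, _Max) ->
    N.

%% Add one cell, giving up as soon as the budget is exceeded.
bump(N, Max) when N + 1 > Max -> throw(over_threshold);
bump(N, _Max) -> N + 1.
```

A plain traversal like this cannot see sharing, so a shared subterm
is counted every time it is reached.  That is the cost the existing
copy would pay, which makes it a reasonable trigger for switching
methods.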