[erlang-questions] data sharing is outside the semantics of Erlang, but it sure is useful

Thu Sep 17 04:46:01 CEST 2009

Dnia 2009-09-17, czw o godzinie 11:48 +1200, Richard O'Keefe pisze:
> On Sep 17, 2009, at 4:43 AM, Jayson Vantuyl wrote:
> > I don't really think that it's useful to classify data sharing as  
> > data compression.
> 
> It's very much in the spirit of dictionary-based compression.
> 
> >  Bottom line, there's an optimization (and a clearly important one)  
> > that Erlang isn't doing.
> 
> And can't reasonably be *expected* to do.  It is reasonable
> to expect Erlang to *preserve* sharing, as when sending a term
> to another process, because failing to do so can make space use
> blow up in a rather sickening way which it's hard for a
> programmer to detect.
> 
> I sometimes think that for every use case there is an equal
> and opposite use case.  In the case of memory, for example,
> we've got *space* issues and *cache* issues.  Looking for
> existing copies of stuff can save you space, but it can
> do terrible things to your cache (bringing in stuff that it
> turns out you don't want).  The tradeoffs depend on how much
> space you may save, how likely the saving is, and how well you
> can avoid looking at irrelevant stuff while looking for an
> existing copy.  The programmer is in a better position to know
> these things than the Erlang compiler or runtime system.
> 
> One thing I didn't quite understand was why the original data
> source is emitting stuff with lots of duplication in the first
> place.  Fixing the duplication problem at the source has the
> added benefit of reducing the cost of getting the data into an
> Erlang process to start with.
> 

There is also another good aspect of preserving sharing.

If we have X = Y = something(), then testing for equality X == Y
is just simple testing (in VM) for pointer first, if they match return
true, if not then test it deeper, if needed, also using this rule, and
eventually in the end use plain value.

This can speed up many things.

It is quite strange to anybody, that for example time complexity 
(and memory also as sharing is lost) of algorithm are different in
situations:

1) we built data structure D, and execute algorithm a(D)
2) we build data structure D, send it to P, which then executes a(D).

Difference between them can be very very big factor, and can depend
on the size of D. In pathological situation your algorithm can change
for example from O(n^2) to O(n^4). And i can imagine even bigger
problems than that.

PS. Beyond the fact that message passing and term_to_binary should be
aware of sharing and preserve as much as possible from it (eventualy
with some optional flag), i am also interested with some experimental
hybrid heap approach in erlang.

-- 
Witold Baryluk
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 198 bytes
Desc: To jest cz??? wiadomo?ci podpisana cyfrowo
URL: <http://erlang.org/pipermail/erlang-questions/attachments/20090917/e5586802/attachment.bin>