Cost of Copy

Sat May 8 09:43:00 CEST 2010

1

I am trying to get a better picture of what the performance costs are 
when mutating (collections of) complex structures.

Most generally, if I have a list of structures and want to make one 
change to one of the structures; and if I want to avoid ETS for the 
implicit copying, what else should I NOT do to not run into the same 
problem with a non-ETS approach, and how CAN I have as little 
unnecessary internal copying of all those things that stay the same.

I just ran into this post by J Louis while looking for information

http://jlouisramblings.blogspot.com/2009/01/common-erlang-misconceptions.html

It was giving some implicit answers about what results in internal 
copying, what doesn't. And I am not talking about strings here.

I wondered if there is a dedicated write up out there that explains a 
little more to the point what the use of lists, sets, trees, ets and 
mnesia implies in terms of copying, to better understand the performance 
implications. With immutable variables, of course there is always the 
hope that underneath, not much copying is actually necessary, as I 
understand e.g. lists are benefiting hugely from, to become O(1) when 
adding heads to them. So assumptions based on ALGOL heritage experiences 
should fail. Which puts one on a quest like Max Lapshin's  
http://erlang.2086793.n4.nabble.com/Help-with-storing-data-in-memory-td2119208.html#a2119208 
who is also really looking for a way to avoid implicit copying.

I have been looking for clues in this direction all the time but found 
them rather scarce in the docs. May be me. May be an idea to put more 
focus on in erldocs, erlang.org etc? I am sure there are links. Maybe 
they could be featured more prominently. I am sure there are many people 
coming to Erlang with a deeply ingrained eye on performance and would 
like to understand these implications earlier in the learning experience.

I mean things exactly like

"changing status from x to y would copy the record, BUT the actual 
record structure is only a set of pointers." by Ulf Wiger in his post

http://www.erlang.org/cgi-bin/ezmlm-cgi/4/1342 (var names changed)

2

 From that post I understand that the best performance should come from 
a data structure that is as nested as possible, because that will reduce 
the amount of pointers that are copied with every mutation?!

 As the mutation copies not-changed data only on one level. That's a 
very interesting implication that so far I haven't found spelled out 
anywhere (but would have liked to learn up front, if true). I am still 
not sure if it would hold across the board. Or is it wrong for other 
reasons I am not aware of? (Or could this post maybe for any strange 
reason be ... outdated? It's a decade old in June. Wow!)

3

Surveying the choices, does this imply I am to use a copy-heavy approach 
if I want to have meaningful transaction support between processes? 
Because the choice there is ets, dets or mnesia and they all copy the 
entire structure everytime there is any access to it?

4

I would encourage a view that these kind of questions would be best 
discussed quite early in the learning curve. Together with the 
implications that immutable state has for message passing and how it 
saves so much copying in that case. Or so I speculate. I am not speaking 
about the error prevention aspects here.

That said, thanks for all the efforts that are going into making the 
Erlang docs ever better! Even Google today adopted the classic format of 
the Erlang doc frame layout I realize. But seriously, thanks!

Henning