[erlang-questions] Cost of Copy

Fri May 28 18:00:40 CEST 2010

As others have pointed out data in Erlang is defined to immutable.
This implies that if you wish to do destructive operations then you
have to be sure they are not seen anywhere. In general this can be
quite difficult as it is often not trivial to see if a data structure,
or a portion of it, is shared. The compiler is very conservative about
this and IIRC only does it when there is a sequence of setelement/3
where the temporaries are not exported and nothing is done between
them. You typically get this when you do a single multiple record
update, for example

A1 = A0#foo{bar=57,baz="Robert",zip=now()}

It is better if you can combine record updates like this instead of
doing them one-by-one.

I think it is easiest to view a complex data structure as a tree with
generally different sized nodes. If you modify one of the nodes then
the only other nodes you will have to rewrite are those nodes between
the root of the tree and the modified node. Unless of course you are
doing something special along the way. Or you have botched your code.
:-)

What sized nodes depends entirely upon what you are doing, the only
general hint I can give is use that which fits best into your
application. I mean it, really. Don't go complicating matters by
picking weirdo data structures. So if you want a dynamic sequence then
a list is usually the best, though not always. If you have 7 things
which go together then put them in one 7 element tuple or in a record.
I would only really start worrying about splitting my tuples if they
get really big. Or if you feel that grouping them is what feels best.

The only time I would seriously start thinking about what size thing
to use is if you are writing more general library type packages.
Generally having smaller nodes but a deeper resulting tree will lead
to less rewriting, but as the tree is deeper it will cost more to
traverse it and though the rewrites will be smaller there will be more
of them. The opposite for wider flatter trees. It also depends on how
you select which subnode to go to, this is, of course, very algorithm
dependent. So in the array module each node contains 10 slots, IIRC,
but here it is easy to select the right one using its index. Dicts
also use indices and are wider. For more general search trees it
becomes more complex and binary trees are popular, but I would say
that it is more the algorithms which decide this.

A list is really a degenerate binary tree.

It all depends. This is clearly optimization and I would read
http://en.wikipedia.org/wiki/Program_optimization "When to optimize"
before getting to deep into this.

Robert

On 27 May 2010 22:09, Henning Diedrich <hd2010@REDACTED> wrote:
> Thanks a lot James!
>
> am I right then to conclude that mutating something in a list will be the
> faster the more the data is spread out deeper over several levels?
>
> A = { B, C, D, E, F, G, H }
>
> will be slower to change than, say
>
> A = { { B, C }, { D, E },  {F, G, H } }
>
> because if I change e.g. B, only { B', C } will be created as new  internal
> list of pointers, as opposed to the whole of { B', C, D, E, F, G, H }. Which
> should for larger lists and tuples, and frequent changes, start to make a
> difference?
>
> So this would be the fastes option?
>
> A = { { B }, { C }, { D }, { E },  { F }, { G }, { H } }
>
> And if the tuple may be expected to change in size (elements not only
> changeing but being added/deleted), this:
>
> A = { { { { B }, { C } }, { { D }, { E } } }, { { { F }, { G } }, { { H } }
> } }
>
> Maybe I am wrong that there may be a difference between tuples and lists,
> but is it  the same for
>
> A = [ B, C, D, E, F, G, H ]
>
> slower to change than
>
> A = [ { B, C }, { D, E },  { F, G, H } ]
>
> ? Or are list-elements changed faster, in general, virtually 'in-place' as
> only one pointer is replaced?
>
> Hardly I guess, because the list itself must not be changed? So the whole
> new list must be copied anyway, just like a tuple.
>
> Thanks for your take,
> Henning
>
>
> James Hague wrote:
>>
>> If you ignore message passing, then the copying/sharing issues in Erlang
>> are
>> just about the same as in Lisp and Scheme, so you may find some
>> information
>> that way.
>>
>> The short version is that some values are represented directly in a memory
>> "cell," like atoms and (most) integers. Others are pointers to data stored
>> elsewhere (list elements, or conses, which are two consecutive cells in
>> memory) or tuples (N consecutive cells plus one for the size). Just by
>> looking at code you can see where conses and tuples are created. If you
>> use
>> an existing "pointer" to data, then that data is not copied.
>>
>> Examples:
>>
>> B = {A, A, A}
>> This creates a three element tuple (four memory cells). It doesn't matter
>> if
>> A is a list of a million elements or a tuple or the number 2. B just
>> contains three cell-sized values (plus the size of the tuple).
>>
>> X = [57|List]
>> This creates a single cons (2 cells). The first element is 57. The second
>> is
>> a pointer to List. So even though you've got two lists (X and List), all
>> of
>> the data is shared except for the first cons of X.
>>
>>
>
>