Manipulating Data Structures

Fri Jun 2 16:23:17 CEST 2000

On Fri, 2 Jun 2000, Eberhard Lutz wrote:

>Hello together,
>
>I hope you forgive this newbie question from a C++, Perl, etc.
>geek who is willing to convert to Erlang in order to set up a
>fault-tolerant and distributed system without having to write
>base classes for the next 5 years ....

A noble ambition...

>The question is about data structures and the copy overhead
>of their modification.
>Let's say you have in Perl something like
>        $record = {
>            "language" => 'Java',
>            "status" => 'overhyped',
>            "usage" => 'with care',
>            "samples" => [ $prog1, $prog2 ]     %% each $progX is a 200kByte data chunk
>       }
>Now, how do I most effeciently do something like
>        $record->{status} = "dead".
>
>So to put it simple: how do I update large data structures without
>running into useless copy operations of the whole structure.

Well, first of all, you *will* copy the whole record if you update one
attribute. In Erlang syntax, you'd want to come up with a logical name
for $record (other than "record"):

-record(language, {name,
                   status,
                   usage,
                   samples}).

And then create your instance like this:

Java = #language{name = java,
                 status = overhyped,
                 usage = carefully,
                 samples = [Sample1, Sample2, ...]}.

Now, changing status from overhyped to dead would copy the record, 
BUT the actual record structure is only a set of pointers.
Physically, the record is stored as a tuple (for now):

{language, java, overhyped, carefully, [Sample1, Sample2, ...]}.

The tags that start with lowercase are atoms (essentially labels).
They are a 4-byte pointer into an atom table. The list is a 4-byte
pointer to a cons cell (a pointer to a data object + a pointer to the
next cons cell -- or is the cons cell stored directly in the record? I
don't remember.) Basically, I believe that the record is an array of
pointers to the data objects (in your case 5, including the record
tag.) In other words, 20 bytes will be copied when you change the
status attribute.

Thus, the cost of updating a record is roughly proportional to the
number of other elements in the record -- not their size. 

However, if you'd store the updated record in ETS or mnesia
afterwards, the whole record will be copied. As the brunt of the data
in your record resided in the 'samples' element, you could break that
out, and store them in a separate ETS table. That way, you wouldn't
have to copy them every time you read/write the record to ETS.

>I assume that Erlang is handling these matters as elegantly as it
>handles the rest of the world, but being suspicious by nature it
>would make me sleep better if I get tons of replies like 'Hey silly
>newbie, this is simple ....'.

Well, Erlang probably doesn't do everything it could to reduce the
copying overhead. I'm sure it would be possible for the compiler to
perform destructive update behind the scenes in many cases.

How badly you'll be hurt by the copying overhead depends largely on
what you're trying to implement. I once wrote a few variations on
Dijkstra's "shortest path first" algorithm, and then we implemented
the same algorithm in C -- C was invariably 50-100 times faster.
In that particular case, the performance difference turned out not to
be that critical after all, since the frequency of updates was pretty
small. 

You will have to make up your own mind about what's the best balance
between productivity (=Erlang), fault-tolerance (=Erlang), readability
(=Erlang) and raw speed (=C, if the problem is simple enough,
otherwise, possibly a toss-up, or maybe even Erlang). But make sure
you measure real performance, and not just try to predict how it will
perform. It's very difficult to try to calculate what the VM will do
and how much it will cost. We've seen this a number of times.

If you want to build a fault-tolerant distributed system, write a
prototype in Erlang. I can almost promise you that you'll be able to
meet your performance criteria in, if not all aspects, at least 90% of
them. Then figure out where your bottlenecks are, try to understand
what makes them bottlenecks, and then decide what to do with them
(switch algorithms? switch to C?) You'll find some experienced Erlang
hackers on this list to discuss your options with.

/Uffe
-- 
Ulf Wiger                                    tfn: +46  8 719 81 95
Network Architecture & Product Strategies    mob: +46 70 519 81 95
Ericsson Telecom AB,              Datacom Networks and IP Services
Varuvägen 9, Älvsjö,                    S-126 25 Stockholm, Sweden