[erlang-questions] Data structures are strange things

Fri Jun 8 06:24:05 CEST 2012

"I don't know why the straight multi-size versions are faster but my
guess is that the pattern-matching/data construction is faster on the
flat bigger nodes rather than on the nested pairs...."

Just a guess here (but at least a guess informed by an obsession with
search tree data structures, several decades ago):

Before you go looking for the effects of a reduction of space,
remember that the conventional time complexity analysis for operations
on such data structures is predicated on destructive updates. An O(log
n) time complexity for delete or update represents the *worst case*.
Some operations will result in no rebalancing at all, just the
addition or removal of a node.

Without destructive updates, you always need to rebuild/rewrite tree
nodes from the point of update up to the root. Even if nothing
substantive changes along the way. It'll cost ya. In space, too.

How can you reduce these costs? By systematically reducing the
distance to the root. You can do that by choosing a structure with a
higher (average) branching factor. The higher the branching factor of
your search structure, the shallower it will be, and the fewer
node-allocate-and-rewrite operations you'll need to do on your way
from the updated node to the root. (This probably reduces GC costs as
well, though I expect that would be a second-order effect.)

Increasing the branching factor might have diminishing returns after
some point, but I doubt that a branching factor of 4 is anywhere near
that point. 2-3 trees, and 2-3-4 trees, have higher (average)
branching factors than binary trees. Leaning RB trees are binary, but
there might be something in how they lean that effectively reduces the
average distance from updated node to root. I don't know anything
about them.

"P.S. This is the reason why there was hoped to be many different dict
modules with the same API but different data structures. So as it to
make it easier to choose the most suitable version for each occasion.
There is no generally "best" choice, even among trees."

Sign me up for that.

-michael turner

On Fri, Jun 8, 2012 at 10:32 AM, Richard O'Keefe <ok@REDACTED> wrote:
> If you take a look at a typical 2-3-4 tree implementation in Java
>    http://www.cs.berkeley.edu/~jrs/61bs02/hw/hw7/dict/Tree234.java
>>>  http://www.cs.berkeley.edu/~jrs/61bs02/hw/hw7/dict/Tree234Node.java
> you'll see nodes containing
>
>        an integer saying how many keys there are
>        three keys
>        four links to children nodes
>        a link to a parent node
>
> Allow one word for "what is the class of this object" and that's
> 10 words per node.
>
> Using Erlang tuples, we'd have
>        {K1,C0,C1}                      4 words
>        {K1,K2,C0,C1,C2}                6 words
>        {K1,K2,K3,C0,C1,C2,C3}          8 words
> WAM-based Prolog systems would take the same space;
> SML/NJ should be able to shave a word off each (using the bottom
> 2 bits of a pointer to say which of the three cases applies).
>
> In a functional language, when you make a change you create new nodes,
> so no node ever has to change size/shape.  You _could_ do that in Java
> too, but people often don't think of it.
>
> It seems pretty clear that the Prolog/Erlang/SML version of a 2-3-4
> tree is going to take *less* space than a binary tree in the same
> language.  Since in these languages there is often a strong correlation
> between space and time, I find Robert Virding's results pleasant but
> not that surprising.
>
>
> _______________________________________________
> erlang-questions mailing list
> erlang-questions@REDACTED
> http://erlang.org/mailman/listinfo/erlang-questions