[erlang-patches] better delete_any in gb_{sets, trees}, better difference in gb_sets

Tue Apr 17 18:22:03 CEST 2012

(ccing the list)

Hi Richard,

> As the comment below indicates, this was a design decision by Sven-Olov
> in order to keep the code straightforward. If take a tree T0 and delete
> elements to create T1, then T1 may have a suboptimal structure with
> respect to the new size of T1, but lookups will be no slower than they
> were in T0, and if you start adding elements to T1 you will soon trigger
> a rebalancing.

Sure, but that offers weaker guarantees than what you'd expect from 
balancing binary trees. Moreover, balancing will be triggered more often 
than it is actually needed (after deletions). The details are in the 
linked Arne Andersson paper ( http://user.it.uu.se/~arnea/abs/gb.html ).

> The docs do say the following:
>
> * There is no attempt to balance trees after deletions. Since
> deletions do not increase the height of a tree, this should be OK
>
> * balance/1: Rebalances Tree1. Note that this is rarely necessary,
> but may be motivated when a large number of nodes have been
> deleted from the tree without further insertions. Rebalancing
> could then be forced in order to minimise lookup times, since
> deletion only does not rebalance the tree.

True, but it doesn't clearly state that the tree does not guarantee 
logarithmic complexity for the usual operations, which is peculiar. It 
states "As with normal tree structures, lookup (membership testing), 
insertion and deletion have logarithmic complexity.". I take it that 
this means logarithmic complexity *on average*.

> That is, users are expected to know if they delete a large enough
> percentage of elements (without doing any insertions) to warrant manual
> rebalancing.

This is nontrivial to do in the general case, and the user that actually 
wants to do that will end up rolling up his own implementation of a 
rebalancing tree.

> Of course, if there's a good way of making it happen  automatically, I'm all
> for it.

There is an easy way, see Theorem 2 in the paper. There isn't an easy 
way to achieve linear 'difference', which is easy to achieve if the size 
of the subtrees are stored - or if there is, I can't see it right now.

Francesco.