[erlang-patches] better delete_any in gb_{sets, trees}, better difference in gb_sets

Mon Apr 16 23:52:31 CEST 2012

Hi,

I have found some problems with the general balanced trees/sets code.

The first one is the fact that 'delete_any' does twice the work it needs 
to be, since it first checks for membership and then uses 'delete'. 
Obviously we only need one traversal, you can find a patch here: 
https://github.com/bitonic/otp/compare/gb_delete_any .

The second one is regarding the performance of 'difference' and 'is_subset'.

The gb implementation used in otp does not rebalance on deletion, and 
does rebalancing in a quite arbitrary way in general. The implementation 
does not guarantee logarithmic costs for lookups/insertion/removals, 
both as an upper bound or amortized. It would be very easy to fix that 
(the paper on gb trees itself offers a solution 
http://user.it.uu.se/~arnea/abs/gb.html with almost no overhead). I 
could patch it if I knew I had good chances of getting the patches 
integrated, but I don't want to risk losing time.

In other words, it is very easy to end up with a degenerate tree that 
looks like a linked list, with enough well placed deletions. In 
practice, this is not a big problem, since the height of the three when 
inserting will be bounded logarithmically anyways, so it's impossible to 
end up in tragic situations.

First of all, this fact should be well documented in the docs. I surely 
wasn't expecting that, and I had to look at the source to find out.

More importantly, even more unpredictably, 'difference' is O(n*log(n)) 
on the size of the set we're subtracting from, with dreadful consequences:

     S = gb_sets:from_list(lists:seq(1, 1000000)).
     {T1, _} = timer:tc(gb_sets, difference, [S, 
gb_sets:from_list([500000])]).
     {T2, _} = timer:tc(gb_sets, delete_any, [500000, S]).

     32> T1.
     233342
     33> T2.
     8

All this work is to make sure that the tree is balanced in the end. In 
my opinion, this is the wrong approach to the problem. Deletions will 
unbalance the tree anyways, and rebalancing only on 'difference' won't 
solve the problem. It's certainly not worth the huge cost; 'difference' 
is basically useless when subtracting small sets from big sets, which in 
our codebase is the most common use case. The same applies for 
'is_subset'. Defining 'difference' in the simple way (fold + delete), 
works much better: https://github.com/bitonic/otp/compare/gb_difference .

Francesco.