[erlang-patches] better delete_any in gb_{sets, trees}, better difference in gb_sets
Francesco Mazzoli
francesco@REDACTED
Mon Apr 16 23:52:31 CEST 2012
Hi,
I have found some problems with the general balanced trees/sets code.
The first one is the fact that 'delete_any' does twice the work it needs
to be, since it first checks for membership and then uses 'delete'.
Obviously we only need one traversal, you can find a patch here:
https://github.com/bitonic/otp/compare/gb_delete_any .
The second one is regarding the performance of 'difference' and 'is_subset'.
The gb implementation used in otp does not rebalance on deletion, and
does rebalancing in a quite arbitrary way in general. The implementation
does not guarantee logarithmic costs for lookups/insertion/removals,
both as an upper bound or amortized. It would be very easy to fix that
(the paper on gb trees itself offers a solution
http://user.it.uu.se/~arnea/abs/gb.html with almost no overhead). I
could patch it if I knew I had good chances of getting the patches
integrated, but I don't want to risk losing time.
In other words, it is very easy to end up with a degenerate tree that
looks like a linked list, with enough well placed deletions. In
practice, this is not a big problem, since the height of the three when
inserting will be bounded logarithmically anyways, so it's impossible to
end up in tragic situations.
First of all, this fact should be well documented in the docs. I surely
wasn't expecting that, and I had to look at the source to find out.
More importantly, even more unpredictably, 'difference' is O(n*log(n))
on the size of the set we're subtracting from, with dreadful consequences:
S = gb_sets:from_list(lists:seq(1, 1000000)).
{T1, _} = timer:tc(gb_sets, difference, [S,
gb_sets:from_list([500000])]).
{T2, _} = timer:tc(gb_sets, delete_any, [500000, S]).
32> T1.
233342
33> T2.
8
All this work is to make sure that the tree is balanced in the end. In
my opinion, this is the wrong approach to the problem. Deletions will
unbalance the tree anyways, and rebalancing only on 'difference' won't
solve the problem. It's certainly not worth the huge cost; 'difference'
is basically useless when subtracting small sets from big sets, which in
our codebase is the most common use case. The same applies for
'is_subset'. Defining 'difference' in the simple way (fold + delete),
works much better: https://github.com/bitonic/otp/compare/gb_difference .
Francesco.
More information about the erlang-patches
mailing list