[erlang-patches] better delete_any in gb_{sets, trees}, better difference in gb_sets
Björn-Egil Dahlberg
egil@REDACTED
Tue Apr 17 16:42:35 CEST 2012
Hi Francesco,
On 2012-04-16 23:52, Francesco Mazzoli wrote:
> Hi,
>
> I have found some problems with the general balanced trees/sets code.
>
> The first one is the fact that 'delete_any' does twice the work it
> needs to be, since it first checks for membership and then uses
> 'delete'. Obviously we only need one traversal, you can find a patch
> here: https://github.com/bitonic/otp/compare/gb_delete_any .
I haven't looked too closely but it seems like a great patch. The patch
will be "included in pu". This is the same as saying: the patch will go
through our review process.
>
> The second one is regarding the performance of 'difference' and
> 'is_subset'.
>
> The gb implementation used in otp does not rebalance on deletion, and
> does rebalancing in a quite arbitrary way in general. The
> implementation does not guarantee logarithmic costs for
> lookups/insertion/removals, both as an upper bound or amortized. It
> would be very easy to fix that (the paper on gb trees itself offers a
> solution http://user.it.uu.se/~arnea/abs/gb.html with almost no
> overhead). I could patch it if I knew I had good chances of getting
> the patches integrated, but I don't want to risk losing time.
Obviously I cannot say that a patch will be included before it has been
written other than saying that; if you submit a patch we will review it.
Also, we want our community to take part in Erlang/OTP development. We
want the community help us improve our code. We want this to be easy. We
also want to ensure that patches we integrate into Erlang/OTP does not
degrade performance or functionality in any way. That is our goal.
I cannot speak for gb_trees and gb_sets specifically other than saying
that if a patch,
- fixes a problem,
- or increases performance,
- does not break existing functionality,
- works for every platform, including Windows! (mainly a c-code problem),
- and proves this with testcases and a explanation in readable text
it is a pretty safe bet it will be included in Erlang/OTP.
>
> In other words, it is very easy to end up with a degenerate tree that
> looks like a linked list, with enough well placed deletions. In
> practice, this is not a big problem, since the height of the three
> when inserting will be bounded logarithmically anyways, so it's
> impossible to end up in tragic situations.
>
> First of all, this fact should be well documented in the docs. I
> surely wasn't expecting that, and I had to look at the source to find
> out.
I agree.
>
> More importantly, even more unpredictably, 'difference' is O(n*log(n))
> on the size of the set we're subtracting from, with dreadful
> consequences:
>
> S = gb_sets:from_list(lists:seq(1, 1000000)).
> {T1, _} = timer:tc(gb_sets, difference, [S,
> gb_sets:from_list([500000])]).
> {T2, _} = timer:tc(gb_sets, delete_any, [500000, S]).
>
> 32> T1.
> 233342
> 33> T2.
> 8
>
> All this work is to make sure that the tree is balanced in the end. In
> my opinion, this is the wrong approach to the problem. Deletions will
> unbalance the tree anyways, and rebalancing only on 'difference' won't
> solve the problem. It's certainly not worth the huge cost;
> 'difference' is basically useless when subtracting small sets from big
> sets, which in our codebase is the most common use case. The same
> applies for 'is_subset'. Defining 'difference' in the simple way (fold
> + delete), works much better:
> https://github.com/bitonic/otp/compare/gb_difference .
>
This will also be reviewed.
Thank you for taking the time to send us your findings and results.
Regards,
Björn-Egil
Erlang/OTP
More information about the erlang-patches
mailing list