[erlang-patches] better delete_any in gb_{sets, trees}, better difference in gb_sets

Björn-Egil Dahlberg egil@REDACTED
Tue Apr 17 16:42:35 CEST 2012


Hi Francesco,


On 2012-04-16 23:52, Francesco Mazzoli wrote:
> Hi,
>
> I have found some problems with the general balanced trees/sets code.
>
> The first one is the fact that 'delete_any' does twice the work it 
> needs to be, since it first checks for membership and then uses 
> 'delete'. Obviously we only need one traversal, you can find a patch 
> here: https://github.com/bitonic/otp/compare/gb_delete_any .

I haven't looked too closely but it seems like a great patch. The patch 
will be "included in pu". This is the same as saying: the patch will go 
through our review process.

>
> The second one is regarding the performance of 'difference' and 
> 'is_subset'.
>
> The gb implementation used in otp does not rebalance on deletion, and 
> does rebalancing in a quite arbitrary way in general. The 
> implementation does not guarantee logarithmic costs for 
> lookups/insertion/removals, both as an upper bound or amortized. It 
> would be very easy to fix that (the paper on gb trees itself offers a 
> solution http://user.it.uu.se/~arnea/abs/gb.html with almost no 
> overhead). I could patch it if I knew I had good chances of getting 
> the patches integrated, but I don't want to risk losing time.

Obviously I cannot say that a patch will be included before it has been 
written other than saying that; if you submit a patch we will review it.

Also, we want our community to take part in Erlang/OTP development. We 
want the community help us improve our code. We want this to be easy. We 
also want to ensure that patches we integrate into Erlang/OTP does not 
degrade performance or functionality in any way. That is our goal.

I cannot speak for gb_trees and gb_sets specifically other than saying 
that if a patch,

- fixes a problem,
- or increases performance,
- does not break existing functionality,
- works for every platform, including Windows! (mainly a c-code problem),
- and proves this with testcases and a explanation in readable text

it is a pretty safe bet it will be included in Erlang/OTP.

>
> In other words, it is very easy to end up with a degenerate tree that 
> looks like a linked list, with enough well placed deletions. In 
> practice, this is not a big problem, since the height of the three 
> when inserting will be bounded logarithmically anyways, so it's 
> impossible to end up in tragic situations.
>
> First of all, this fact should be well documented in the docs. I 
> surely wasn't expecting that, and I had to look at the source to find 
> out.

I agree.

>
> More importantly, even more unpredictably, 'difference' is O(n*log(n)) 
> on the size of the set we're subtracting from, with dreadful 
> consequences:
>
>     S = gb_sets:from_list(lists:seq(1, 1000000)).
>     {T1, _} = timer:tc(gb_sets, difference, [S, 
> gb_sets:from_list([500000])]).
>     {T2, _} = timer:tc(gb_sets, delete_any, [500000, S]).
>
>     32> T1.
>     233342
>     33> T2.
>     8
>
> All this work is to make sure that the tree is balanced in the end. In 
> my opinion, this is the wrong approach to the problem. Deletions will 
> unbalance the tree anyways, and rebalancing only on 'difference' won't 
> solve the problem. It's certainly not worth the huge cost; 
> 'difference' is basically useless when subtracting small sets from big 
> sets, which in our codebase is the most common use case. The same 
> applies for 'is_subset'. Defining 'difference' in the simple way (fold 
> + delete), works much better: 
> https://github.com/bitonic/otp/compare/gb_difference .
>

This will also be reviewed.

Thank you for taking the time to send us your findings and results.

Regards,
Björn-Egil
Erlang/OTP



More information about the erlang-patches mailing list