[erlang-bugs] Bug in gb_trees ? Integer key not found.

Mon May 9 23:56:35 CEST 2011

Matthias Lang wrote:
> On Monday, May 09, Ulf Wiger wrote:
>> On 9 May 2011, at 22:01, PAILLEAU Eric wrote:
>>
>>>> T=gb_trees:from_orddict([{10,a1},{15,a2},{5,a3},{7,a4}]).
>>> {4,{5,a3,{15,a2,{10,a1,nil,nil},nil},{7,a4,nil,nil}}}
>>>
>>>> gb_trees:lookup(10,T).
>>> none
> 
>> The argument is not an orddict.

For the record, let me point out that this is not the first time this 
question arises.

> The above was the important part of your mail, but this made me wonder:
> 
>> The reason for the function is speed, obviously, and verifying that
>> the list is in fact ordered would defeat the speed advantage.
> 
> Checking that the input is ordered should be cheap since you have to
> traverse the whole input anyway. Here's gb_trees:from_orddict:
> 
>   from_orddict(L) ->
>     S = length(L),
>     {S, balance_list(L, S)}.
> 
> replacing the length() call with one which also checks input ordering
> is easy, so I did that. I measured a 5% slowdown on long (1M)
> lists. I'd live with that, but maybe not everyone would.

I strongly support adding something like the above to OTP! The benefits 
of doing so far outweigh the drawbacks. Those that cannot live with this 
overhead, should better switch to some other language, in my opinion. I 
have real trouble seeing how this change would make any measurable 
difference in some application. (On the other hand, it might prevent 
some accidental hair loss for some developers...)

Even better yet, why not make orddict() opaque so that dialyzer can 
detect constructions of "fake ordicts" (i.e. orddicts that are not 
produced by calling the functions of the orddict module) ?

> I then took a look at "who uses gb_trees:from_orddict anyway?". In
> OTP, it's mostly the beam compiler itself. This code pops up a few
> times:
> 
>   gb_trees:from_orddict(lists:sort(L))

Hmmm...

I would have expected the above to read:

    gb_trees:from_orddict(orddict:from_list(L))

instead, which seems to me more kosher. (*)

<aside>
Note that the two calls might have different semantics actually:

    1> orddict:from_list([{2,a},{1,b},{2,a},{3,c}]).
    [{1,b},{2,a},{3,c}]
    2> lists:sort([{2,a},{1,b},{2,a},{3,c}]).
    [{1,b},{2,a},{2,a},{3,c}]
</aside>

Is this call to lists:sort/1 correct/intentional?  Only the original 
developer can answer such questions, if he happens to still recall 
whether this was done just to save a millisecond or for some other 
reason...  God only knows how many possible subtle bugs may be hidden in 
such (or similar) code.

> It feels like there's a function missing in gb_trees, one which inserts a
> list of tuples which aren't sorted (but do have unique keys).

Right again.

Kostis

(*) Even stranger is code in beam_clean.erl which reads:

        Lmap = gb_trees:from_orddict(ordsets:from_list(Lmap0)),

     and

        gb_trees:from_orddict(ordsets:from_list(Acc)).