[erlang-questions] dict, orddict and gb_tree

Wed Feb 14 01:37:19 CET 2007

Robert Baruch wrote:
> On Feb 9, 2007, at 10:11 AM, Richard Carlsson wrote:
> 
>>gb_trees is slightly more lightweight and a little bit faster on most
>>operations than dict, so they are particularly good if you have large
>>numbers of moderately sized tables. But in general, you're not likely
>>to notice any difference until you really start to push the limits, so
>>for everyday programs, you can use either without loss.
> 
> I wrote some code a few days ago that had to store about 17,000 key/ 
> value pairs, where the key was an integer. I found that gb_trees was  
> far faster in lookup than dict was. I'm not sure if 17,000 is  
> supposed to be past the limits, but it sure was in my case :)

I had to test this so I wrote a small test program to compare the 
different implementations. It builds dictionaries with integer keys and 
a single atom as value. Three things are tested: add a new key/value to 
a dictionary, fetch all the values, and update all the values. To try 
and get a little less regular sequence of keys the keys are taken from 
the middle out of the ranges. So, for example, with 1000 keys the 
sequence is 500,501,499,502,498,503,...

Gb_trees has different functions for adding/updating the trees depending 
on whether you KNOW if the key already exists, add, enter and update. 
There are two values for gb_trees for adding a new value and updating an 
existing value, they are using add/enter and update/enter.Enter takes 
longer time as it much first check whether the key is already in the 
tree. I don't know if this due to the algorithm or the implementation. 
If this is a problem depends, of course, on your application.

For fun I implemented red-black trees directly out of Okasaki to get 
some more comparison, this is rbdict. It is dict/orddict compatible. Of 
course.

Here are the tables, first one with orddict for the smaller sizes and 
then one without (it was getting too wide to fit in). The figures are 
millisec and rough averages over a number of runs. They are good enough 
to see the trends.

Size        orddict          dict           rbdict          gb_trees
         add  fet  upd   add  fet  upd  add  fet  upd    add   fet  upd

   5000  563  550 1300
  10000 2250 2200 4718    32   15   32   31   15   31   78/125  15  31/31
  20000 8500 9200 18890   94   15   70   78   15   70  200/266  16  46/79

Size          dict             rbdict               gb_trees
          add   fet   upd    add  fet  upd        add    fet   upd

  10000    32   15    32     31   15    31     78/125    15   31/31
  20000    94   15    70     78   15    70    200/266    16   46/79
  50000   300   40   300    200   40   180    625/830    45  120/200
100000   850  100   850    420   94   390   1450/1900   95  266/438
200000  2580  230  2610    890  190   870   3300/5450  190  550/920
500000 17400  680 21500   2420  500  2350  11750/19000 500 1440/2520

Some comments about the results:

1. orddict is nice and quadratic as it should be. It is really only 
suitable for small trees.
2. Seeing I implemented dict I must say that it produced good results 
and hung on quite well, until about 50000 elements. This is not due to 
algorithm itself but rather due to the way in which the internal table 
is grown. Should probaby be fixed.
3. gb_trees is very slow on adding new values, but rather fast on 
fetching and updating them. I don't know why this is so.
4. It is a big win for gb_trees if you know whether you have an existing 
key or not when modifying the table.
5. I was a bit surprised that red-black trees could match gb_trees in 
fetching as red-black allows some inbalance in the tree. Perhaps the 
order of insertion was right for it?
6. (The obvious) There is no best algorithm but you should test your 
application to find out which is best for you.

It was to help with the last point that dict/orddict have exactly the 
same interface. The idea was that you could have a whole range of XXdict 
libraries and choose the one which best fit your needs, all you need to 
do to use another is change the module name, a DICT macro.

I may just complete rbdict and add it to the libraries. This may be the 
time to extend the interface with more useful functions.

I don't know if the has helped any, or just added to the confusion, but 
at least we have figures. :-) I include my test module, check for any 
really bad errors.

Robert

P.S. I wonder what trapexit will do to these tables? :-)