[erlang-questions] dict, orddict and gb_tree

Wed Feb 14 02:28:19 CET 2007

Really great! I like to see figures so that I have an idea of what is "a
large amount of data".

Of course, it would be good to have this in trapexit, or maybe even better
in the OTP/Erlang documentation itself (thinking to a chapter "Data
structures/containers" in section "Programming examples"). I really do think
all participations in this mailing list thread are important and I'm not
fond of bookmarking mailing list as documentation references :).
Maybe a good time to learn how to create a post in trapexit ;).

Thanks!

On 2/14/07, Robert Virding <robert.virding@REDACTED> wrote:
>
> Robert Baruch wrote:
> > On Feb 9, 2007, at 10:11 AM, Richard Carlsson wrote:
> >
> >>gb_trees is slightly more lightweight and a little bit faster on most
> >>operations than dict, so they are particularly good if you have large
> >>numbers of moderately sized tables. But in general, you're not likely
> >>to notice any difference until you really start to push the limits, so
> >>for everyday programs, you can use either without loss.
> >
> > I wrote some code a few days ago that had to store about 17,000 key/
> > value pairs, where the key was an integer. I found that gb_trees was
> > far faster in lookup than dict was. I'm not sure if 17,000 is
> > supposed to be past the limits, but it sure was in my case :)
>
> I had to test this so I wrote a small test program to compare the
> different implementations. It builds dictionaries with integer keys and
> a single atom as value. Three things are tested: add a new key/value to
> a dictionary, fetch all the values, and update all the values. To try
> and get a little less regular sequence of keys the keys are taken from
> the middle out of the ranges. So, for example, with 1000 keys the
> sequence is 500,501,499,502,498,503,...
>
> Gb_trees has different functions for adding/updating the trees depending
> on whether you KNOW if the key already exists, add, enter and update.
> There are two values for gb_trees for adding a new value and updating an
> existing value, they are using add/enter and update/enter.Enter takes
> longer time as it much first check whether the key is already in the
> tree. I don't know if this due to the algorithm or the implementation.
> If this is a problem depends, of course, on your application.
>
> For fun I implemented red-black trees directly out of Okasaki to get
> some more comparison, this is rbdict. It is dict/orddict compatible. Of
> course.
>
> Here are the tables, first one with orddict for the smaller sizes and
> then one without (it was getting too wide to fit in). The figures are
> millisec and rough averages over a number of runs. They are good enough
> to see the trends.
>
>
> Size        orddict          dict           rbdict          gb_trees
>          add  fet  upd   add  fet  upd  add  fet  upd    add   fet  upd
>
>    5000  563  550 1300
>   10000 2250 2200 4718    32   15   32   31   15   31   78/125  15  31/31
>   20000 8500 9200 18890   94   15   70   78   15   70  200/266  16  46/79
>
> Size          dict             rbdict               gb_trees
>           add   fet   upd    add  fet  upd        add    fet   upd
>
>   10000    32   15    32     31   15    31     78/125    15   31/31
>   20000    94   15    70     78   15    70    200/266    16   46/79
>   50000   300   40   300    200   40   180    625/830    45  120/200
> 100000   850  100   850    420   94   390   1450/1900   95  266/438
> 200000  2580  230  2610    890  190   870   3300/5450  190  550/920
> 500000 17400  680 21500   2420  500  2350  11750/19000 500 1440/2520
>
>
> Some comments about the results:
>
> 1. orddict is nice and quadratic as it should be. It is really only
> suitable for small trees.
> 2. Seeing I implemented dict I must say that it produced good results
> and hung on quite well, until about 50000 elements. This is not due to
> algorithm itself but rather due to the way in which the internal table
> is grown. Should probaby be fixed.
> 3. gb_trees is very slow on adding new values, but rather fast on
> fetching and updating them. I don't know why this is so.
> 4. It is a big win for gb_trees if you know whether you have an existing
> key or not when modifying the table.
> 5. I was a bit surprised that red-black trees could match gb_trees in
> fetching as red-black allows some inbalance in the tree. Perhaps the
> order of insertion was right for it?
> 6. (The obvious) There is no best algorithm but you should test your
> application to find out which is best for you.
>
>
> It was to help with the last point that dict/orddict have exactly the
> same interface. The idea was that you could have a whole range of XXdict
> libraries and choose the one which best fit your needs, all you need to
> do to use another is change the module name, a DICT macro.
>
> I may just complete rbdict and add it to the libraries. This may be the
> time to extend the interface with more useful functions.
>
> I don't know if the has helped any, or just added to the confusion, but
> at least we have figures. :-) I include my test module, check for any
> really bad errors.
>
> Robert
>
> P.S. I wonder what trapexit will do to these tables? :-)
> _______________________________________________
> erlang-questions mailing list
> erlang-questions@REDACTED
> http://www.erlang.org/mailman/listinfo/erlang-questions
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://erlang.org/pipermail/erlang-questions/attachments/20070214/1f227c85/attachment.htm>