[erlang-questions] Data structures are strange things

Fri Jun 8 02:55:29 CEST 2012

I was finally going to release some alternate modules with the dict interface using tree structures, either AA trees or Red Black trees. These are relatively well-known binary tree structures which are binary tree versions of 2-3 trees and 2-3-4 trees respectively. These contain codes with either 2, 3 or 4 sub-nodes. Generally the binary versions are recommended as being better/easier/more space efficient. 

Before I released my modules I thought it would be interesting to try the non-binary versions to see how they behaved and what it was lie to program them. I was lucky to find a very good description of 2-3 trees and the rules for inserting/deleting elements; it was literally just translate the rules straight to Erlang and it worked. The code was a little more complex than working with the binary versions, not much. In some cases actually simpler. The I decided to measure speed, and the fun started. 

All tests were done with sequences of random numbers as keys, the same for each implementation. 

What I found was that the lookup times were about the same as for the binary versions, which was expected, but the insertion/deletion times were about 20% faster. This compared to my AA and RB tree versions. To check that it wasn't just bad coding on my part (not bloody likely :-)) I compared it with gb_trees and got the similar results. Also checking the data sizes I found that the 2-3 and 2-3-4 trees actually used *less* space than their binary counterparts. 

Here are the figures. I have also included dict, which is the slowest on insertion/deletion but significantly faster lookup times. All times are in usecs for the whole set: 

Operation                        Load        Fetch       Load        Fetch 
# of elements                         100000                  10000 
Overhead time                          43000                   3700 
ttdict 2-3 trees 310000 161000 24000 15500 
ttfdict 2-3-4 trees 312000 173000 24200 16500 
llrbdict Left Leaning RB trees 390000 164000 27100 14900 

rbdict Red Black trees 380000 167000 27100 15000 
aadict AA trees 423000 167000 30000 15000 

gb_trees using faster insert/3 370000 190000 26500 16200 
dict 522000 107000 28000 11000 
(I hope the table is readable) 
Gb_trees are a variant of AA trees with a slightly different algorithm. One should be careful interpreting exact values here but the trend is clear. 

I don't know why the straight multi-size versions are faster but my guess is that the pattern-matching/data construction is faster on the flat bigger nodes rather than on the nested pairs. My guess is that when they comment on the 2-3(-4) trees using more space it is because they use the same size node for all cases which in the cases with fewer sub-nodes wastes space, while just used different sized tuples which didn't waste space; it also made the pattern matching much easier. 

It looks like I will be submitting the 2-3 tree version, ttdict, as reasonable choice. Next I will try AVL trees which were discarded long ago and see how they fare. 

Robert 

P.S. This is the reason why there was hoped to be many different dict modules with the same API but different data structures. So as it to make it easier to choose the most suitable version for each occasion. There is no generally "best" choice, even among trees. 

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://erlang.org/pipermail/erlang-questions/attachments/20120608/b26cabb3/attachment.htm>