[erlang-questions] gb_trees and dict performance
Fred Hebert
mononcqc@REDACTED
Wed Dec 9 16:19:02 CET 2015
On 12/09, Vladimir Ralev wrote:
>I am curious about the O-complexity of the operations in these data
>structures in Erlang. It would seem to me that because of immutability
>adding elements in those structures would often require a duplication
>of the data structure and copying is an O(N) operation itself. So if
>you insert an element in a dict or tree instead of O(1) or O(logN) it
>would be O(N) complexity in both time and memory. I realize there are
>some shortcuts done internally to reuse the memory but I imagine that
>is not always possible.
>
>Two questions:
>
>1. Can someone comment on the average complexity, is it in line with
>what the collections in other languages offer?
>
Because the data structures are immutable, you are afforded the
possibility of greater structural sharing. When you update the tree, the
nodes the two versions have in common are kept, and the new tree simply
points into the old one for the parts that didn't change, giving you
something a bit like this in memory:
http://i.imgur.com/FxFFxuP.png
Basically, the entire unchanged subtree starting at `B' is kept for both
the `Old' and `New' trees, but the rest changes in some way. The `H'
node is also shared because it had no reason to change. The `F' node had
to be copied and rewritten to change which child it has, but note that
if the content of `F' is large, that content can be shared too. I.e. if
my node is `{Current, Left, Right}', I need to rebuild the node, but can
still point to the existing `Current' value from the older tree.
Garbage collection will later determine what can be reused or not. This
lets you update trees in O(log n) memory as well as time, rather than
O(n) memory if no sharing took place and deep copies were needed,
regardless of the content of the nodes.
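To make the persistence concrete, here's a quick sketch (hypothetical
keys and values) showing that updating a gb_tree leaves the old version
fully intact, with unchanged parts shared between the two copies:

```erlang
%% Updating a gb_tree yields a new version; the old one is untouched,
%% and unchanged subtrees are shared between the two in memory.
Old = gb_trees:from_orddict([{a,1},{b,2},{c,3},{d,4}]),
New = gb_trees:enter(c, 30, Old),
{value, 3}  = gb_trees:lookup(c, Old),  %% old tree still sees 3
{value, 30} = gb_trees:lookup(c, New),  %% new tree sees the update
{value, 2}  = gb_trees:lookup(b, New).  %% untouched entries carry over
```

Both versions stay usable afterwards; garbage collection reclaims
whichever one you drop.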
>2. If you have a job interview and are asked what is the complexity of
>dict:store() and gb_trees:enter() what would you say?
gb_trees:enter/3 would be O(log N). It's a general balanced tree with a
branching factor of two, and that's as safe a guess as any.
Dicts are trickier, as they rely on a nested bucket approach.
The bucket approach is basically one gigantic tuple at the top, linking
to many smaller tuples at the bottom. The tuples at the bottom contain
lists of the elements that hash to each slot. For example, for a
dictionary with 1,000,000 keys, you may get a top tuple with roughly 16k
elements in it, each of which is a tuple containing anywhere between 16
and 32 elements. The hashmap is implemented by walking down these
tuples, contracting or expanding them as needed.
You're never really going to do more than two hops down before finding
the list of elements you need (and then possibly scanning the whole
list), so it would be tempting to declare it O(1) for reads (I'm doing
all of this very informally). For writes, though, you still have to
modify the bottom slot and then copy and recreate a new top-level tuple,
which is costly (even more so if you expand and have to re-hash).
Creating a tuple is cheap, but the bigger it is, the costlier it is to
copy the element references from the old one to the new one, as there
are more of them.
So the weird thing is that this would intuitively be an O(1) insert, but
in practice, the cost of an update grows as the number of buckets grows
with the dictionary size, which would make it O(log N) with a branching
factor of 16 or so. Anyway, that would be my guess, given that the
lowest tuple holds between 16 and 32 elements, and that size is what
divides up the total number of entries to bound the top-level tuple.
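For what it's worth, all that copying hides behind the same persistent
interface as the trees; a small sketch (hypothetical key) showing that
dict:store/3 also leaves the old version untouched:

```erlang
%% Each store returns a new dict; earlier versions keep their contents.
D0 = dict:new(),
D1 = dict:store(k, 1, D0),
D2 = dict:store(k, 2, D1),
error   = dict:find(k, D0),  %% empty dict unchanged
{ok, 1} = dict:find(k, D1),
{ok, 2} = dict:find(k, D2).
```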
But the actual costs are a bit worse than that in terms of how they
scale up, probably due to theory meeting the real world. Here's the
measured cost of adding a single element in a dict as a very
unscientific benchmark on my work laptop:
Elements in Tree vs. Time to insert (µs)
- 10: 3µs
- 100: 4µs
- 1000: 7µs
- 10000: 8µs
- 100000: 11µs
- 1000000: 147µs
- 10000000: just creating the base dictionary takes so long I got fed up
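A rough sketch of how such a measurement could be reproduced, assuming a
dict built with a fold and a single insert timed with timer:tc/1 (module
and function names are made up; absolute numbers will vary per machine):

```erlang
-module(dict_bench).
-export([run/0]).

%% Build a dict of N entries, then time one extra dict:store/3 call.
time_insert(N) ->
    D = lists:foldl(fun(I, Acc) -> dict:store(I, I, Acc) end,
                    dict:new(), lists:seq(1, N)),
    {Micros, _Result} = timer:tc(fun() -> dict:store(N + 1, x, D) end),
    Micros.

run() ->
    [{N, time_insert(N)} || N <- [10, 100, 1000, 10000, 100000]].
```

Note this times a single call, so scheduler noise dominates at the small
sizes; averaging many inserts would give steadier numbers.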
Regards,
Fred.