[erlang-questions] gb_trees and dict performance
Fred Hebert
mononcqc@REDACTED
Wed Dec 9 16:19:02 CET 2015
On 12/09, Vladimir Ralev wrote:
>I am curious about the O-complexity of the operations in these data
>structures in Erlang. It would seem to me that because of immutability
>adding elements in those structures would often require a duplication
>of the data structure and copying is an O(N) operation itself. So if
>you insert an element in a dict or tree instead of O(1) or O(logN) it
>would be O(N) complexity in both time and memory. I realize there are
>some shortcuts done internally to reuse the memory but I imagine that
>is not always possible.
>
>Two questions:
>
>1. Can someone comment on the average complexity, is it in line with
>what the collections in other languages offer?
>
Because the data structures are immutable, you are afforded the
possibility of greater structural sharing. When you update the tree, the
nodes the two versions have in common are kept, and the new tree simply
points into the old one for the parts that didn't change, giving you
something a bit like this in memory:
http://i.imgur.com/FxFFxuP.png
Basically, the entire unchanged subtree starting at `B' is kept for both
the `Old' and `New' trees, but the rest changes in some way. The `H'
node is also shared because it had no reason to change. The `F' node had
to be copied and rewritten to change which child it has, but note that
if the content of `F' is large, that content can be shared too. I.e. if
my node is `{Current, Left, Right}', I need to rebuild the node, but can
still point to the existing `Current' value from the older tree.
Garbage collection will later determine what can be reused or not. This
lets you update trees in O(log n) memory as well as time, rather than
O(n) memory if no sharing took place and deep copies were needed,
regardless of the content of the nodes.
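To make the persistence concrete, here's a quick sketch (hypothetical
keys and values) showing that updating a gb_tree leaves the old version
fully intact, with unchanged parts shared between the two copies:

```erlang
%% Updating a gb_tree yields a new version; the old one is untouched,
%% and unchanged subtrees are shared between the two in memory.
Old = gb_trees:from_orddict([{a,1},{b,2},{c,3},{d,4}]),
New = gb_trees:enter(c, 30, Old),
{value, 3}  = gb_trees:lookup(c, Old),  %% old tree still sees 3
{value, 30} = gb_trees:lookup(c, New),  %% new tree sees the update
{value, 2}  = gb_trees:lookup(b, New).  %% untouched entries carry over
```

Both versions stay usable afterwards; garbage collection reclaims
whichever one you drop.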
>2. If you have a job interview and are asked what is the complexity of
>dict:store() and gb_trees:enter() what would you say?
gb_trees:enter/3 would be O(log N). It's a general balanced tree with a
branching factor of two, and that's as safe a guess as any.
Dicts are trickier, as they rely on a nested bucket approach.
The bucket approach is basically one gigantic tuple at the top, linking
to many smaller tuples at the bottom. The tuples at the bottom contain
lists of the elements that hash to each slot. For example, for a
dictionary with 1,000,000 keys, you may get a top tuple with roughly 16k
elements in it, each of which is a tuple containing anywhere between 16
and 32 elements. The hashmap is implemented by walking down these
tuples, contracting or expanding them as needed.
You're never really going to do more than two hops down before finding
the list of elements you need (and then possibly scanning the whole
list), so it would be tempting to declare it O(1) for reads (I'm doing
all of this very informally). For writes, though, you still have to
modify the bottom slot and then copy and recreate a new top-level tuple,
which is costly (even more so if you expand and have to re-hash).
Creating a tuple is cheap, but the bigger it is, the costlier it is to
copy the element references from the old one to the new one, as there
are more of them.
So the weird thing is that this would intuitively be an O(1) insert, but
in practice, the cost of an update grows as the number of buckets grows
with the dictionary size, which would make it O(log N) with a branching
factor of 16 or so. Anyway, that would be my guess, given that the
lowest tuple holds between 16 and 32 elements, and that size is what
divides up the total number of entries to bound the top-level tuple.
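For what it's worth, all that copying hides behind the same persistent
interface as the trees; a small sketch (hypothetical key) showing that
dict:store/3 also leaves the old version untouched:

```erlang
%% Each store returns a new dict; earlier versions keep their contents.
D0 = dict:new(),
D1 = dict:store(k, 1, D0),
D2 = dict:store(k, 2, D1),
error   = dict:find(k, D0),  %% empty dict unchanged
{ok, 1} = dict:find(k, D1),
{ok, 2} = dict:find(k, D2).
```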
But the actual costs are a bit worse than that in terms of how they
scale up, probably due to theory meeting the real world. Here's the
measured cost of adding a single element in a dict as a very
unscientific benchmark on my work laptop:
Elements in Tree vs. Time to insert (µs)
- 10: 3µs
- 100: 4µs
- 1000: 7µs
- 10000: 8µs
- 100000: 11µs
- 1000000: 147µs
- 10000000: just creating the base dictionary takes so long I got fed up
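A rough sketch of how such a measurement could be reproduced, assuming a
dict built with a fold and a single insert timed with timer:tc/1 (module
and function names are made up; absolute numbers will vary per machine):

```erlang
-module(dict_bench).
-export([run/0]).

%% Build a dict of N entries, then time one extra dict:store/3 call.
time_insert(N) ->
    D = lists:foldl(fun(I, Acc) -> dict:store(I, I, Acc) end,
                    dict:new(), lists:seq(1, N)),
    {Micros, _Result} = timer:tc(fun() -> dict:store(N + 1, x, D) end),
    Micros.

run() ->
    [{N, time_insert(N)} || N <- [10, 100, 1000, 10000, 100000]].
```

Note this times a single call, so scheduler noise dominates at the small
sizes; averaging many inserts would give steadier numbers.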
Regards,
Fred.