Efficient Functional Data Structures

Sat Jan 4 06:16:03 CET 2003

Is there any good information out there on efficient functional data
structures?  This is my first crack at one in Erlang and I'd like
someone to poke holes in my implementation and point out Erlang hot
spots.

This is a trie.  I think the word trie is taken from reTRIEve.  A useful
definition for my purposes here is that a trie is an unbalanced tree
with a child node for every possible next term in a sentence.  So you
can think of a trie as a data structure that can store every possible
list of Erlang terms in a tight space.  (Tries are sometimes used as
symbol tables in parsers; since they don't require you to store each
identifier you get good space savings for very large symbol tables.)

Walking a trie from root to any node recreates the list of terms that
was stored in the trie.

I am impressed by Erlang's ability to represent this functional data
structure so concisely.  This really is impressive.

Each node in the trie contains a dictionary that maps the next term to
another node and a "slot" (a list of length 1) that stores an object.  I
have a variation on this that appends to the slot thereby redefining the
slot as a list of objects.  In that case duplicate strings are allowed.

-module (trie).
...

% Creates an empty trie.
new () ->
    {[], []}.

% Adds or updates an object with the given key and returns the new trie.
store ([], Object, {TrieDict, _}) ->
    {TrieDict, [Object]};
store ([Elem | StringTail], Object, {TrieDict, Slot}) ->
    case orddict:find (Elem, TrieDict) of
        {ok, Subtrie} ->
            {orddict:store (Elem, store (StringTail, Object, Subtrie),
TrieDict), Slot};
        error ->
            {orddict:store (Elem, store (StringTail, Object, new ()),
TrieDict), Slot}
    end.

Examples:

T0 = trie:new().
T1 = trie:store ("there", 1, T0).
T2 = trie:store ("therein", 2, T1).
T3 = trie:store ("therein lies the problem", 3, T2).
T4 = trie:store ([therein,lies,the,problem], 4, T3).

I have search functions that I've omitted because I am primarily
concerned about the efficiency of building the trie and updating the
object stored in the "slot".

I use orddict because dict seems to have a lot of space overhead for
small dictionaries and I need to know the representation for other
operations and pattern-matching on the empty dictionary [].

I have two concerns.

The small concern:  Every time I add a child node I need to rebuild the
trie all the way back up to the root.  {orddict:store (Elem, store
(...), ...), ...}

The big concern:  I need to replace objects stored in each "slot" at a
high rate ~10,000 times per second.  (Or at least I desire this; I don't
think I can currently do it on a 1GHz CPU.)

It seems to me that if Erlang had a "ref" object like ObjectiveCaml I
could introduce a small side-effect and store a reference to a value in
the trie instead of rebuilding the trie every time the object is
updated.

Questions for the experts:  What can I do with "textbook" Erlang?  What
can I do to cheat (possibly with a C "hack")?  Am I doing things as good
as I can with clean Erlang?  What can I do to speed things up?

Sincerely,
Eric Newhuis