[erlang-questions] Full text search in KV-storage.

Fri Sep 23 18:58:30 CEST 2011

On Fri, Sep 23, 2011 at 18:19, Oleg Chernyh <erlang@REDACTED> wrote:

> 1) We make a set of all words W

Two structures immediately suggest themselves:

* Tries. That is store a digital tree on the word. "CAT" will search
in a 26-way tree through the C-edge, the A-edge and the T-edge. If you
need a base 62 or a base 36 match, it may be beneficial to compress
nodes that are not totally full. An evil way is to use an ets table
with the compressed flag set and call ets:lookup_element/3 on the
beast. This is how a spelling checker often stores its dictionary.

* Bloom filter based structures. Look at how postgresql does it with
GiST indexes for instance:

http://www.postgresql.org/docs/8.3/static/textsearch-indexes.html

For each w \in document, you hash word, h(w) = n, and set the n'th bit
in a bit string. Then you or all hashes together \/_{w \in D} h(w).
This gives an approximate match, but hopefully it weeds out a lot of
things you need to search and the amount of false positives is
probably small. These bit strings can be stored as e.g., crit-bit
trees (also called radix or patricia trees - another trie-structure,
but this time binary). See for instance,

http://cr.yp.to/critbit.html by Dan Bernstein.

I did a crit-bit tree structure in Erlang some months ago,
https://github.com/jlouis/bench-map/blob/master/src/patricia.erl and
also one based on Hash-array Mapped Tries (HAMT, the basic data
structure in Clojures dictionaries),
https://github.com/jlouis/bench-map/blob/master/src/hamt.erl .. While
they do not outperform dict, gb_trees and Robert Virdings RB trees for
the workloads I've tried, the performance isn't too bad and they will
probably pack much better for this kind of thing.

Do note however that this is quite specialized and you can probably
win some raw speed by using a pre-baked solution. My suggestions
should take you up in speed by quite a lot, however.

-- 
J.