[erlang-questions] Full text search in KV-storage.

Sat Sep 24 01:58:19 CEST 2011

Jesper, thank you for the insights.
I'm far from linguistics and full text search engines, do your input was particulary useful.
And what about my idea that I have briefly described?
I'm sure that it has many flaws such as m-lookup and re-indexing problem, but that's how I store non-full text search indexes at the moment such as tag searches and searches by author. So my mind was spinning around the idea of homogeneous O(1) indexes.
To those who suggest using 3rd party storages - too much of the current logic relies on a "native" storage I currently use so I'd be glad to stick with it.

----- Reply message -----
From: "Jesper Louis Andersen" <jesper.louis.andersen@REDACTED>
Date: Fri, Sep 23, 2011 7:58 pm
Subject: [erlang-questions] Full text search in KV-storage.
To: "Oleg Chernyh" <erlang@REDACTED>
Cc: <erlang-questions@REDACTED>

On Fri, Sep 23, 2011 at 18:19, Oleg Chernyh <erlang@REDACTED> wrote:

> 1) We make a set of all words W

Two structures immediately suggest themselves:

* Tries. That is store a digital tree on the word. "CAT" will search
in a 26-way tree through the C-edge, the A-edge and the T-edge. If you
need a base 62 or a base 36 match, it may be beneficial to compress
nodes that are not totally full. An evil way is to use an ets table
with the compressed flag set and call ets:lookup_element/3 on the
beast. This is how a spelling checker often stores its dictionary.

* Bloom filter based structures. Look at how postgresql does it with
GiST indexes for instance:

http://www.postgresql.org/docs/8.3/static/textsearch-indexes.html

For each w \in document, you hash word, h(w) = n, and set the n'th bit
in a bit string. Then you or all hashes together \/_{w \in D} h(w).
This gives an approximate match, but hopefully it weeds out a lot of
things you need to search and the amount of false positives is
probably small. These bit strings can be stored as e.g., crit-bit
trees (also called radix or patricia trees - another trie-structure,
but this time binary). See for instance,

http://cr.yp.to/critbit.html by Dan Bernstein.

I did a crit-bit tree structure in Erlang some months ago,
https://github.com/jlouis/bench-map/blob/master/src/patricia.erl and
also one based on Hash-array Mapped Tries (HAMT, the basic data
structure in Clojures dictionaries),
https://github.com/jlouis/bench-map/blob/master/src/hamt.erl .. While
they do not outperform dict, gb_trees and Robert Virdings RB trees for
the workloads I've tried, the performance isn't too bad and they will
probably pack much better for this kind of thing.

Do note however that this is quite specialized and you can probably
win some raw speed by using a pre-baked solution. My suggestions
should take you up in speed by quite a lot, however.

-- 
J.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://erlang.org/pipermail/erlang-questions/attachments/20110924/a2f1235f/attachment.htm>