[erlang-questions] Full text search in KV-storage.

Sun Sep 25 16:58:37 CEST 2011

Hello,

I wrote a full-text index in Erlang a while back. It's part of elib1,
which can be found at https://github.com/joearms/elib1

elib1 contains a lot more than just an index, so it may not be apparent
how to use it: you have to install elib1 first to even read the
documentation (which is probably a mistake).

The indexer in elib1 is a more-or-less literal translation of the algorithms
in the book "Managing Gigabytes" by I.H. Witten, A. Moffat and T.C. Bell.
(By the way, implementing a full-text indexer without having read the above
is a foolish thing to do, and I should know, since I wrote two full-text
indexers before reading it :-)

The tricky bits in an indexer are creating the inverted index (hint: use the
Erlang module file_sorter.erl in the standard distribution; it is blazingly
fast and possibly my favorite "module in the Erlang distribution that nobody
has ever heard of but which is amazingly good"). Then you have to compress the
index. I use gamma compression, described in the above book and implemented in
https://github.com/joearms/elib1/blob/master/lib/src/elib1_gamma.erl.
Index entries are represented by a variable-bit-length code. The trickiest
part of all is to decide which words in the document should be indexed, i.e.
answering the question "what is an English word". One solution is to create
all known trigrams (i.e. combinations of three letters in a row) from all
words in the English language and apply this as a filter to recognise unknown
words. There is code to do this in my Erlang book :-) But even this gives
unsatisfactory results. The Twitter idea of indexing strings starting with #
is much easier and yields better results.
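
To make the "variable bit length code" concrete, here is a minimal sketch of
gamma coding applied to the gaps between document ids in a postings list. It
is written in Python rather than Erlang, and it is not the code in
elib1_gamma.erl: the function names are made up for illustration, bits are
kept as a string of '0'/'1' characters for readability (a real index would
pack them), and it uses the common Elias gamma convention of a run of zeros
followed by the binary value, which may differ from the bit layout used in
the book or in elib1.

```python
def gamma_encode(n):
    """Elias gamma code for an integer n >= 1, as a string of '0'/'1'."""
    if n < 1:
        raise ValueError("gamma codes are only defined for n >= 1")
    binary = bin(n)[2:]                      # e.g. 9 -> '1001'
    # (len - 1) zeros announce how many bits follow the leading 1.
    return "0" * (len(binary) - 1) + binary


def gamma_decode_all(bits):
    """Decode a concatenation of gamma codes back into a list of integers."""
    values, i = [], 0
    while i < len(bits):
        zeros = 0
        while bits[i] == "0":                # count the length prefix
            zeros += 1
            i += 1
        values.append(int(bits[i:i + zeros + 1], 2))
        i += zeros + 1
    return values


def encode_postings(doc_ids):
    """Encode strictly increasing doc ids (all >= 1) as gamma-coded gaps."""
    prev, chunks = 0, []
    for doc_id in doc_ids:
        chunks.append(gamma_encode(doc_id - prev))  # small gaps -> short codes
        prev = doc_id
    return "".join(chunks)


def decode_postings(bits):
    """Inverse of encode_postings: running sum over the decoded gaps."""
    doc_ids, total = [], 0
    for gap in gamma_decode_all(bits):
        total += gap
        doc_ids.append(total)
    return doc_ids
```

The point of coding gaps rather than raw ids is that a frequent word has
small gaps between the documents it appears in, and small numbers get very
short gamma codes.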

One of these years I might break out the full-text indexer from elib1 as a
stand-alone application to make it easier to integrate. A further path that
might be worth investigating would be to make a Lucene front-end. The Lucene
file format is well documented: one might let some standard Java or C++
indexing engine perform the actual indexing and build the necessary index on
disk, and let Erlang do the queries. This way you'll only have to unpack the
disk data structures and interpret the results. Thinking out the appropriate
disk data structures for a full-text index is not easy. If you want to index
several GBytes of data you will not be able to work in-memory, and a lot of
thought has to be given to how to cache things in memory and avoid unnecessary
disk reads and writes. This has been studied for years. I'm a great fan of
reinventing the wheel, but reading "Managing Gigabytes" will give a lot of
insights into how to do this.
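
As a rough illustration of the disk-versus-memory trade-off, the sketch below
keeps only a small term dictionary in memory (term -> offset and length in a
postings file) and reads postings lists from disk on demand, with an LRU cache
so that popular terms do not hit the disk again. This is an assumed toy layout
in Python, not the Lucene file format and not what elib1 does; write_index and
DiskIndex are hypothetical names, and postings are stored uncompressed as
32-bit integers to keep the example short.

```python
import os
import struct
import tempfile
from functools import lru_cache


def write_index(path, postings):
    """Write postings lists to `path`.

    Returns the in-memory dictionary: term -> (byte offset, number of ids).
    """
    lexicon = {}
    with open(path, "wb") as f:
        for term in sorted(postings):
            ids = postings[term]
            lexicon[term] = (f.tell(), len(ids))
            f.write(struct.pack("<%dI" % len(ids), *ids))
    return lexicon


class DiskIndex:
    """Postings stay on disk; only the dictionary and a cache live in memory."""

    def __init__(self, path, lexicon, cache_size=1024):
        self.path = path
        self.lexicon = lexicon
        self.disk_reads = 0                      # counter for illustration only
        # Cache the most recently used postings lists.
        self.lookup = lru_cache(maxsize=cache_size)(self._read)

    def _read(self, term):
        if term not in self.lexicon:
            return ()
        offset, count = self.lexicon[term]
        self.disk_reads += 1
        with open(self.path, "rb") as f:
            f.seek(offset)                       # one seek + one read per term
            data = f.read(4 * count)
        return struct.unpack("<%dI" % count, data)
```

Calling lookup("erlang") twice on such an index touches the disk only once;
the hard design questions are what to do when the dictionary itself no longer
fits in memory and how to update the file as documents arrive.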

Cheers

/Joe

On Fri, Sep 23, 2011 at 6:19 PM, Oleg Chernyh <erlang@REDACTED> wrote:
> I'm writing a forum engine that can sustain a large number of users making
> queries at the same time.
> I'm using a KV storage that stores Erlang entities, which is a way to cut
> overhead and make data accessible really fast (if we know the key, of course).
> The problem I'm facing is full-text search, as you might guess.
> I don't really want to use any external indexers or databases because I want
> that piece of software to be scalable node-wise and don't want to add any
> more data storage entities to the system to prevent data duplication and to
> ensure consistency.
> So what I want to do is to "reinvent the bicycle" by implementing a full text
> search on a key-value storage.
>
> Here I'll outline my plan for writing that feature and I'd be happy to hear
> criticism.
>
> A simplistic way to think of a full text search engine is as a strict text
> search engine, where the result of a search is a list of messages (posts)
> that contain the full search keywords (if we search for "abc", only "abc" words
> are matched, so "abcd" won't match).
> In order to accomplish that I can think of the following things to do:
> 1) We make a set of all words W
> 2) For each w \in W we have an "index" I_w which returns a list of all posts
> that contain w
>
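
The two steps above (a set of words W and, for each w, an index I_w of the
posts containing it) amount to a plain inverted index. A minimal in-memory
sketch in Python follows; in the KV store each I_w would presumably live under
a key derived from w, which is an assumption on my part, and tokenize,
build_index and search are illustrative names only.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Split text into lower-case words; anything non-alphanumeric separates."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(posts):
    """posts: {post_id: text}. Returns {word: set of post ids}, i.e. I_w."""
    index = defaultdict(set)
    for post_id, text in posts.items():
        for word in tokenize(text):
            index[word].add(post_id)
    return index


def search(index, query):
    """Posts containing every word of the query, matched as whole words."""
    words = tokenize(query)
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())     # intersect the postings sets
    return result
```

With posts {1: "abc def", 2: "abcd def"}, a search for "abc" returns only
post 1, since "abcd" is a different key in the index.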
> When we add a pure full text search functionality (indeed, we might want
> "geology" to match "geologist") we do the following:
> 1) make a set of normalized words W_n
> 2) make a set of common prefixes P
> 3) make a set of common suffixes S
> 4) define a normalization function f that strips common suffixes and
> prefixes from a word
> 5) define a matching function m that maps f(w) to a list of words from the
> W_n set
> 5.1)
> 6) for each w in W_n we keep an "index" I_w that returns all the messages M
> for which \exists x \in M : m(f(x)) = w
> 7) for each word m in new message
>
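
For the normalization function f in the plan above, a deliberately naive
sketch: strip at most one common prefix and one common suffix, then index
posts under the normalized word. The PREFIXES and SUFFIXES lists and the
min_stem threshold are invented for this example; crude affix stripping like
this will conflate unrelated words, and a real stemmer (the Porter stemmer,
for instance) uses far more careful rules.

```python
import re
from collections import defaultdict

# Hypothetical affix lists, longest suffixes first so "ists" wins over "s".
PREFIXES = ("un", "re")
SUFFIXES = ("ists", "ist", "ies", "ing", "ed", "y", "s")


def normalize(word, min_stem=3):
    """f(w): strip one prefix and one suffix, keeping at least min_stem letters."""
    w = word.lower()
    for prefix in PREFIXES:
        if w.startswith(prefix) and len(w) - len(prefix) >= min_stem:
            w = w[len(prefix):]
            break
    for suffix in SUFFIXES:
        if w.endswith(suffix) and len(w) - len(suffix) >= min_stem:
            w = w[:-len(suffix)]
            break
    return w


def build_stem_index(posts):
    """posts: {post_id: text}. Index each post under its normalized words."""
    index = defaultdict(set)
    for post_id, text in posts.items():
        for word in re.findall(r"[a-z]+", text.lower()):
            index[normalize(word)].add(post_id)
    return index


def search_stem(index, word):
    """Posts containing any word that normalizes to the same stem as `word`."""
    return set(index.get(normalize(word), set()))
```

Here "geology", "geologist" and "geologists" all normalize to "geolog", so a
query for any one of them finds posts containing the others.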