[erlang-questions] Full text search in KV-storage.
Sat Sep 24 00:27:55 CEST 2011
I know you said you don't want to use an external solution, but perhaps Riak
is an exception, given its design to scale out and the fact that it's
written in Erlang? It has full-text search capability. Might be worth
On Fri, Sep 23, 2011 at 12:19 PM, Oleg Chernyh <erlang@REDACTED> wrote:
> I'm writing a forum engine that can sustain a large number of users making
> queries at the same time.
> I'm using a KV storage that stores Erlang entities, which keeps overhead
> low and makes data accessible really fast (if we know the key, of course).
> The problem I'm facing is full-text search, as you might guess.
> I don't really want to use any external indexers or databases because I
> want that piece of software to be scalable node-wise and don't want to add
> any more data storage entities to the system, to prevent data duplication
> and to ensure consistency.
> So what I want to do is to "reinvent the bicycle" by implementing full-text
> search on a key-value storage.
> Here I'll outline my plan for writing that feature and I'd be happy to hear
> A simplistic way to think of a full-text search engine is as a strict text
> search engine, where the result of a search is a list of messages (posts)
> that contain the search keywords as whole words (if we search for "abc",
> only "abc" words are matched, so "abcd" won't match).
> In order to accomplish that I can think of the following things to do:
> 1) We make a set of all words W
> 2) For each w \in W we have an "index" I_w which returns a list of all
> posts that contain w
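Steps 1 and 2 above describe what is usually called an inverted index. Here is a minimal sketch of that idea, written in Python only to keep it short (the same shape maps onto Erlang terms in a KV store). The names `tokenize`, `build_index` and `search`, the tokenization rule, and the sample posts are all assumptions made for illustration, not anything from the original plan.

```python
import re
from collections import defaultdict

def tokenize(text):
    # Assumed rule: lowercase, then split on anything that is not a word character.
    return re.findall(r"\w+", text.lower())

def build_index(posts):
    # posts: dict of post id -> post text (hypothetical shape).
    # The result maps each word w to the set of post ids containing w (the I_w above).
    index = defaultdict(set)
    for post_id, text in posts.items():
        for word in tokenize(text):
            index[word].add(post_id)
    return index

def search(index, query):
    # Strict search: a post matches only if it contains every query word
    # as a whole word, so "abc" does not match "abcd".
    words = tokenize(query)
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result

posts = {1: "abc def", 2: "abcd def", 3: "abc abcd"}
index = build_index(posts)
print(search(index, "abc"))
print(search(index, "abc def"))
```

In a KV store, each word would be a key and the set of post ids its value, so a one-word lookup stays a single key read.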
> When we add a pure full text search functionality (indeed, we might want
> "geology" to match "geologist") we do the following:
> 1) make a set of normalized words W_n
> 2) make a set of common prefixes P
> 3) make a set of common suffixes S
> 4) define a normalization function f that strips common suffixes and
> prefixes from a word
> 5) define a matching function m that maps f(w) to a list of words from the
> W_n set
> 6) for each w in W_n we keep an "index" I_w that returns all the messages M
> for which \exists x \in M : w \in m(f(x))
> 7) for each word m in new message
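The normalization idea in steps 1 to 6 can be sketched roughly as follows, again in Python for brevity. Everything concrete here is an assumption: the prefix and suffix sets, the minimum stem length, and the simplification that the matching function m is the identity (each stripped word maps only to itself). A real stemmer would need far more careful rules.

```python
import re
from collections import defaultdict

PREFIXES = ("un", "re")                              # hypothetical set P
SUFFIXES = ("ists", "ist", "ies", "ing", "y", "s")   # hypothetical set S
MIN_STEM = 3                                         # assumed: never strip below 3 letters

def f(word):
    # Normalization function: strip at most one prefix and one suffix,
    # trying the longest candidates first.
    word = word.lower()
    for p in sorted(PREFIXES, key=len, reverse=True):
        if word.startswith(p) and len(word) - len(p) >= MIN_STEM:
            word = word[len(p):]
            break
    for s in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(s) and len(word) - len(s) >= MIN_STEM:
            word = word[:-len(s)]
            break
    return word

def build_normalized_index(posts):
    # Maps each normalized word to the set of post ids that contain
    # some word x with f(x) equal to it (m is taken to be the identity).
    index = defaultdict(set)
    for post_id, text in posts.items():
        for word in re.findall(r"\w+", text.lower()):
            index[f(word)].add(post_id)
    return index

def search_normalized(index, query):
    # Normalize the query words the same way, then intersect the posting sets.
    stems = [f(w) for w in re.findall(r"\w+", query.lower())]
    if not stems:
        return set()
    result = set(index.get(stems[0], set()))
    for stem in stems[1:]:
        result &= index.get(stem, set())
    return result

posts = {1: "The geologist arrived", 2: "I study geology", 3: "Erlang rocks"}
index = build_normalized_index(posts)
print(search_normalized(index, "geology"))
```

With these assumed rules both "geology" and "geologist" normalize to "geolog", so a query for either finds posts 1 and 2. The cost is false matches whenever unrelated words collapse to the same stem.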