[erlang-questions] Full text search in KV-storage.

Kresten Krab Thorup krab@REDACTED
Mon Sep 26 19:58:59 CEST 2011


You should take a look at https://github.com/basho/merge_index

Kresten 

On 23/09/2011, at 18.20, "Oleg Chernyh" <erlang@REDACTED> wrote:

> I'm writing a forum engine that can sustain high amount of users  
> making queries at the same time.
> I'm using a KV storage that stores erlang entities which is the way to  
> kill overheads and make data accessible really fast (if we know the  
> key, ofc).
> The problem I'm facing is full-text search, as you might guess.
> I don't really want to use any external indexers or databases because  
> I want that piece of software to be scalable node-wise and don't want  
> to add any more data storage entites to the system to prevent data  
> duplication and to ensure consistency.
> So what I want to do is to "invent bicycle" by implementing a full  
> text search on a key-value storage.
> 
> Here I'll outline my plan for writing that feature and I'd be happy to  
> hear criticism.
> 
> A simplistic way to think of a full text search engine is a strict  
> text search engine, when the result of a search is a list of messages  
> (posts) that contain full search keywords (if we search for "abc" only  
> "abc" words are matched, so that "abcd" won't match).
> In order to accomplish that I can think of the following things to do:
> 1) We make a set of all words W
> 2) For each w \in W we have an "index" I_w which returns a list of all  
> posts that contain w
> 
> When we add a pure full text search functionality (indeed, we might  
> want "geology" to match "geologist") we do the following:
> 1) make a set of normalized words W_n
> 2) make a set of common prefixes P
> 3) make a set of common suffixes S
> 4) define a normalization function f that strips common suffixes and  
> prefixes from a word
> 5) define a matching function m that maps f(w) to a list of words from  
> the W_n set
> 5.1)
> 6) for each w in W_n we keep an "index" W_n that returns all the  
> messages M for which \exists x \in M : m(f(x)) = w
> 7) for each word m in new message
> 
> _______________________________________________
> erlang-questions mailing list
> erlang-questions@REDACTED
> http://erlang.org/mailman/listinfo/erlang-questions



More information about the erlang-questions mailing list