[erlang-questions] Full text search in KV-storage.

Sat Sep 24 01:11:41 CEST 2011

And if you really wanted to do everything the hard way, you could
probably manage to re-use some of the components that are in Riak.
Would be far simpler to just use Riak as-is though.

On Fri, Sep 23, 2011 at 3:27 PM, Ryan Zezeski <rzezeski@REDACTED> wrote:
> Oleg,
> I know you said you don't want to use an external solution but perhaps Riak
> [1] is an exception given its design to scale out and the fact that it's
> written in Erlang?  It has full-text search capability [2].  Might be worth
> a look.
> -Ryan
>
> [1]: http://wiki.basho.com/Riak.html
> [2]: http://wiki.basho.com/Riak-Search.html
> On Fri, Sep 23, 2011 at 12:19 PM, Oleg Chernyh <erlang@REDACTED> wrote:
>>
>> I'm writing a forum engine that can sustain a large number of users making
>> queries at the same time.
>> I'm using a KV storage that stores Erlang entities, which keeps overhead
>> low and makes data accessible really fast (if we know the key, of
>> course).
>> The problem I'm facing is full-text search, as you might guess.
>> I don't really want to use any external indexers or databases because I
>> want that piece of software to be scalable node-wise, and I don't want to
>> add any more data storage entities to the system, both to prevent data
>> duplication and to ensure consistency.
>> So what I want to do is "reinvent the bicycle" by implementing full text
>> search on top of a key-value storage.
>>
>> Here I'll outline my plan for writing that feature and I'd be happy to
>> hear criticism.
>>
>> A simplistic way to think of a full text search engine is as a strict text
>> search engine, where the result of a search is a list of messages (posts)
>> that contain the search keywords as whole words (if we search for "abc",
>> only "abc" words are matched, so "abcd" won't match).
>> To accomplish that, I can think of the following things to do:
>> 1) We make a set of all words W
>> 2) For each w \in W we have an "index" I_w which returns a list of all
>> posts that contain w
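
The two steps quoted above amount to an inverted index. A minimal sketch in
Python, under assumptions the plan does not state: posts are held as a dict of
id to text, and a "word" is a lowercased run of letters or digits. The names
tokenize, build_index and search are illustrative only.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Split text into lowercase words (the word pattern is an assumption)."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(posts):
    """Build I_w: map every word w in W to the set of post ids containing w."""
    index = defaultdict(set)
    for post_id, body in posts.items():
        for word in tokenize(body):
            index[word].add(post_id)
    return index

def search(index, query):
    """Return ids of posts that contain every query word as a whole word."""
    words = tokenize(query)
    if not words:
        return set()
    # Copy the first posting set so the intersection does not mutate the index.
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result

# Hypothetical sample data.
posts = {1: "abc def", 2: "abcd def", 3: "abc abcd"}
index = build_index(posts)
print(sorted(search(index, "abc")))      # [1, 3]  ("abcd" alone does not match)
print(sorted(search(index, "abc def")))  # [1]
```

Since every lookup is by exact key, each I_w could plausibly be stored in the
KV storage under the key w, with the set of post ids as its value.
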
>>
>> To add real full text search functionality (indeed, we might want
>> "geology" to match "geologist"), we do the following:
>> 1) make a set of normalized words W_n
>> 2) make a set of common prefixes P
>> 3) make a set of common suffixes S
>> 4) define a normalization function f that strips common suffixes and
>> prefixes from a word
>> 5) define a matching function m that maps f(w) to a list of words from the
>> W_n set
>> 5.1)
>> 6) for each w \in W_n we keep an "index" I_w that returns all the messages
>> M for which \exists x \in M : w \in m(f(x))
>> 7) for each word m in new message
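
Steps 1 to 6 of the quoted list can be sketched roughly as follows, again in
Python. Everything concrete here is an assumption made for illustration: the
affix sets PREFIXES and SUFFIXES are tiny hypothetical stand-ins for P and S,
the min_stem guard is an addition of this sketch, and the matching function m
is taken to be the identity, so the stripped word itself is the index key.

```python
import re
from collections import defaultdict

# Hypothetical stand-ins for the sets P and S; longest suffixes come first.
PREFIXES = ("un", "re")
SUFFIXES = ("ists", "ist", "ing", "y", "s")

def tokenize(text):
    """Split text into lowercase words (the word pattern is an assumption)."""
    return re.findall(r"[a-z0-9]+", text.lower())

def normalize(word, min_stem=3):
    """f: strip at most one common prefix and one common suffix.

    min_stem is an assumed guard so that short words are not stripped away.
    """
    w = word.lower()
    for prefix in PREFIXES:
        if w.startswith(prefix) and len(w) - len(prefix) >= min_stem:
            w = w[len(prefix):]
            break
    for suffix in SUFFIXES:
        if w.endswith(suffix) and len(w) - len(suffix) >= min_stem:
            w = w[:-len(suffix)]
            break
    return w

def build_normalized_index(posts):
    """Map each normalized word to the set of post ids that contain it."""
    index = defaultdict(set)
    for post_id, body in posts.items():
        for word in tokenize(body):
            index[normalize(word)].add(post_id)
    return index

def search_normalized(index, query):
    """Return ids of posts matching every query word after normalization."""
    keys = [normalize(word) for word in tokenize(query)]
    if not keys:
        return set()
    result = set(index.get(keys[0], set()))
    for key in keys[1:]:
        result &= index.get(key, set())
    return result

# Hypothetical sample data.
posts = {1: "The geologist arrived", 2: "Geology is fun", 3: "Biology class"}
index = build_normalized_index(posts)
print(sorted(search_normalized(index, "geology")))  # [1, 2]
```

Naive affix stripping like this will conflate unrelated words and miss
irregular forms, so the contents of P and S matter a great deal; the sketch
only shows where f sits between tokenizing and index lookup.
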
>>
>> _______________________________________________
>> erlang-questions mailing list
>> erlang-questions@REDACTED
>> http://erlang.org/mailman/listinfo/erlang-questions