# [erlang-questions] Full text search in KV-storage.

Oleg Chernyh erlang@REDACTED
Sat Sep 24 01:40:04 CEST 2011

I actually had a look at Riak, I might even give it a try at some other project of mine.
Thank you for your input anyway.

From: "Ryan Zezeski" <rzezeski@REDACTED>
Date: Sat, Sep 24, 2011 1:27 am
Subject: [erlang-questions] Full text search in KV-storage.
To: "Oleg Chernyh" <erlang@REDACTED>
Cc: <erlang-questions@REDACTED>

Oleg,

I know you said you don't want to use an external solution but perhaps Riak
[1] is an exception given it's design to scale out and the fact that it's
written in Erlang?  It has full-text search capability [2].  Might be worth
a look.

-Ryan

[1]: http://wiki.basho.com/Riak.html

[2]: http://wiki.basho.com/Riak-Search.html

On Fri, Sep 23, 2011 at 12:19 PM, Oleg Chernyh <erlang@REDACTED> wrote:

> I'm writing a forum engine that can sustain high amount of users making
> queries at the same time.
> I'm using a KV storage that stores erlang entities which is the way to kill
> overheads and make data accessible really fast (if we know the key, ofc).
> The problem I'm facing is full-text search, as you might guess.
> I don't really want to use any external indexers or databases because I
> want that piece of software to be scalable node-wise and don't want to add
> any more data storage entites to the system to prevent data duplication and
> to ensure consistency.
> So what I want to do is to "invent bicycle" by implementing a full text
> search on a key-value storage.
>
> Here I'll outline my plan for writing that feature and I'd be happy to hear
> criticism.
>
> A simplistic way to think of a full text search engine is a strict text
> search engine, when the result of a search is a list of messages (posts)
> that contain full search keywords (if we search for "abc" only "abc" words
> are matched, so that "abcd" won't match).
> In order to accomplish that I can think of the following things to do:
> 1) We make a set of all words W
> 2) For each w \in W we have an "index" I_w which returns a list of all
> posts that contain w
>
> When we add a pure full text search functionality (indeed, we might want
> "geology" to match "geologist") we do the following:
> 1) make a set of normalized words W_n
> 2) make a set of common prefixes P
> 3) make a set of common suffixes S
> 4) define a normalization function f that strips common suffixes and
> prefixes from a word
> 5) define a matching function m that maps f(w) to a list of words from the
> W_n set
> 5.1)
> 6) for each w in W_n we keep an "index" W_n that returns all the messages M
> for which \exists x \in M : m(f(x)) = w
> 7) for each word m in new message
>
> ______________________________**_________________
> erlang-questions mailing list
> erlang-questions@REDACTED
> http://erlang.org/mailman/**listinfo/erlang-questions<http://erlang.org/mailman/listinfo/erlang-questions>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://erlang.org/pipermail/erlang-questions/attachments/20110924/e0e43a4d/attachment.htm>