<div>I implemented this on Redis with node.js last year. It was pretty easy, and worked OK, but it wasn't Amazon Product Search by any means.</div><div> </div><div>The mapping from word -> document is going to be pretty big in RAM, because the value for each key needs to be your unique identifier for each document -- I imagine you don't want to die when you reach 4 billion posts. (Actually, on Erlang, I think the limit for small ints is 28 bits, or about 256 million)</div>
<div> </div><div>Most sites that operate these things for real, though, just use some off-the-shelf index, like Apache Lucene or whatever (Solr wraps that pretty well). Yes, it's another system, but it's a system that already works and scales :-)</div>
<div>Or they delegate to Google Site Search.</div><div> </div><div>Sincerely,</div><div> </div><div>jw</div><div> </div><div><br clear="all"><br>--<br>Americans might object: there is no way we would sacrifice our living standards for the benefit of people in the rest of the world. Nevertheless, whether we get there willingly or not, we shall soon have lower consumption rates, because our present rates are unsustainable. <br>
<br>
<br><br></div><div class="gmail_quote">On Fri, Sep 23, 2011 at 10:42 AM, Oleg Chernyh <span dir="ltr"><<a href="mailto:erlang@udaff.com">erlang@udaff.com</a>></span> wrote:<br><blockquote style="margin: 0px 0px 0px 0.8ex; padding-left: 1ex; border-left-color: rgb(204, 204, 204); border-left-width: 1px; border-left-style: solid;" class="gmail_quote">
I'm writing a forum engine that can sustain high amount of users making queries at the same time.<br>
I'm using a KV storage that stores erlang entities which is the way to kill overheads and make data accessible really fast (if we know the key, ofc).<br>
The problem I'm facing is full-text search, as you might guess.<br>
I don't really want to use any external indexers or databases because I want that piece of software to be scalable node-wise and don't want to add any more data storage entites to the system to prevent data duplication and to ensure consistency.<br>
So what I want to do is to "invent bicycle" by implementing a full text search on a key-value storage.<br>
<br>
Here I'll outline my plan for writing that feature and I'd be happy to hear criticism.<br>
<br>
A simplistic way to think of a full text search engine is a strict text search engine, when the result of a search is a list of messages (posts) that contain full search keywords (if we search for "abc" only "abc" words are matched, so that "abcd" won't match).<br>
In order to accomplish that I can think of the following things to do:<br>
1) We make a set of all words W<br>
2) For each w \in W we have an "index" I_w which returns a list of all posts that contain w<br>
<br>
When we add a pure full text search functionality (indeed, we might want "geology" to match "geologist") we do the following:<br>
1) make a set of normalized words W_n<br>
2) make a set of common prefixes P<br>
3) make a set of common suffixes S<br>
4) define a normalization function f that strips common suffixes and prefixes from a word<br>
5) define a matching function m that maps f(w) to a list of words from the W_n set<br>
5.1) if there is no adequate entry for f(w), m function adds f(w) to W_n<br>
6) for each w in W_n we keep an "index" W_n that returns all the messages M for which \exists x \in M : m(f(x)) = w<br>
7) for each word x in new message we asynchronously run m(f(x)) and update indexes appropriately<br>
8) from time to time we re-index our W_n maps merging indexes for similar words<br>
<br>
What are the pitfalls of that approach? Any feedback will be appreciated.<br>
-<br>
Oleg<br>
<br>
<br>
______________________________<u></u>_________________<br>
erlang-questions mailing list<br>
<a href="mailto:erlang-questions@erlang.org" target="_blank">erlang-questions@erlang.org</a><br>
<a href="http://erlang.org/mailman/listinfo/erlang-questions" target="_blank">http://erlang.org/mailman/<u></u>listinfo/erlang-questions</a><br>
</blockquote></div><br>