[erlang-questions] Full text search in KV-storage.

Sat Sep 24 19:45:09 CEST 2011

I implemented this on Redis with node.js last year. It was pretty easy, and
worked OK, but it wasn't Amazon Product Search by any means.

The mapping from word -> document is going to be pretty big in RAM, because
the value for each key needs to be your unique identifier for each document
-- I imagine you don't want to die when you reach 4 billion posts.
(Actually, on Erlang, I think the limit for small ints is 28 bits, or about
256 million)

Most sites that operate these things for real, though, just use some
off-the-shelf index, like Apache Lucene or whatever (Solr wraps that pretty
well). Yes, it's another system, but it's a system that already works and
scales :-)
Or they delegate to Google Site Search.

Sincerely,

jw

--
Americans might object: there is no way we would sacrifice our living
standards for the benefit of people in the rest of the world. Nevertheless,
whether we get there willingly or not, we shall soon have lower consumption
rates, because our present rates are unsustainable.

On Fri, Sep 23, 2011 at 10:42 AM, Oleg Chernyh <erlang@REDACTED> wrote:

> I'm writing a forum engine that can sustain high amount of users making
> queries at the same time.
> I'm using a KV storage that stores erlang entities which is the way to kill
> overheads and make data accessible really fast (if we know the key, ofc).
> The problem I'm facing is full-text search, as you might guess.
> I don't really want to use any external indexers or databases because I
> want that piece of software to be scalable node-wise and don't want to add
> any more data storage entites to the system to prevent data duplication and
> to ensure consistency.
> So what I want to do is to "invent bicycle" by implementing a full text
> search on a key-value storage.
>
> Here I'll outline my plan for writing that feature and I'd be happy to hear
> criticism.
>
> A simplistic way to think of a full text search engine is a strict text
> search engine, when the result of a search is a list of messages (posts)
> that contain full search keywords (if we search for "abc" only "abc" words
> are matched, so that "abcd" won't match).
> In order to accomplish that I can think of the following things to do:
> 1) We make a set of all words W
> 2) For each w \in W we have an "index" I_w which returns a list of all
> posts that contain w
>
> When we add a pure full text search functionality (indeed, we might want
> "geology" to match "geologist") we do the following:
> 1) make a set of normalized words W_n
> 2) make a set of common prefixes P
> 3) make a set of common suffixes S
> 4) define a normalization function f that strips common suffixes and
> prefixes from a word
> 5) define a matching function m that maps f(w) to a list of words from the
> W_n set
> 5.1) if there is no adequate entry for f(w), m function adds f(w) to W_n
> 6) for each w in W_n we keep an "index" W_n that returns all the messages M
> for which \exists x \in M : m(f(x)) = w
> 7) for each word x in new message we asynchronously run m(f(x)) and update
> indexes appropriately
> 8) from time to time we re-index our W_n maps merging indexes for similar
> words
>
> What are the pitfalls of that approach? Any feedback will be appreciated.
> -
> Oleg
>
>
> ______________________________**_________________
> erlang-questions mailing list
> erlang-questions@REDACTED
> http://erlang.org/mailman/**listinfo/erlang-questions<http://erlang.org/mailman/listinfo/erlang-questions>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://erlang.org/pipermail/erlang-questions/attachments/20110924/9d8b1605/attachment.htm>