[erlang-questions] Full text search in KV-storage.
Oleg Chernyh
erlang@REDACTED
Fri Sep 23 19:42:43 CEST 2011
I'm writing a forum engine that can sustain high amount of users
making queries at the same time.
I'm using a KV storage that stores erlang entities which is the way to
kill overheads and make data accessible really fast (if we know the
key, ofc).
The problem I'm facing is full-text search, as you might guess.
I don't really want to use any external indexers or databases because
I want that piece of software to be scalable node-wise and don't want
to add any more data storage entites to the system to prevent data
duplication and to ensure consistency.
So what I want to do is to "invent bicycle" by implementing a full
text search on a key-value storage.
Here I'll outline my plan for writing that feature and I'd be happy to
hear criticism.
A simplistic way to think of a full text search engine is a strict
text search engine, when the result of a search is a list of messages
(posts) that contain full search keywords (if we search for "abc" only
"abc" words are matched, so that "abcd" won't match).
In order to accomplish that I can think of the following things to do:
1) We make a set of all words W
2) For each w \in W we have an "index" I_w which returns a list of all
posts that contain w
When we add a pure full text search functionality (indeed, we might
want "geology" to match "geologist") we do the following:
1) make a set of normalized words W_n
2) make a set of common prefixes P
3) make a set of common suffixes S
4) define a normalization function f that strips common suffixes and
prefixes from a word
5) define a matching function m that maps f(w) to a list of words from
the W_n set
5.1) if there is no adequate entry for f(w), m function adds f(w) to W_n
6) for each w in W_n we keep an "index" W_n that returns all the
messages M for which \exists x \in M : m(f(x)) = w
7) for each word x in new message we asynchronously run m(f(x)) and
update indexes appropriately
8) from time to time we re-index our W_n maps merging indexes for
similar words
What are the pitfalls of that approach? Any feedback will be appreciated.
-
Oleg
More information about the erlang-questions
mailing list