[erlang-questions] Full text search in KV-storage.

Sat Sep 24 23:04:34 CEST 2011

Hi, Oleg,

Your idea may work pretty well, but there are a few points you should
consider:
1. with a high number of posts, you may want to switch from int (used for
numbering the posts) to something roomier;
2. if you build such a system even for one language, the dictionary database
will be enormous and some parts will contain no posts (therefore, you should
start ranking your dictionary entries, so that the least important ones sit
at the end and need not be loaded in RAM);
3. such a mapping implies that you know that language pretty well (pretty
close to the level of a linguist);
4. I know of no language which doesn't have differences between the
literary language and the spoken language (therefore, you may need another
mapping from the spoken to the literary language).
And there are other problems coming from such an approach.

As Jesper said, an AI/ML approach would help improve your approach, if I
understood his message correctly. Put simply, that means you do not create
a vocabulary up front, but allow vocabulary entries to be added with a
ranking on them. Of course, that will not help you at all in finding
meaningful prefixes and suffixes, but it will help you retrieve posts
faster, since many people will hit the top of your ranking system. This
approach will also allow humans to decide what is more relevant and what is
less relevant (let's not forget human languages are quite illogical and,
therefore, inventing algorithms to mimic human languages is not so easy).
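
To make the ranking idea a bit more concrete, here is a minimal sketch in
Python (not Erlang, just for brevity). Everything in it is my assumption:
the class name RankedIndex, the whitespace tokenizer, and the use of query
counts as the ranking signal. It only shows a vocabulary that grows as posts
arrive and a ranking driven by what people actually search for.

```python
from collections import defaultdict


class RankedIndex:
    """Inverted index whose vocabulary grows as posts arrive (illustrative)."""

    def __init__(self):
        self.postings = defaultdict(set)  # term -> ids of posts containing it
        self.hits = defaultdict(int)      # term -> how often it was searched

    def add_post(self, post_id, text):
        # Naive tokenization (an assumption): lowercase, split on whitespace.
        for term in text.lower().split():
            self.postings[term].add(post_id)

    def search(self, term):
        term = term.lower()
        if term not in self.postings:
            return set()
        self.hits[term] += 1              # every successful lookup raises the rank
        return set(self.postings[term])   # return a copy, not the internal set

    def hot_terms(self, n):
        # Most searched terms first; ties broken alphabetically.
        ranked = sorted(self.hits.items(), key=lambda kv: (-kv[1], kv[0]))
        return [term for term, _ in ranked[:n]]
```

The terms returned by hot_terms(n) would be the candidates to keep in RAM,
with everything else left in the KV storage.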

I would also add a system for incrementing the post number that allows
expansion beyond the sizes of the basic data types (maybe pagination
wouldn't be a bad idea).
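
One possible reading of the pagination idea, again as a rough Python
sketch: split each term's list of post numbers into fixed-size pages, each
stored under its own key, so no single value in the KV storage grows without
bound. The key layout, PAGE_SIZE, and both function names are hypothetical,
and a plain dict stands in for the real storage. Python integers do not
overflow, so the sketch says nothing about the width of the post number
itself.

```python
PAGE_SIZE = 3  # deliberately tiny for illustration; a real value would be larger


def append_posting(store, term, post_id):
    """Append post_id to the last page of term's posting list."""
    meta_key = ("meta", term)             # holds the index of the last page
    last_page = store.get(meta_key, 0)
    page_key = ("page", term, last_page)
    page = store.get(page_key, [])
    if len(page) >= PAGE_SIZE:            # last page is full, start a new one
        last_page += 1
        page_key = ("page", term, last_page)
        page = []
    store[page_key] = page + [post_id]
    store[meta_key] = last_page


def read_postings(store, term):
    """Yield all post ids for term, reading one page at a time."""
    last_page = store.get(("meta", term))
    if last_page is None:                 # term has never been indexed
        return
    for n in range(last_page + 1):
        yield from store.get(("page", term, n), [])
```

A reader that only needs the first few results can stop after the first
page instead of loading the whole list.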

Well, that's my 2c. I hope it helps you think your idea through.

Good luck!
CGS

On Sat, Sep 24, 2011 at 5:27 PM, Jesper Louis Andersen <
jesper.louis.andersen@REDACTED> wrote:

> On Sat, Sep 24, 2011 at 01:58, Oleg Chernyh <erlang@REDACTED> wrote:
>
> > I'm far from linguistics and full text search engines, so your input was
> > particularly useful.
> > And what about my idea that I have briefly described?
>
> I tried to be adept and avoid answering those parts because I know
> very little about it. It looks like you do some stemming of words to
> find their stem and then you index those. But I don't know what you do
> to achieve it. If you pick a language, like English, there are probably
> two ways to go: 1. Form a set of rules and apply those rules to
> "normalize"/"canonicalize"/"extract-the-stem". 2. Play the google
> game: If you have enough data to mine statistically, figure out what
> the stems are via machine learning.
>
> Historically, judging by papers at AI/ML conferences, it looks like
> option 2 wins :P
>
> --
> J.