Hi, Oleg,<br><br>Your idea may work pretty well, but there are a few points you should consider:<br>1. with a high number of posts, you may want to switch from int (used to number the posts) to something roomier;<br>2. if you build such a system even for one language, the dictionary database will be enormous and some parts will contain no posts (therefore, you should start ranking your dictionary entries so that the least important ones sit at the end and are not necessarily loaded into RAM);<br>

3. such a mapping implies that you know that language pretty well (close to the level of a linguist);<br>4. I know of no language that doesn't have differences between the literary language and the spoken language (therefore, you may need another mapping from the spoken to the literary language).<br>

And there are other problems that come with such an approach.<br><br>As Jesper said, an AI/ML approach would help to improve your approach, if I understood his message correctly. Put simply, that means: do not create the vocabulary up front, but let vocabulary entries be added over time, each with a rank. Of course, that will not help you at all in finding meaningful prefixes and suffixes, but it will help you retrieve posts faster, as many people will hit the top of your ranking. This approach will also allow humans to decide what is more relevant and what is less relevant (let's not forget that human languages are quite illogical and, therefore, inventing algorithms to mimic them is not so easy).<br>
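<br>To make the ranking idea a bit more concrete, here is a minimal sketch in Python (not Erlang, just to keep it short). Everything in it is an assumption of mine for illustration: the class name <code>RankedVocabulary</code>, splitting on whitespace instead of real stemming, and using the number of lookups as the rank.<br>

```python
from collections import defaultdict


class RankedVocabulary:
    """Vocabulary that grows on demand and ranks entries by lookup count."""

    def __init__(self):
        self.postings = defaultdict(set)  # term -> set of post ids
        self.hits = defaultdict(int)      # term -> number of successful lookups

    def add_post(self, post_id, text):
        # Entries are created only for terms that actually occur in a post,
        # so no part of the vocabulary is empty.
        for term in text.lower().split():
            self.postings[term].add(post_id)

    def lookup(self, term):
        term = term.lower()
        if term not in self.postings:
            return set()
        self.hits[term] += 1  # every successful lookup raises the rank
        return set(self.postings[term])

    def hot_terms(self, n):
        # The n most requested terms: the candidates to keep in RAM.
        ranked = sorted(self.hits, key=lambda t: (-self.hits[t], t))
        return ranked[:n]
```

<br>Whatever <code>hot_terms</code> returns would stay in memory, and the rest could live on disk. A real system would need proper tokenizing and stemming on top of this.<br>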

<br>I would also add a system for incrementing the post number that allows expansion beyond the sizes of the basic data types (maybe pagination wouldn't be a bad idea).<br><br>Well, that's my 2c. I hope it helps you think your idea through.<br>
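<br>On the post-number point, one possible reading of "pagination" is to number a post by a (page, offset) pair, so that no single counter has to outgrow a basic data type. A rough sketch in Python follows; the names <code>PAGE_SIZE</code>, <code>next_post_id</code> and <code>to_global</code> are made up, and the 32-bit page capacity is just an assumption. Python integers do not overflow, so this only illustrates the bookkeeping.<br>

```python
PAGE_SIZE = 2 ** 32  # assumed capacity of one page (one 32-bit counter)


def next_post_id(post_id):
    """Return the id that follows post_id, where an id is (page, offset)."""
    page, offset = post_id
    if offset + 1 == PAGE_SIZE:
        # The current page is full: continue at the start of the next page.
        return (page + 1, 0)
    return (page, offset + 1)


def to_global(post_id):
    """Flatten a (page, offset) id into a single running number."""
    page, offset = post_id
    return page * PAGE_SIZE + offset
```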

<br>Good luck!<br>CGS<br><br><br><br><div class="gmail_quote">On Sat, Sep 24, 2011 at 5:27 PM, Jesper Louis Andersen <span dir="ltr"><<a href="mailto:jesper.louis.andersen@gmail.com">jesper.louis.andersen@gmail.com</a>></span> wrote:<br>

<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;"><div class="im">On Sat, Sep 24, 2011 at 01:58, Oleg Chernyh <<a href="mailto:erlang@udaff.com">erlang@udaff.com</a>> wrote:<br>


<br>

> I'm far from linguistics and full text search engines, so your input was<br>

> particularly useful.<br>

> And what about my idea that I have briefly described?<br>

<br>

</div>I tried to be adept and avoid answering those parts because I know<br>

very little about it. It looks like you do some stemming of words to<br>

find their stem and then you index those. But I don't know what you do<br>

to achieve it. If you pick a language, like English, there are probably<br>

two ways to go: 1. Form a set of rules and apply those rules to<br>

"normalize"/"canonicalize"/"extract-the-stem". 2. Play the google<br>

game: If you have enough data to mine statistically, figure out what<br>

the stems are via machine learning.<br>

<br>

Historically, judging by papers at AI/ML conferences, it looks like<br>

option 2 wins :P<br>

<font color="#888888"><br>

--<br>

J.<br>

_______________________________________________<br>

</font><div><div></div><div class="h5">erlang-questions mailing list<br>

<a href="mailto:erlang-questions@erlang.org">erlang-questions@erlang.org</a><br>

<a href="http://erlang.org/mailman/listinfo/erlang-questions" target="_blank">http://erlang.org/mailman/listinfo/erlang-questions</a><br>

</div></div></blockquote></div><br>