Jesper, thank you for the insights.<br>I'm far from linguistics and full text search engines, do your input was particulary useful.<br>And what about my idea that I have briefly described?<br>I'm sure that it has many flaws such as m-lookup and re-indexing problem, but that's how I store non-full text search indexes at the moment such as tag searches and searches by author. So my mind was spinning around the idea of homogeneous O(1) indexes.<br>To those who suggest using 3rd party storages - too much of the current logic relies on a "native" storage I currently use so I'd be glad to stick with it.<br><br>----- Reply message -----<br>From: "Jesper Louis Andersen" <jesper.louis.andersen@gmail.com><br>Date: Fri, Sep 23, 2011 7:58 pm<br>Subject: [erlang-questions] Full text search in KV-storage.<br>To: "Oleg Chernyh" <erlang@udaff.com><br>Cc: <erlang-questions@erlang.org><br><br><br>On Fri, Sep 23, 2011 at 18:19, Oleg Chernyh <erlang@udaff.com> wrote:<br><br><br>> 1) We make a set of all words W<br><br>Two structures immediately suggest themselves:<br><br>* Tries. That is store a digital tree on the word. "CAT" will search<br>in a 26-way tree through the C-edge, the A-edge and the T-edge. If you<br>need a base 62 or a base 36 match, it may be beneficial to compress<br>nodes that are not totally full. An evil way is to use an ets table<br>with the compressed flag set and call ets:lookup_element/3 on the<br>beast. This is how a spelling checker often stores its dictionary.<br><br>* Bloom filter based structures. Look at how postgresql does it with<br>GiST indexes for instance:<br><br><a href="http://www.postgresql.org/docs/8.3/static/textsearch-indexes.html">http://www.postgresql.org/docs/8.3/static/textsearch-indexes.html</a><br><br>For each w \in document, you hash word, h(w) = n, and set the n'th bit<br>in a bit string. Then you or all hashes together \/_{w \in D} h(w).<br>This gives an approximate match, but hopefully it weeds out a lot of<br>things you need to search and the amount of false positives is<br>probably small. These bit strings can be stored as e.g., crit-bit<br>trees (also called radix or patricia trees - another trie-structure,<br>but this time binary). See for instance,<br><br><a href="http://cr.yp.to/critbit.html">http://cr.yp.to/critbit.html</a> by Dan Bernstein.<br><br>I did a crit-bit tree structure in Erlang some months ago,<br><a href="https://github.com/jlouis/bench-map/blob/master/src/patricia.erl">https://github.com/jlouis/bench-map/blob/master/src/patricia.erl</a> and<br>also one based on Hash-array Mapped Tries (HAMT, the basic data<br>structure in Clojures dictionaries),<br><a href="https://github.com/jlouis/bench-map/blob/master/src/hamt.erl">https://github.com/jlouis/bench-map/blob/master/src/hamt.erl</a> .. While<br>they do not outperform dict, gb_trees and Robert Virdings RB trees for<br>the workloads I've tried, the performance isn't too bad and they will<br>probably pack much better for this kind of thing.<br><br>Do note however that this is quite specialized and you can probably<br>win some raw speed by using a pre-baked solution. My suggestions<br>should take you up in speed by quite a lot, however.<br><br>-- <br>J.<br><br><br>