[erlang-questions] Search engine in Erlang

Tue Aug 28 00:34:21 CEST 2007

2007/8/27, Joel Reymont <joelr1@REDACTED>:
> > So how many documents, how many terms? distributed on how many
> > machines?
> Something built for a potentially large social network.

It is really difficult to guesstimate what is important, but easily
scaling by adding more machines is desirable. The winning approach is
to partition the documents to be indexed over the machines, instead of
partitioning the terms.

The way you order the posting list for a term is important, for
example, if you order by most-recently at the top you will be more
efficient at querying for the last N documents containing a term,
since it takes less IO.

Similarly, if you are going to get results out in some
relevancy-order, you want to have a posting list presorted so most
relevant documents are first. But then you also want to include the
relevancy measure of the document in the posting list. This allows you
to merge posting lists into one posting list still sorted by relevancy
cheaply.

As there is a tradeoff between cheap-to-write and cheap-to-read you
also want to have two different posting lists specialized for each, so
you can delay updating the compact cheap-to-read posting list by
posting documents to the cheap-to-write lists until it is crowded
or old. Lucene uses this to speed up adding documents. Oh, this
cheap-to-write store needs a way to indicate that a document has been
removed.

These are the tricks i know of for inverted indexes. The terminology i
use is that term is an integer mapped to a search word/tag, document
is an integer mapped to an object you want to find, and posting list
is the term and list of documents where it occurs.