mnesia text search ?

Mon Apr 24 13:05:38 CEST 2006

ke han wrote:
> 
> I have a table named product and want to be able to find all 
> product records where fields name, description and customerId 
> are searched for inclusion of a string.  That is, regexp 
> match in its simplest form.
> 
> I don't care if the current implementation is efficient.  I 
> just need a simple way to get it done.
> 
> BTW, I notice that Ulf's rdbms has some searching features.  
> Is this code in a working state?  Any examples?

There is a test suite, and I have some beta testers.
I'm currently revising the indexing callback structure
to allow an "indexing pipeline" (being able to
index on derived info, such as a parsed form of wiki
text, etc.). The change will affect the second argument
of the indexing callback: it will become a proplist
from which you can fetch the results of intermediate
steps. One effect is that any attribute index
will be able to fetch the entire object through the
proplist, since element(Pos, Obj) will become one
such intermediate calculation step.

Rdbms contains some modules for free text searches, but
they haven't been properly integrated yet. My hope is 
that it will eventually support word stem indexes. The
current challenge is how to handle frequency values, 
since they are not local to the object.
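
Until then, if efficiency really doesn't matter, a plain scan
of the table is probably the simplest way to do what you
describe. A minimal, untested sketch: the record layout (an id
key plus the three fields, all holding strings) and the module
name product_search are assumptions, and it does a substring
test rather than a regexp match.

```erlang
-module(product_search).
-export([search/1, matches/2]).

%% Assumed record layout; only the three searched field names come
%% from the question, and all three are assumed to hold strings.
-record(product, {id, name, description, customerId}).

%% Full scan of the product table inside a transaction.
%% Returns every record where Needle occurs in one of the fields.
search(Needle) ->
    Scan = fun() ->
                   mnesia:foldl(
                     fun(P, Acc) ->
                             case matches(P, Needle) of
                                 true  -> [P | Acc];
                                 false -> Acc
                             end
                     end, [], product)
           end,
    {atomic, Found} = mnesia:transaction(Scan),
    Found.

%% Plain substring test on the three fields.
matches(#product{name = N, description = D, customerId = C}, Needle) ->
    lists:any(fun(Field) -> string:find(Field, Needle) =/= nomatch end,
              [N, D, C]).
```

For real regexp matching, the string:find/2 test in matches/2
could be swapped for re:run/2. Either way every record is
visited, so this only makes sense for small tables.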

If you're looking for ways to do word lookups, then
here's an example from the rdbms test suite, slightly 
cleaned up and with a few explanatory comments:

rdbms_ix(Config) when is_list(Config) ->
    {atomic,ok} = rdbms:create_table(
                    ix1,
                    [{disc_copies, [node()]},
                     {attributes, [key, value]},
                     {rdbms, 
                      [
                       {indexes, 
                        [{{value,words},
                          ?MODULE,word_attr_ix,[],[]}]}
                      ]}
                    ]),
    Tab = ix1,
    %% create a few objects with 1-letter "words".
    %% O1 and O2 both contain the word "a".
    O1 = {Tab,1,"a b c d"},
    O2 = {Tab,2,"a b"},
    O3 = {Tab,3,"b c d"},
    O4 = {Tab,4,"e f"},
    O5 = {Tab,5,"f g"},

    %% trans/1 is a helper in the test suite (not shown here)
    trans(fun() ->
                  lists:foreach(
                    fun(O) ->
                            mnesia:write(O)
                    end, [O1,O2,O3,O4,O5])
          end),

    %% try index lookups using the word index
    trans(fun() ->
                  [O1,O2,O3] =
                      lists:sort(
                        mnesia:index_read(Tab, "b", {value,words})),
                  [O1,O3] =
                      lists:sort(
                        mnesia:index_read(Tab, "d", {value,words}))
          end).

%% the index callback
word_attr_ix(Str, _) ->
    [to_lower(W) || W <- string:tokens(Str, " \t\r\n")].

to_lower(Word) ->
    lists:map(fun(C) when $A =< C, C =< $Z ->
                      $a + (C - $A);
                 (C) ->
                      C
              end, Word).

(One may observe that even this indexing function
is really a two-step process: convert to lowercase +
extract words, except that the code above does the
two steps in the reverse order.)
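
Written as an explicit sequence of steps in the order just
described, the callback might look like the sketch below. The
module name word_ix and the helpers lower/1 and words/1 are
made up for illustration; this is not the pipeline API that
rdbms will get, just the same callback split in two.

```erlang
-module(word_ix).
-export([word_attr_ix/2, lower/1, words/1]).

%% Step 1: lowercase the whole string (ASCII letters only).
lower(Str) ->
    [lower_char(C) || C <- Str].

lower_char(C) when $A =< C, C =< $Z -> C + ($a - $A);
lower_char(C) -> C.

%% Step 2: split the string into words on whitespace.
words(Str) ->
    string:tokens(Str, " \t\r\n").

%% Same callback shape as before: feed the attribute value through
%% each step in turn and return the list of index values.
word_attr_ix(Str, _) ->
    Steps = [fun lower/1, fun words/1],
    lists:foldl(fun(Step, Acc) -> Step(Acc) end, Str, Steps).
```

With this, word_attr_ix("A b\tC", []) returns ["a","b","c"],
the same index values the original callback would produce.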

BR,
Ulf W