[erlang-questions] very large key lookup

Mon Sep 25 22:16:20 CEST 2006

>>>>> "uw" == Ulf Wiger (TN/EAB) <Ulf> writes:

uw> ... but I fail to find how to efficiently match binaries in a
uw> match spec. This is a shame.

I've looked for such a thing, too, but haven't found it.  The only
efficient match spec is an equality/inequality test of the entire
binary.
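For reference, here is a minimal sketch of the whole-binary tests that do
work, written against a plain ETS table.  The module name, table layout
({Id, Binary}) and sample data are made up for illustration.

```erlang
-module(bin_ms).
-export([demo/0]).

demo() ->
    T = ets:new(bin_demo, [set]),
    true = ets:insert(T, [{1, <<"alpha">>}, {2, <<"beta">>}, {3, <<"alpha">>}]),
    %% Equality: the whole binary appears as a literal in the match head.
    Eq = ets:select(T, [{{'$1', <<"alpha">>}, [], ['$1']}]),
    %% Inequality: bind the field to '$2' and compare it in a guard.
    Neq = ets:select(T, [{{'$1', '$2'},
                          [{'=/=', '$2', <<"alpha">>}],
                          ['$1']}]),
    %% A 'set' table has no defined traversal order, so sort for stability.
    {lists:sort(Eq), lists:sort(Neq)}.
```

Both specs compare the entire binary; neither looks inside it, which is
the limitation being discussed.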

My only guess for the omission is that the ETS & Mnesia match spec
stuff was implemented well before the introduction of the binary data
type.  Correct, or at least near the mark?

For lists, I've been wanting something more than the ability to use
hd() and tl() in the match spec.  If you don't know exactly how long
the list is, however, other options within the confines of the match
spec language are hard (impossible?) to find.

I briefly contemplated a chain of boolean or clauses testing hd('$1'),
hd(tl('$1')), hd(tl(tl('$1'))), ... etc.  I confess that I didn't
actually test doing that, but the mere thought of doing it 30 or 40
times seemed quite repulsive.  (And my app doesn't always guarantee a
max list length of 30 or 40, so I'd be vulnerable to a miss anyway.)
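The chain described above does not have to be typed by hand; it can be
generated.  What follows is an untested-in-production sketch, assuming
rows of the form {Key, List} in a plain ETS table.  The module and
function names are hypothetical.  It uses 'orelse' rather than 'or' so
that evaluation stops at the first hit and never walks past the end of a
short list, and it relies on a failing hd()/tl() in a guard simply
making the guard fail.

```erlang
-module(ms_list_member).
-export([guard/2, demo/0]).

%% Test for Value at 0-based position I of the list bound to '$1'.
%% I = 2 gives {'=:=', {hd, {tl, {tl, '$1'}}}, {const, Value}}.
nth_test(I, Value) ->
    Tail = lists:foldl(fun(_, Acc) -> {tl, Acc} end, '$1', lists:seq(1, I)),
    {'=:=', {hd, Tail}, {const, Value}}.

%% Nest the per-position tests as T0 orelse (T1 orelse (T2 ...)).
guard(Value, MaxLen) when is_integer(MaxLen), MaxLen >= 1 ->
    Tests = [nth_test(I, Value) || I <- lists:seq(0, MaxLen - 1)],
    lists:foldr(fun(T, Acc) -> {'orelse', T, Acc} end,
                lists:last(Tests), lists:droplast(Tests)).

demo() ->
    T = ets:new(demo, [set]),
    true = ets:insert(T, [{a, [1, 2, 42]},
                          {b, [42]},
                          {c, []},
                          {d, [1, 2, 3, 42]},   % 42 sits beyond MaxLen
                          {e, [7, 8]}]),
    MS = [{{'$2', '$1'}, [guard(42, 3)], ['$2']}],
    lists:sort(ets:select(T, MS)).
```

With MaxLen = 3, row d is missed because its 42 is in the fourth
position, which is exactly the vulnerability mentioned above.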

I've seen conflicting results when using mnesia:select with the
following two constructs when the 'sec_attr' attribute has a secondary
index defined on it.  (This is off the top of my head, beware of typos.)

    match head                                guard list
    #some_record{sec_attr = 42,   _ = '_'}      []
    #some_record{sec_attr = '$1', _ = '_'}      [{'=:=', '$1', 42}]
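To make the comparison concrete, here is a small sketch that runs both
constructs against a throwaway RAM-only table and checks that they
return the same rows.  The record fields other than sec_attr, the module
name and the test data are assumptions for illustration; it does not
reproduce the discrepancy, it only gives a harness for looking for one.

```erlang
-module(sel_demo).
-export([run/0]).

%% Only sec_attr comes from the post; key and payload are made up.
-record(some_record, {key, sec_attr, payload}).

run() ->
    %% With no schema on disc, Mnesia starts with a RAM-only schema.
    ok = mnesia:start(),
    {atomic, ok} =
        mnesia:create_table(some_record,
                            [{attributes, record_info(fields, some_record)},
                             {index, [sec_attr]}]),
    _ = [ok = mnesia:dirty_write(#some_record{key = K,
                                              sec_attr = K rem 100,
                                              payload = <<>>})
         || K <- lists:seq(1, 1000)],
    %% Construct 1: the value is bound directly in the match head.
    Spec1 = [{#some_record{sec_attr = 42, _ = '_'}, [], ['$_']}],
    %% Construct 2: the field is bound to '$1' and tested in a guard.
    Spec2 = [{#some_record{sec_attr = '$1', _ = '_'},
              [{'=:=', '$1', 42}],
              ['$_']}],
    Select = fun(Spec) ->
                 {atomic, Rows} = mnesia:transaction(
                                    fun() ->
                                        mnesia:select(some_record, Spec)
                                    end),
                 lists:sort(Rows)
             end,
    R1 = Select(Spec1),
    R2 = Select(Spec2),
    {length(R1), R1 =:= R2}.
```

Wrapping each Select call in timer:tc/1 would be one way to compare the
cost of the two forms as well as their results.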

I haven't found a pattern to this, and I've been pressed for time
lately, so I'm ignoring it for now.  If/when I get enough info to
pester the list, I will.  :-)

In a 'set' table with 800K or so #some_record entries, and a secondary
index on 'sec_attr'(*), the table load time is the killer: a 2.8GHz
Opteron box takes over 5 minutes to load it.  Without the index, 30-40
seconds.  That load time is not a good thing.  I could fragment the
table, but that wouldn't change the overall load time much.  :-(  So,
I'll be doing the indexing for that attribute myself.
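One way such a hand-rolled index could look, sketched with plain ETS
tables rather than Mnesia to keep it self-contained.  This is only an
assumption about the general shape, not the actual implementation: the
module name, the {Key, SecAttr} row layout and the choice of an
ordered_set keyed on {SecAttr, Key} (instead of a bag keyed on SecAttr)
are all illustrative.

```erlang
-module(manual_index).
-export([new/0, insert/3, lookup_by_attr/2, demo/0]).

%% Main table: {Key, SecAttr}.  Index table: {{SecAttr, Key}}.
new() ->
    Main = ets:new(main, [set]),
    Idx = ets:new(idx, [ordered_set]),
    {Main, Idx}.

insert({Main, Idx}, Key, SecAttr) ->
    %% Remove the stale index entry if Key is being overwritten.
    case ets:lookup(Main, Key) of
        [{Key, Old}] -> ets:delete(Idx, {Old, Key});
        []           -> ok
    end,
    true = ets:insert(Main, {Key, SecAttr}),
    true = ets:insert(Idx, {{SecAttr, Key}}),
    ok.

%% All keys whose secondary attribute equals SecAttr, in key order.
lookup_by_attr({_Main, Idx}, SecAttr) ->
    ets:select(Idx, [{{{SecAttr, '$1'}}, [], ['$1']}]).

demo() ->
    T = new(),
    _ = [insert(T, K, K rem 3) || K <- lists:seq(1, 6)],
    Before = lookup_by_attr(T, 0),
    ok = insert(T, 3, 1),            % move key 3 from attr 0 to attr 1
    {Before, lookup_by_attr(T, 0), lookup_by_attr(T, 1)}.
```

Every index entry has a unique composite key, so an insert never has to
scan the thousands of entries that share one attribute value.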

-Scott

(*) There are about 100 unique values for that attribute, so an 800K
entry table would have approx. 8K entries per 'bag' entry for the
secondary index.  So I'm not surprised it's slow.