[erlang-questions] tcerl memory usage, plans for hash-type db

Scott Lystig Fritchie fritchie@REDACTED
Fri Aug 22 21:57:09 CEST 2008


Hi, all.  Sorry for adding my $0.02 a week or so late.  I can claim only
observations and comments, since I haven't used TC or mnesia_ext
myself.  However, I have been helping a colleague who has used both and
is finishing a mnesia_ext backend for Berkeley DB.  (Yes, we hope to
open-source it (I made a verb!), but no decisions yet, sorry.)

Paul Mineiro <paul-trapexit@REDACTED> wrote:

pm> Re: sync times, well a couple of seconds is perhaps not out of line,
pm> depending upon the amount of data.

I have the same difficulty with a Berkeley DB application.  A customer
is very sensitive about the app's maximum latency: average latency
must stay under 1/4 second, and no operation may ever exceed 1/3
second.  Fairly harsh expectations.

Berkeley DB's environment can maintain a page cache.  All dirty pages
(or fractions of pages) are first written, synchronously by default,
to the transaction log ("write-ahead logging").  But a dirty
hash/B-tree page in the cache will not be written to the hash/B-tree
file until there is:
   1. a cache eviction under pressure
   2. a checkpoint (which forces all dirty pages to be written)

#1 can't be helped, short of increasing the size of the cache or
reducing the number of pages that you modify.
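
(As a concrete aside, enlarging the cache looks roughly like this
with the Berkeley DB 4.x C API; the environment path and sizes below
are made up, and error handling is elided.)

    #include <db.h>

    /* Sketch: open a transactional environment with a 1GB cache.
       set_cachesize() must be called before DB_ENV->open(). */
    DB_ENV *open_env_with_big_cache(void)
    {
        DB_ENV *dbenv;

        db_env_create(&dbenv, 0);
        /* (gbytes, bytes, ncache): 1GB of cache in one region */
        dbenv->set_cachesize(dbenv, 1, 0, 1);
        dbenv->open(dbenv, "/path/to/env",
                    DB_CREATE | DB_INIT_MPOOL | DB_INIT_LOG |
                    DB_INIT_LOCK | DB_INIT_TXN | DB_RECOVER, 0);
        return dbenv;
    }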

#2 is an atomic bomb blast if you have a multi-gigabyte B-tree table
file with tens of thousands (or more) of dirty pages.  DB provides a
function that writes dirty pages until a given percentage of the
cache is clean (DB_ENV->memp_trickle), but if you have a 1GB memory
pool and a table using 4KB pages, 1% is still 2,622 pages.  You can
sleep between such 1% flush increments, but...

... flushing even a "tiny" number of pages like that, placed quite
randomly in the file (as far as the file system is concerned), makes
a 1/3 second latency ceiling feel very, very low.  :-(
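
To make the trickle-plus-sleep idea concrete, here's roughly what
that loop looks like against the Berkeley DB 4.x C API (an untested
sketch; the 100ms sleep is a made-up tuning knob, and error handling
is elided):

    #include <db.h>
    #include <unistd.h>

    /* Sketch: raise the clean-page target 1% at a time.  Each
       memp_trickle() call writes just enough dirty pages so that
       at least `pct` percent of the cache is clean. */
    void trickle_slowly(DB_ENV *dbenv)
    {
        int pct, nwrote;

        for (pct = 1; pct <= 100; pct++) {
            if (dbenv->memp_trickle(dbenv, pct, &nwrote) != 0)
                break;
            /* Even a 1% step can mean thousands of page writes
               (~2,622 for a 1GB pool of 4KB pages), so pause to
               spread the I/O over time. */
            usleep(100 * 1000);
        }
    }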

Sorry about the digression, but if TokyoCabinet has a similar page cache
and flushing strategy, then you may be out of luck.  TC may have a
better chance of adding a "flush the cache, but only N pages/sec"
feature.
That will actually increase the wall-clock time for a full cache flush,
but if you've got zillions of dirty pages and you really need a full
flush, you don't have much choice.  (See "more-background-info" at the
end.)

pm> As you might already have found,
pm> I don't get the durability story of tokyocabinet
pm> (http://groups.google.com/group/mnesiaex-discuss/browse_thread/thread/da2ae1da862b01c0)
pm> but we use distributed mnesia and have multiple copies of any table
pm> so we basically never locally load a table on restart.

Paul, if your tables are big enough, then even a load from a remote host
will incur lots of disk I/O.  Is that a problem you've had to worry
about already, or something lurking in the future?

I've only taken a glimpse at mnesia_ext, but the lack of coordination with
Mnesia's transaction manager is A Problem, in my (still naive) opinion.
Even if TC's disk structures were 100% crash-safe, aren't there still
sync problems between the local txn manager's log and any mnesia_ext
backend (as the callbacks exist today)?

-Scott

<more-background-info>

Here's some info that I got from a Doug Cutting talk about the Lucene
indexer(*).  He was talking about seek-based databases (built on
B-trees) versus "old-school" merge-sorting (which Lucene uses).

    Assume: 10MB/s transfer, 10ms per seek, 1% of pages requiring
            update, in a 1TB database.
            100-byte entries, 10kB pages; 10 billion entries,
            1 billion pages.
    Then:
      @ 1 seek per update requires 1000 days!
      @ 1 seek per page requires 100 days! (assumes multiple
            updates/page)
      @ transferring the entire db takes 1 day (one sequential read
            of 1 file)
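
(For scale: 10 billion seeks at 10ms apiece is 10^8 seconds, on the
order of 1,000 days; 1 billion seeks is 10^7 seconds, on the order
of 100 days; and reading 1TB sequentially at 10MB/s is 10^5 seconds,
a bit over a day.)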

Summary: Updating B-trees will never scale well using Winchester disks.

In a 2003 interview in ACM Queue, Jim Gray said:

  "... programmers have to start thinking of the disk as a sequential
  device rather than a random access device."

In a pithier form, I've read folks saying, "Disk is the new tape."

(*) Er, it was a summer 2008 presentation to a conference in Helsinki.
I forget the name of the conference, but his presentation slides and
video are available online (not YouTube/Google Video, but I may be
mistaken).

</more-background-info>



