[erlang-questions] tcerl memory usage, plans for hash-type db
Scott Lystig Fritchie
Fri Aug 22 21:57:09 CEST 2008
Hi, all. Sorry for adding my $0.02 a week or so late. I can claim only
observations and comments, since I haven't used TC or mnesia_ext
myself. However, I have been helping a colleague who has used both and
has been finishing a mnesia_ext backend for Berkeley DB. (Yes, we hope
to open source it, but (I made a verb!) no decisions yet, sorry.)
Paul Mineiro wrote:
pm> Re: sync times, well a couple of seconds is perhaps not out of line,
pm> depending upon the amount of data.
I have the same difficulty with a Berkeley DB application. A customer
is very sensitive about maximum latency of the app: average latency must
be under 1/4 second, and absolutely no maximum latency above 1/3
second. Fairly harsh expectations.
Berkeley DB's environment can maintain a page cache. All dirty pages
(or fractions of pages) are written via "write-ahead" to the transaction
log, synchronously by default. But the dirty hash/B-tree page in the
cache will not be written to the hash/B-tree file until there is:
1. a cache eviction under pressure
2. a checkpoint (which forces all dirty pages to be written)
#1 can't be helped, short of increasing the size of the cache or
reducing the number of pages that you modify.
#2 is an atomic weapon bomb blast if you have a multi-gigabyte B-tree
table file that has tens of thousands (or more) of dirty pages. DB
provides a function that writes dirty pages until X percent of the cache is clean
(DB_ENV->memp_trickle), but if you have a 1GB memory pool and a table
using 4KB pages, 1% is still 2,622 pages. You can sleep between such 1%
flush increments, but...
... flushing even a "tiny" number of pages like that, with quite random
page placement (as far as the file system is concerned), and then a 1/3
second latency ceiling feels very, very low. :-(
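To put rough numbers on that, here is a small Python sketch of the arithmetic. It assumes a 1GB (2^30 byte) pool, 4KB pages and, purely as an assumption borrowed from the disk figures later in this message, about 10ms per randomly placed page write. The function names are made up for illustration; nothing here is a Berkeley DB API.

```python
import math

POOL_BYTES = 2**30          # 1GB memory pool
PAGE_BYTES = 4 * 1024       # 4KB pages
SEEK_SECONDS = 0.010        # ASSUMPTION: ~10ms per random page write
LATENCY_CEILING = 1 / 3     # the customer's hard limit, in seconds


def pages_for_percent(pool_bytes, page_bytes, percent):
    """Number of pages covered by `percent` of the pool, rounded up."""
    total_pages = pool_bytes // page_bytes
    return math.ceil(total_pages * percent / 100)


def flush_seconds(pages, seek_seconds):
    """Time to write `pages` pages if every write costs one seek."""
    return pages * seek_seconds


def pages_within_budget(budget_seconds, seek_seconds):
    """How many seek-bound page writes fit inside a latency budget."""
    return int(budget_seconds / seek_seconds)


if __name__ == "__main__":
    one_percent = pages_for_percent(POOL_BYTES, PAGE_BYTES, 1)
    print(one_percent)                                        # pages in 1%
    print(flush_seconds(one_percent, SEEK_SECONDS))           # seconds of I/O
    print(pages_within_budget(LATENCY_CEILING, SEEK_SECONDS)) # pages per 1/3 s
```

Under those assumptions a 1% step is about 26 seconds of random writes, while a 1/3 second budget covers only about 33 of them.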
Sorry about the digression, but if TokyoCabinet has a similar page cache
and flushing strategy, then you may be out of luck. TC may have better
chances of adding a "flush cache but only flush N pages/sec" feature.
That will actually increase the wall-clock time for a full cache flush,
but if you've got zillions of dirty pages and you really need a full
flush, you don't have much choice. (See "more-background-info" at the
end of this message.)
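A minimal sketch of what a "flush only N pages/sec" loop could look like, in Python. This is not a TokyoCabinet or Berkeley DB API: `throttled_flush`, `write_page` and `pages_per_sec` are hypothetical names, and the `sleep` and `clock` parameters exist only so the pacing can be exercised without real waiting.

```python
import time


def throttled_flush(dirty_pages, write_page, pages_per_sec,
                    sleep=time.sleep, clock=time.monotonic):
    """Write every page, but start at most `pages_per_sec` writes per
    one-second window. Returns the number of pages written.

    `write_page` is a hypothetical callback that performs the real I/O.
    """
    if pages_per_sec <= 0:
        raise ValueError("pages_per_sec must be positive")

    written = 0
    in_window = 0
    window_start = clock()
    for page in dirty_pages:
        if in_window >= pages_per_sec:
            # Quota for this window is used up: wait out the rest of the
            # second so other I/O gets a chance, then open a new window.
            remaining = 1.0 - (clock() - window_start)
            if remaining > 0:
                sleep(remaining)
            window_start = clock()
            in_window = 0
        write_page(page)
        in_window += 1
        written += 1
    return written


if __name__ == "__main__":
    flushed = []
    count = throttled_flush(range(25), flushed.append, pages_per_sec=10)
    print(count, "pages written")
```

The cost is the one described above: at, say, 100 pages/sec, the 262,144 pages of a fully dirty 1GB pool of 4KB pages would take roughly 44 minutes to flush.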
pm> As you might already have found,
pm> I don't get the durability story of tokyocabinet
pm> but we use distributed mnesia and have multiple copies of any table
pm> so we basically never locally load a table on restart.
Paul, if your tables are big enough, then even a load from a remote host
will incur lots of disk I/O. Is that a problem you've had to worry
about already, or something lurking in the future?
I've only taken a glimpse at mnesia_ext, but the lack of coordination with
Mnesia's transaction manager is A Problem, in my (still naive) opinion.
Even if TC's disk structures were 100% crash-safe, aren't there still
sync problems between the local txn manager's log and any mnesia_ext
backend (as the callbacks exist today)?
Here's some info that I got from a Doug Cutting talk about the Lucene
indexer(*). He's talking about seek-based databases (based on B-trees)
versus "old-school" merge-sorting (that Lucene uses).
Assume: If 10MB/s xfer, 10ms seek, 1% of pages require update
in a 1TB database.
100-byte entries, 10kB pages => 10 billion entries, 1 billion pages
@ 1 seek per update requires 1000 days!
@ 1 seek per page requires 100 days! (assumes multiple
@ transfer entire db takes 1 day (via sequential read of 1 file)
Summary: Updating B-trees will never scale well using Winchester disks.
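As a back-of-the-envelope check of those figures, here is a short Python sketch. The function and parameter names are made up for illustration, and the entry and page counts are taken from the slide at face value rather than re-derived. Charging one 10ms seek for every entry, or every page, in the database reproduces the slide's orders of magnitude; the `fraction` parameter (my addition) scales the seek-bound cases down when only part of the database is touched.

```python
SECONDS_PER_DAY = 86_400


def update_cost_days(entries=10**10,        # 10 billion entries (from the slide)
                     pages=10**9,           # 1 billion pages (from the slide)
                     db_bytes=10**12,       # 1TB database
                     seek_seconds=0.010,    # 10ms seek
                     xfer_bytes_per_sec=10 * 10**6,  # 10MB/s transfer
                     fraction=1.0):         # share of entries/pages touched
    """Return the three costs from the slide, in days."""
    return {
        "seek_per_update": entries * fraction * seek_seconds / SECONDS_PER_DAY,
        "seek_per_page": pages * fraction * seek_seconds / SECONDS_PER_DAY,
        # A sequential rewrite always reads the whole file, so `fraction`
        # does not apply to it.
        "sequential_transfer": db_bytes / xfer_bytes_per_sec / SECONDS_PER_DAY,
    }


if __name__ == "__main__":
    for name, cost in update_cost_days().items():
        print(f"{name}: {cost:.1f} days")
```

With the defaults this gives about 1157, 116 and 1.2 days. Whatever fraction is used, the sequential pass stays at about a day, which is the point of the comparison.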
In an interview in ACM Queue (in 2003?), Jim Gray said:
"... programmers have to start thinking of the disk as a sequential
device rather than a random access device."
In a pithier form, I've read folks saying, "Disk is the new tape."
(*) Er, it was a summer 2008 presentation to a conference in Helsinki.
I forget the name of the conference, but his presentation slides and
video are available online (not YouTube/Google Video, but I may be