dets improvements

RCB <>
Thu Jun 15 20:30:18 CEST 2006

>Are there any database gurus out there that know how the rebuilding
>cost in dets compares to the disc storage in

These are WAL-based, ARIES-style database systems (well, I'm not sure
about MySQL MyISAM tables, but InnoDB is.)  Changes destined for
persistent storage are first placed into a write-ahead log so that
they are durable.  "Rebuilding" after an unexpected interruption is
typically a matter of replaying logged persistent changes into the
database file(s) (and backing out uncommitted changes.)  It does not
typically require a full sequential sweep and rebuild.  When things
get to that point, it's time to start pulling out backups and rolling
forward journals.
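To make the idea concrete, here is a minimal write-ahead-log sketch
in Erlang using disk_log.  This is illustrative only -- the module,
function names, and the ets-backed "database" are my own inventions,
not dets internals: each change is logged durably before it is
applied, and recovery simply replays the log.

```erlang
%% wal_sketch: a toy write-ahead log.  Illustrative only; the real
%% systems discussed above do considerably more (LSNs, checkpoints,
%% undo records for uncommitted changes, etc.)
-module(wal_sketch).
-export([open/1, put/3, recover/0]).

open(File) ->
    %% The ets table stands in for the paged database file.
    ets:new(wal_state, [named_table, public, set]),
    {ok, Log} = disk_log:open([{name, wal}, {file, File}]),
    Log.

put(Log, Key, Val) ->
    ok = disk_log:log(Log, {put, Key, Val}),  % log first...
    ok = disk_log:sync(Log),                  % ...make it durable...
    apply_change({put, Key, Val}).            % ...then apply

recover() ->
    %% After a crash: replay every logged change into the "database".
    replay(wal, start).

replay(Log, Cont) ->
    case disk_log:chunk(Log, Cont) of
        eof -> ok;
        {Next, Terms} ->
            lists:foreach(fun apply_change/1, Terms),
            replay(Log, Next)
    end.

apply_change({put, Key, Val}) ->
    true = ets:insert(wal_state, {Key, Val}),
    ok.
```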

>> The major problem with Dets as I see it is that the memory allocation
>> scheme (a buddy system) is kept in RAM. For a fragmented table with
>> millions of objects, the RAM data can amount to several megabytes.
>> When closing or syncing a table, this (possibly huge) data structure
>> has to be written to disc.

>Is there any way to 1) measure the level of this fragmentation and 2)
>to manage/reduce it via maintenance operations (preferably, without
>taking offline the whole database/table)? I think this monitoring and
>maintenance aspect needs to be documented somewhere, because there
>will always be a fear that it may run out of control.

1) I do this by finding the pid of the dets:open_file_loop2 and using
erlang:process_info(), which should give you an idea of the memory
consumption associated with an open dets table.  These tests can be
made systematically.  Check the heap size and also the process
dictionary, which dets apparently uses to store file segments.
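For example (a sketch; table and file names here are made up, and the
exact process_info items worth watching may vary by OTP release), you
can ask dets for the owning process and inspect it directly rather
than hunting for the pid by hand:

```erlang
%% Sketch: gauge the RAM cost of an open dets table by inspecting
%% the process that owns it.
{ok, Tab} = dets:open_file(my_tab, [{file, "/tmp/my_tab.dets"}]),
Owner = dets:info(Tab, owner),
%% heap_size and the process dictionary (where dets apparently keeps
%% segment data) are the interesting parts:
Info = erlang:process_info(Owner, [memory, heap_size, dictionary]),
io:format("~p~n", [Info]).
```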

2) From the documentation:
"The current implementation keeps the entire buddy system in RAM,
which implies that if the table gets heavily fragmented, quite some
memory can be used up. The only way to defragment a table is to close
it and then open it again with the repair option set to force."

The table is unavailable during the open_file() with {repair, force}
set.  As discussed, this operation is fairly heavyweight: somewhat
similar to a PostgreSQL/MVCC vacuum sweep, but of course those
systems remain online.
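In code, the close-and-reopen cycle from the documentation looks
something like this (a sketch; the function name is mine, and the
table is unavailable between the close and the open):

```erlang
%% Sketch: defragment a dets table by closing it and reopening it
%% with {repair, force}, per the documentation quoted above.
defrag(Tab, File) ->
    ok = dets:close(Tab),
    %% The forced repair rebuilds the file, compacting the buddy
    %% system; expect this to be slow for large tables.
    {ok, Tab2} = dets:open_file(Tab, [{file, File}, {repair, force}]),
    Tab2.
```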

It might be possible to do this manually, at run-time, by using the
bchunk()/init_table() mechanism -- or some other traversal (maybe
traverse()).  The docs say mnesia uses this, so the answer might be
there.  I have no direct experience with bchunk, so I'm not certain
of the overhead, but you can bet that a traverse()/insert() construct
will be quite expensive relative to a sequential rebuild.
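A rough sketch of the bchunk()/init_table() idea (untested on my
part; the function names compact/3 and input/2 are mine, and I'm
assuming the table should be fixed while bchunk traverses it):

```erlang
%% Sketch: copy a live dets table into a fresh (hence compact) file
%% using dets:bchunk/2 feeding dets:init_table/3 in bchunk format.
compact(Tab, NewName, NewFile) ->
    ok = dets:safe_fixtable(Tab, true),
    {ok, New} = dets:open_file(NewName,
                               [{file, NewFile},
                                {type, dets:info(Tab, type)},
                                {keypos, dets:info(Tab, keypos)}]),
    ok = dets:init_table(New, input(Tab, start), [{format, bchunk}]),
    ok = dets:safe_fixtable(Tab, false),
    New.

%% init_table/3 pulls data by calling the fun with 'read'; we answer
%% with the next bchunk and a continuation fun.
input(Tab, State) ->
    fun(close) -> ok;
       (read) ->
            case dets:bchunk(Tab, State) of
                '$end_of_table' -> end_of_input;
                {error, _} = E  -> E;
                {Cont, Data}    -> {Data, input(Tab, Cont)}
            end
    end.
```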


Yariv - keep in mind that some rdbms systems actually keep blobs
OUTSIDE of the paged files, and instead store them as files in a
directory tree on the host filesystem.  Some systems, even ones that
support TOAST, actually keep the large objects in ONE table per
database, which obviously does not scale well.  Keeping large binary
objects in a paged transactional database incurs a -significant-
amount of i/o overhead relative to just writing the binary assets on
a filesystem and managing them some other way (and doesn't Erlang
give us so many useful mechanisms for designing these management
frameworks?)  There are just so many valid arguments against
shoehorning large blobs into a paged transactional database (IMO.)
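The blobs-outside-the-database pattern is trivial to sketch in
Erlang: store the binary on the filesystem and keep only the path
(plus whatever metadata you need) in dets.  The directory layout and
function names below are made up for illustration:

```erlang
%% Sketch: large binaries live on the filesystem; dets holds only
%% {Key, Path, Size} metadata.  BlobDir is assumed to exist.
put_blob(Tab, BlobDir, Key, Bin) when is_binary(Bin) ->
    Path = filename:join(BlobDir, Key),
    ok = file:write_file(Path, Bin),
    ok = dets:insert(Tab, {Key, Path, byte_size(Bin)}).

get_blob(Tab, Key) ->
    [{Key, Path, _Size}] = dets:lookup(Tab, Key),
    file:read_file(Path).
```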

Back to Erlang, though: there are some other detriments to using
large dets tables (even fragmented) with mnesia.  There are some
detriments to using even plain-vanilla dets outside of mnesia, but
many of these can be worked around.  If you're looking at supporting
large datasets in a real system then there is a very real possibility
that you will find yourself implementing creative solutions to work
around some of the
More information about the erlang-questions mailing list