[erlang-questions] Mnesia deadlock with large volume of dirty operations?

Brian Acton <>
Sat Apr 3 01:00:06 CEST 2010


Well, I went ahead and deleted about 82k messages in 10k batches.

I did this over about a 15 minute period.

The good news is that the system has not crashed.

The bad news is that some of the reported table sizes have grown
dangerously close to the 2GB limit, and further that the tables appear to be
wholly inconsistent:

Here is a dump of my 7 nodes

node 1: offline_msg    : with 995638   records occupying 2039455556 bytes on disc
node 2: offline_msg    : with 1015600  records occupying 2097112225 bytes on disc
node 3: offline_msg    : with 995641   records occupying 1797758788 bytes on disc
node 4: offline_msg    : with 1015204  records occupying 2096658267 bytes on disc
node 5: offline_msg    : with 995615   records occupying 1776787268 bytes on disc
node 6: offline_msg    : with 995618   records occupying 1388054291 bytes on disc
node 7: offline_msg    : with 995611   records occupying 1388054291 bytes on disc

Before I started, the nodes were about 1.36GB on disc. Some of them are now
close to the 2GB limit.

The delete operation was initiated on node 5. The message queues on all of
the nodes are zero across the board. My logs are clean and give the
indication that everything proceeded normally.

I think at this point, my only recourse is to restart nodes 1-5 in the hope
that they clone from nodes 6 and 7, providing the best space reclamation....

Anyone else have thoughts on the matter?

--b


On Fri, Apr 2, 2010 at 2:14 PM, Bob Ippolito <> wrote:

> You might want to measure the message_queue_len of the mnesia_tm
> processes (on each node) to see if it's getting behind, and tune your
> waits based on whether the message queues are small/empty or not.
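Bob's check could be sketched as a couple of helper functions. The polling
interval and the idea of a fixed threshold are illustrative assumptions, not
something Bob specified:

```erlang
%% Sketch of Bob's suggestion: inspect mnesia_tm's message queue length
%% on each node, and wait for the backlog to drain before issuing the
%% next batch. The 1-second poll and the Threshold value are guesses --
%% tune them per system.
-module(tm_backlog).
-export([queue_len/1, wait_until_drained/2]).

%% Message queue length of mnesia_tm on Node, or 'undefined' if the
%% process is not running there (or the node is unreachable).
queue_len(Node) ->
    case rpc:call(Node, erlang, whereis, [mnesia_tm]) of
        Pid when is_pid(Pid) ->
            {message_queue_len, Len} =
                rpc:call(Node, erlang, process_info,
                         [Pid, message_queue_len]),
            Len;
        _ ->
            undefined
    end.

%% Block until every reachable node's mnesia_tm queue is at or below
%% Threshold.
wait_until_drained(Nodes, Threshold) ->
    Lens = [queue_len(N) || N <- Nodes],
    case lists:any(fun(L) -> is_integer(L) andalso L > Threshold end,
                   Lens) of
        true ->
            timer:sleep(1000),
            wait_until_drained(Nodes, Threshold);
        false ->
            ok
    end.
```

Calling `wait_until_drained([node() | nodes()], 100)` between batches would
throttle the deletes to whatever the slowest node can absorb.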
>
> On Fri, Apr 2, 2010 at 2:08 PM, Brian Acton <> wrote:
> > Yes. I am using dets (disc_only) tables in mnesia.
> >
> > Since I was able to delete 10k records previously, I think I am going to
> > start with a baseline of 10k records with a 60-second sleep interval.
> > Hopefully this will work successfully. I wish I knew what a more
> > appropriate sleep period would be, as the maintenance is now going to
> > take a very long time.
> >
> > Thanks for your help,
> >
> > --b
> >
> > On Fri, Apr 2, 2010 at 1:55 PM, Dan Gudmundsson <> wrote:
> >
> >> Well, I can't offer much advice, but I would definitely test this on a
> >> non-live system first.
> >>
> >> mnesia:foldl is not the best tool when you are changing a lot of
> >> records: it has to keep every change in memory until you have
> >> traversed the whole table. And it is slow with lots of changes, since
> >> it has to compensate for the things you have done earlier in the
> >> transaction.
> >>
> >> I assume you are using dets (disc_only) since you are afraid of the
> >> 2GB limit, or is it the memory limit on Windows?
> >>
> >> dets is slow; mnesia is primarily a RAM database.
> >>
> >> The only way I see is to chunk through the tables, a couple of
> >> hundred to a thousand records per transaction, and to have code that
> >> can deal with both the new and old format while the database is
> >> being changed.
> >>
> >> Good luck
> >> /Dan
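Dan's chunking advice might look roughly like this: one small transaction per
chunk, deleting a bounded number of expired records at a time. The chunk size,
the 1-second pause, and the assumption that the expire timestamp is the
record's third field (mirroring the match spec later in this thread) are all
illustrative:

```erlang
%% Sketch of chunked deletion: select at most ChunkSize matching records
%% and delete them inside one small transaction, then pause and repeat
%% until nothing matches. Field position 3 (the expire timestamp) and
%% the sleep length are assumptions based on the code in this thread.
delete_expired_in_chunks(Tab, TimeStamp, ChunkSize) ->
    MatchSpec = [{'$1',
                  [{'<', {element, 3, '$1'}, {TimeStamp}}],
                  ['$1']}],
    F = fun() ->
                case mnesia:select(Tab, MatchSpec, ChunkSize, write) of
                    '$end_of_table' ->
                        0;
                    {Recs, _Cont} ->
                        lists:foreach(fun(R) -> mnesia:delete_object(R) end,
                                      Recs),
                        length(Recs)
                end
        end,
    case mnesia:transaction(F) of
        {atomic, 0} ->
            done;
        {atomic, _Deleted} ->
            %% Give mnesia_tm on the peer nodes time to drain before the
            %% next chunk.
            timer:sleep(1000),
            delete_expired_in_chunks(Tab, TimeStamp, ChunkSize)
    end.
```

Because each pass re-selects the first ChunkSize matches and deletes them in
the same transaction, the loop makes progress without holding a whole-table
lock or accumulating the entire change set in memory.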
> >>
> >> On Fri, Apr 2, 2010 at 10:19 PM, Brian Acton <>
> wrote:
> >> > On this particular table, I do not want to delete all entries. This
> >> > is why I posted a separate post to the mailing list. Combining the
> >> > two threads back, I want:
> >> >
> >> > One table: I want to delete entries older than n days.
> >> > Another table: I want to delete all entries.
> >> >
> >> > Both tables are reasonably hot (~1-2 ops per second) and reasonably
> >> > large (> 1.5GB). I'm hitting the 2GB limit and I need to clean up
> >> > these tables.
> >> >
> >> > So far, any attempts at maintenance (as outlined in previous
> >> > emails) have resulted in Mnesia seizing up and bringing down the
> >> > cluster.
> >> >
> >> > It sounds like I have to do this in very small increments, with wait
> >> > time between increments. However, I do not have a method or
> >> > mechanism for determining the size of an increment or the wait time
> >> > between increments. I'm fine doing ten deletes per second if that's
> >> > what it takes. However, I'd like to be able to figure out the
> >> > maximum number of deletes that I can do in the minimum amount of
> >> > time.
> >> >
> >> > I'm definitely open to suggestion on this.
> >> >
> >> > --b
> >> >
> >> > On Fri, Apr 2, 2010 at 1:05 PM, Dan Gudmundsson <>
> wrote:
> >> >
> >> >> clear_table is the fastest way you can delete it, but it will take a
> >> >> while when there is a lot of data.
> >> >>
> >> >> /Dan
> >> >>
> >> >> On Fri, Apr 2, 2010 at 8:22 PM, Brian Acton <>
> wrote:
> >> >> > I'm sorry. I neglected to tell you what I had done on the
> >> >> > previous day.
> >> >> >
> >> >> > On the previous day, I had attempted to delete some old records
> >> >> > using this methodology:
> >> >> >
> >> >> >                mnesia:write_lock_table(offline_msg),
> >> >> >                mnesia:foldl(
> >> >> >                  fun(Rec, _Acc) ->
> >> >> >                          case Rec#offline_msg.expire of
> >> >> >                              never ->
> >> >> >                                  ok;
> >> >> >                              TS ->
> >> >> >                                  if
> >> >> >                                      TS < TimeStamp ->
> >> >> >                                          mnesia:delete_object(Rec);
> >> >> >                                      true ->
> >> >> >                                          ok
> >> >> >                                  end
> >> >> >                          end
> >> >> >                  end, ok, offline_msg)
> >> >> >
> >> >> >
> >> >> > This delete finished on the 1st node but subsequently locked up
> >> >> > all the other nodes on a table lock. The cluster blew up, and my
> >> >> > 24/7 service went into 1 hour of recovery downtime.
> >> >> >
> >> >> > So to recap,
> >> >> >
> >> >> > on day 1 - transaction start, table lock, delete objects -
> >> >> > finished in about 2 minutes
> >> >> > on day 2 - dirty select, dirty delete objects - finished in about
> >> >> > 2 minutes
> >> >> >
> >> >> > In both cases, the cluster blew up and became unusable for at
> >> >> > least 20-30 minutes. After 20-30 minutes, we initiated recovery
> >> >> > protocols.
> >> >> >
> >> >> > Should I try
> >> >> >
> >> >> > day 3 - transaction start, no table lock, delete objects
> >> >> >
> >> >> > Is the table lock too coarse-grained? Considering that the
> >> >> > cluster has blown up twice, I'm obviously a little scared to try
> >> >> > another variation....
> >> >> >
> >> >> > --b
> >> >> >
> >> >> >
> >> >> > On Fri, Apr 2, 2010 at 5:47 AM, Ovidiu Deac <>
> >> >> wrote:
> >> >> >
> >> >> >> To me it sounds like another example of premature optimization
> >> >> >> that went wrong? :)
> >> >> >>
> >> >> >> On Fri, Apr 2, 2010 at 10:19 AM, Dan Gudmundsson <
> >
> >> >> wrote:
> >> >> >> > When you are using dirty operations, every operation is sent
> >> >> >> > separately to all nodes, i.e. 192593*6 messages. A transaction
> >> >> >> > could actually have been faster in this case, with one (large)
> >> >> >> > message containing all the ops sent to each node.
> >> >> >> >
> >> >> >> > What you get is an overloaded mnesia_tm (very long message
> >> >> >> > queues), which does the actual writing of the data on the other
> >> >> >> > (participating) mnesia nodes.
> >> >> >> >
> >> >> >> > So transactions will be blocked waiting on mnesia_tm to process
> >> >> >> > those 200000 messages on the other nodes.
> >> >> >> >
> >> >> >> > /Dan
> >> >> >> >
> >> >> >> > On Fri, Apr 2, 2010 at 1:11 AM, Brian Acton <
> >
> >> >> wrote:
> >> >> >> >> Hi guys,
> >> >> >> >>
> >> >> >> >> I am running R13B04 SMP on FreeBSD 7.3. I have a cluster of 7
> >> nodes
> >> >> >> running
> >> >> >> >> mnesia.
> >> >> >> >>
> >> >> >> >> I have a table of 1196143 records using about 1.504GB of
> >> >> >> >> storage. It's a reasonably hot table doing a fair number of
> >> >> >> >> insert operations at any given time.
> >> >> >> >>
> >> >> >> >> I decided that, since there is a 2GB limit in mnesia, I
> >> >> >> >> should do some cleanup on the system, and specifically on
> >> >> >> >> this table.
> >> >> >> >>
> >> >> >> >> Trying to avoid major problems with Mnesia, transaction load,
> >> >> >> >> and deadlock, I decided to do dirty_select and
> >> >> >> >> dirty_delete_object individually on the records.
> >> >> >> >>
> >> >> >> >> I started slow, deleting first 10, then 100, then 1000, then
> >> >> >> >> 10000, then 100,000 records. My goal was to delete 192593
> >> >> >> >> records total.
> >> >> >> >>
> >> >> >> >> The first five deletions went through nicely and caused
> >> >> >> >> minimal to no impact.
> >> >> >> >>
> >> >> >> >> Unfortunately, the very last delete blew up the system. My
> >> >> >> >> delete command completed successfully, but on the other nodes
> >> >> >> >> it caused mnesia to get stuck on pending transactions, caused
> >> >> >> >> my message queues to fill up, and basically brought down the
> >> >> >> >> whole system. We saw some "mnesia is overloaded" messages in
> >> >> >> >> our logs on these nodes, but did not see a ton of them.
> >> >> >> >>
> >> >> >> >> Does anyone have any clues on what went wrong? I am
> >> >> >> >> attaching my code below for your review.
> >> >> >> >>
> >> >> >> >> --b
> >> >> >> >>
> >> >> >> >> Mnesia configuration tunables:
> >> >> >> >>
> >> >> >> >>      -mnesia no_table_loaders 20
> >> >> >> >>      -mnesia dc_dump_limit 40
> >> >> >> >>      -mnesia dump_log_write_threshold 10000
> >> >> >> >>
> >> >> >> >> Example error message:
> >> >> >> >>
> >> >> >> >> ** WARNING ** Mnesia is overloaded: {mnesia_tm,
> message_queue_len,
> >> >> >> >> [387,842]}
> >> >> >> >>
> >> >> >> >> Sample code:
> >> >> >> >>
> >> >> >> >> Select = fun(Days) ->
> >> >> >> >>         {MegaSecs, Secs, _MicroSecs} = now(),
> >> >> >> >>         T = MegaSecs * 1000000 + Secs - 86400 * Days,
> >> >> >> >>         TimeStamp = {T div 1000000, T rem 1000000, 0},
> >> >> >> >>         mnesia:dirty_select(offline_msg,
> >> >> >> >>                     [{'$1',
> >> >> >> >>                       [{'<', {element, 3, '$1'},
> >> >> >> >>                     {TimeStamp} }],
> >> >> >> >>                       ['$1']}])
> >> >> >> >>     end.
> >> >> >> >>
> >> >> >> >> Count = fun(Days) -> length(Select(Days)) end.
> >> >> >> >>
> >> >> >> >> Delete = fun(Days, Total) ->
> >> >> >> >>         C = Select(Days),
> >> >> >> >>         D = lists:sublist(C, Total),
> >> >> >> >>         lists:foreach(fun(Rec) ->
> >> >> >> >>                       ok = mnesia:dirty_delete_object(Rec)
> >> >> >> >>                   end,
> >> >> >> >>                   D),
> >> >> >> >>         length(D)
> >> >> >> >>     end.
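A batched variant of the Delete fun above, along the lines discussed earlier
in the thread (a fixed batch size with a sleep between batches), might look
like this. It is a sketch written as module functions for clarity;
`select_expired/1` stands in for the Select fun above, and note that it
re-runs the full select scan on every pass, which is simple but not cheap:

```erlang
%% Sketch: dirty-delete expired offline_msg records in batches, sleeping
%% between batches so mnesia_tm on the other nodes can keep up. The
%% batch size and sleep interval follow the plan discussed in the thread
%% (e.g. 10000 records and 60000 ms); select_expired/1 is assumed to be
%% the same match-spec query as the Select fun above.
delete_batched(Days, BatchSize, SleepMs) ->
    case lists:sublist(select_expired(Days), BatchSize) of
        [] ->
            done;
        Batch ->
            lists:foreach(fun(Rec) ->
                                  ok = mnesia:dirty_delete_object(Rec)
                          end, Batch),
            timer:sleep(SleepMs),
            delete_batched(Days, BatchSize, SleepMs)
    end.
```

Combined with a check on the mnesia_tm message queue lengths before each
batch, the sleep could be replaced by waiting for the backlog to drain rather
than a fixed interval.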
> >> >> >> >>
> >> >> >> >
> >> >> >>
> >> >> >
> >> >>
> >> >
> >>
> >
>


More information about the erlang-questions mailing list