[erlang-questions] Mnesia deadlock with large volume of dirty operations?

Fri Apr 2 22:19:09 CEST 2010

On this particular table, I do not want to delete all entries. This is why I
posted a separate message to the mailing list. Bringing the two threads back
together, here is what I want:

On one table, I want to delete entries older than n days.
On another table, I want to delete all entries.

Both tables are reasonably hot (~1-2 ops per second) and reasonably large
(> 1.5GB). I'm hitting the 2GB limit and I need to clean up these tables.

So far, any attempts at maintenance (as outlined in previous emails) have
resulted in Mnesia seizing up and bringing down the cluster.

It sounds like I have to do this in very small increments with a wait time
between increments. However, I do not have a method for determining the size
of an increment or the wait time between increments. I'm fine doing ten
deletes per second if that's what it takes. However, I'd like to be able to
figure out the maximum number of deletes that I can do in the minimum amount
of time.
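
To make the "small increments with a wait in between" idea concrete, here is
a rough sketch of the kind of loop I have in mind. It is only an illustration
under assumptions, not something I have run against the cluster. Each
increment goes into one transaction (following Dan's point, quoted below, that
a transaction sends one message per node instead of one per operation), and
the next increment is held back until the mnesia_tm message queue on every
running node has drained below a threshold. The module name, the function
names, and the BatchSize and MaxQueue parameters are all made up for this
example. Treating an unreachable node as having an empty queue is also just an
assumption.

```erlang
-module(throttled_delete).
-export([delete_in_batches/3, split_batch/2, tm_queue_len/1]).

%% Hypothetical helper: delete already-selected records in batches,
%% one transaction per batch, waiting for mnesia_tm to drain in between.
delete_in_batches([], _BatchSize, _MaxQueue) ->
    ok;
delete_in_batches(Recs, BatchSize, MaxQueue) ->
    {Batch, Rest} = split_batch(BatchSize, Recs),
    {atomic, ok} =
        mnesia:transaction(
          fun() -> lists:foreach(fun mnesia:delete_object/1, Batch) end),
    wait_for_drain(MaxQueue),
    delete_in_batches(Rest, BatchSize, MaxQueue).

%% Split off at most N elements from the front of the list.
split_batch(N, List) when length(List) =< N -> {List, []};
split_batch(N, List) -> lists:split(N, List).

%% Poll once per second until the longest mnesia_tm queue among the
%% running db nodes is at or below MaxQueue.
wait_for_drain(MaxQueue) ->
    Nodes = mnesia:system_info(running_db_nodes),
    case lists:max([0 | [tm_queue_len(N) || N <- Nodes]]) of
        Len when Len > MaxQueue ->
            timer:sleep(1000),
            wait_for_drain(MaxQueue);
        _ ->
            ok
    end.

%% Message queue length of mnesia_tm on Node.
%% Assumption: an unreachable node or missing process counts as 0.
tm_queue_len(Node) ->
    case rpc:call(Node, erlang, whereis, [mnesia_tm]) of
        Pid when is_pid(Pid) ->
            case rpc:call(Node, erlang, process_info,
                          [Pid, message_queue_len]) of
                {message_queue_len, Len} -> Len;
                _ -> 0
            end;
        _ ->
            0
    end.
```

With the Select fun from my first mail, a call might look like
throttled_delete:delete_in_batches(Select(30), 500, 100), where 30 days, 500
records per batch, and a queue threshold of 100 are arbitrary starting values
that would have to be tuned by watching the queue lengths.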

I'm definitely open to suggestions on this.

--b

On Fri, Apr 2, 2010 at 1:05 PM, Dan Gudmundsson <dgud@REDACTED> wrote:

> clear_table is the fastest way you can delete it, but it will take a
> while when there is a lot of data.
>
> /Dan
>
> On Fri, Apr 2, 2010 at 8:22 PM, Brian Acton <acton@REDACTED> wrote:
> > I'm sorry. I neglected to tell you what I had done on the previous day.
> >
> > On the previous day, I had attempted to delete some old records using
> > this methodology:
> >
> >                mnesia:write_lock_table(offline_msg),
> >                mnesia:foldl(
> >                  fun(Rec, _Acc) ->
> >                          case Rec#offline_msg.expire of
> >                              never ->
> >                                  ok;
> >                              TS ->
> >                                  if
> >                                      TS < TimeStamp ->
> >                                          mnesia:delete_object(Rec);
> >                                      true ->
> >                                          ok
> >                                  end
> >                          end
> >                  end, ok, offline_msg)
> >
> >
> > This delete finished on the 1st node but subsequently locked up all the
> > other nodes on a table lock. The cluster blew up and my 24/7 service went
> > into 1 hr of downtime and recovery.
> >
> > So to recap,
> >
> > on day 1 - transaction start, table lock, delete objects - finished in
> > about 2 minutes
> > on day 2 - dirty select, dirty delete objects - finished in about 2
> > minutes
> >
> > In both cases, the cluster blew up and became unusable for at least 20-30
> > minutes. After 20-30 minutes, we initiated recovery protocols.
> >
> > Should I try
> >
> > day 3 - transaction start, no table lock, delete objects
> >
> > ? Is the table lock too coarse-grained? Considering that the cluster has
> > blown up twice, I'm obviously a little scared to try another variation...
> >
> > --b
> >
> >
> > On Fri, Apr 2, 2010 at 5:47 AM, Ovidiu Deac <ovidiudeac@REDACTED> wrote:
> >
> >> To me it sounds like another example of premature optimization which
> >> went wrong? :)
> >>
> >> On Fri, Apr 2, 2010 at 10:19 AM, Dan Gudmundsson <dgud@REDACTED> wrote:
> >> > When you use dirty operations, every operation is sent separately to
> >> > all nodes, i.e. 192593*6 messages. A transaction could actually have
> >> > been faster in this case, with one (large) message containing all ops
> >> > sent to each node.
> >> >
> >> > What you get is an overloaded mnesia_tm (very long msg queues), which
> >> > does the actual writing of the data on the other (participating)
> >> > mnesia nodes.
> >> >
> >> > So transactions will be blocked waiting on mnesia_tm to process those
> >> > 200000 messages on the other nodes.
> >> >
> >> > /Dan
> >> >
> >> > On Fri, Apr 2, 2010 at 1:11 AM, Brian Acton <acton@REDACTED> wrote:
> >> >> Hi guys,
> >> >>
> >> >> I am running R13B04 SMP on FreeBSD 7.3. I have a cluster of 7 nodes
> >> >> running mnesia.
> >> >>
> >> >> I have a table of 1196143 records using about 1.504GB of storage. It's
> >> >> a reasonably hot table doing a fair number of insert operations at any
> >> >> given time.
> >> >>
> >> >> I decided that since there was a 2GB limit in mnesia that I should do
> >> >> some cleanup on the system and specifically this table.
> >> >>
> >> >> Trying to avoid major problems with Mnesia, transaction load, and
> >> >> deadlock, I decided to do dirty_select and dirty_delete_object
> >> >> individually on the records.
> >> >>
> >> >> I started slow, deleting first 10, then 100, then 1000, then 10000,
> >> >> then 100,000 records. My goal was to delete 192593 records total.
> >> >>
> >> >> The first five deletions went through nicely and caused minimal to no
> >> >> impact.
> >> >>
> >> >> Unfortunately, the very last delete blew up the system. My delete
> >> >> command completed successfully, but on the other nodes it caused
> >> >> mnesia to get stuck on pending transactions, caused my message queues
> >> >> to fill up, and basically brought down the whole system. We saw some
> >> >> "Mnesia is overloaded" messages in our logs on these nodes but did
> >> >> not see a ton of them.
> >> >>
> >> >> Does anyone have any clues on what went wrong? I am attaching my code
> >> >> below for your review.
> >> >>
> >> >> --b
> >> >>
> >> >> Mnesia configuration tunables:
> >> >>
> >> >>      -mnesia no_table_loaders 20
> >> >>      -mnesia dc_dump_limit 40
> >> >>      -mnesia dump_log_write_threshold 10000
> >> >>
> >> >> Example error message:
> >> >>
> >> >> ** WARNING ** Mnesia is overloaded: {mnesia_tm, message_queue_len,
> >> >> [387,842]}
> >> >>
> >> >> Sample code:
> >> >>
> >> >> Select = fun(Days) ->
> >> >>         {MegaSecs, Secs, _MicroSecs} = now(),
> >> >>         T = MegaSecs * 1000000 + Secs - 86400 * Days,
> >> >>         TimeStamp = {T div 1000000, T rem 1000000, 0},
> >> >>         mnesia:dirty_select(offline_msg,
> >> >>                     [{'$1',
> >> >>                       [{'<', {element, 3, '$1'},
> >> >>                     {TimeStamp} }],
> >> >>                       ['$1']}])
> >> >>     end.
> >> >>
> >> >> Count = fun(Days) -> length(Select(Days)) end.
> >> >>
> >> >> Delete = fun(Days, Total) ->
> >> >>         C = Select(Days),
> >> >>         D = lists:sublist(C, Total),
> >> >>         lists:foreach(fun(Rec) ->
> >> >>                       ok = mnesia:dirty_delete_object(Rec)
> >> >>                   end,
> >> >>                   D),
> >> >>         length(D)
> >> >>     end.
> >> >>