Performance with large queues

Sat Mar 20 02:05:15 CET 2010

Hi list.

I have a bit of an involved issue, and I'm not even entirely sure what
the question I need to ask is, so I'll explain our setup, the problem,
my theory and why I'm not 100% convinced I'm right :)

We have a couple of what I'll call "high load" processes.  At peak times
these processes deal with a reasonably high number of messages (hundreds
per second at least - I haven't measured exactly) and do a non-trivial
amount of work on each one - most notably several mnesia operations, all
contained within a single transaction.  They are implemented as
gen_servers.  Most of the time, although they chew up a fair bit of CPU
(maybe 50% of one core on a 4 core box at peak time), they chug along
quite happily and keep up with what's being fed to them.   Twice in the
last month, however, one of them has gotten into a state that has ended
with a queue getting so big that it's exhausted the memory and crashed
the VM.  Now that in itself wouldn't be a mystery - we've encountered it
before with processes that simply can't service their queue fast enough
and had that been the root of the issue then I'm quite happy that I know
how to go about fixing it.

What's different in this case, however, is that once the queue passes a
certain length (I can't say how long exactly - I've inferred most of
this from crash dumps, CPU and memory use graphs and so on) the
performance of the process drops drastically to the point that it's
serving well under one message per second and even over the course of a
night, where load drops to near-negligible levels, it doesn't even come
close to catching up and clearing the queue.  (In the most recent case
CPU and memory use started climbing at 2pm one day, memory use levelled
out overnight (with the CPU still maxed out), then continued to climb
the next day before crashing the VM at about 1:30pm).  From the logs, it
appears that some messages are served quite quickly, but those resulting
in mnesia operations are, by midnight, taking anything from 1 up to ~30
seconds /each/.  The mnesia tables in question are rarely contended
(there's one other process that uses them once a minute), so it's not
that we're waiting for a contended lock.

So, my theory: I realised that, even though our code doesn't explicitly
do a selective receive (and so can always just grab the first message on
the queue) mnesia probably /does/ do one to get locks and that, in all
likelihood, the cost of a selective receive goes O(N) with the length of
the queue.  I imagine that once the queue has passed a certain point
those selective receives increase the load on our process to the point
that it can't keep up.  By the time the input load has died back down at
night, the queue is so long (~1M messages) that it's taking a serious
amount of time to traverse it, meaning that even over many hours the
queue isn't significantly shrunk down (a queue of one million messages
being processed at one message per second is still going to take 277
hours to clear).

So why do I think I might be wrong?  Actually, having written all this
down, I'm now less convinced that I am :)  It's more that I don't want
to have missed some other possibility (garbage collection?  Something in
the internals of gen_server?) or make pronouncements/decisions based on
a theory that's flawed in some way I can't see for myself.

I've reworked the system so that incoming messages are delivered by
gen_server:call which keeps the queues on the mnesia-using processes to
a minimum and so far testing has looked pretty good.

So I guess my question is, does my theory stack up to what people
familiar with the internals know?

(By the way - I'm aware that the system as it's described here is kind
of terrible.  I've recently rewritten the bits in question to avoid
mnesia entirely and the load is way down.  Unfortunately I'm in that
position that every developer hates where I have to support an
old-and-busted system for a while before the awesome new one has been
through QA and into production.  Also I just want to make sure I have
the best possible understanding of things that I can :))

Thanks very much if you read this far :)

Cheers,

Bernard