[erlang-questions] Performance with large queues

Sat Mar 20 23:32:55 CET 2010

Hi Bernard,

On 20 March 2010 at 02:05, Bernard Duggan wrote:

> Hi list.
> 
> I have a bit of an involved issue, and I'm not even entirely sure what
> the question I need to ask is, so I'll explain our setup, the problem,
> my theory and why I'm not 100% convinced I'm right :)
> 
> We have a couple of what I'll call "high load" processes.  At peak times
> these processes deal with a reasonably high number of messages (hundreds
> per second at least - I haven't measured exactly) and do a non-trivial
> amount of work on each one - most notably several mnesia operations, all
> contained within a single transaction.  They are implemented as
> gen_servers.  Most of the time, although they chew up a fair bit of CPU
> (maybe 50% of one core on a 4 core box at peak time), they chug along
> quite happily and keep up with what's being fed to them.   Twice in the
> last month, however, one of them has gotten into a state that has ended
> with a queue getting so big that it's exhausted the memory and crashed
> the VM.  Now that in itself wouldn't be a mystery - we've encountered it
> before with processes that simply can't service their queue fast enough
> and had that been the root of the issue then I'm quite happy that I know
> how to go about fixing it.
> 
> What's different in this case, however, is that once the queue passes a
> certain length (I can't say how long exactly - I've inferred most of
> this from crash dumps, CPU and memory use graphs and so on) the
> performance of the process drops drastically to the point that it's
> serving well under one message per second and even over the course of a
> night, where load drops to near-negligible levels, it doesn't even come
> close to catching up and clearing the queue.  (In the most recent case
> CPU and memory use started climbing at 2pm one day, memory use levelled
> out overnight (with the CPU still maxed out), then continued to climb
> the next day before crashing the VM at about 1:30pm).  From the logs, it
> appears that some messages are served quite quickly, but those resulting
> in mnesia operations are, by midnight, taking anything from 1 up to ~30
> seconds /each/.  The mnesia tables in question are rarely contended
> (there's one other process that uses them once a minute), so it's not
> that we're waiting for a contended lock.
> 
> So, my theory: I realised that, even though our code doesn't explicitly
> do a selective receive (and so can always just grab the first message on
> the queue) mnesia probably /does/ do one to get locks and that, in all
> likelihood, the cost of a selective receive goes O(N) with the length of
> the queue.  I imagine that once the queue has passed a certain point
> those selective receives increase the load on our process to the point
> that it can't keep up.  By the time the input load has died back down at
> night, the queue is so long (~1M messages) that it's taking a serious
> amount of time to traverse it, meaning that even over many hours the
> queue isn't significantly shrunk down (a queue of one million messages
> being processed at one message per second is still going to take 277
> hours to clear).
> 
> So why do I think I might be wrong?  Actually, having written all this
> down, I'm now less convinced that I am :)  It's more that I don't want
> to have missed some other possibility (garbage collection?  Something in
> the internals of gen_server?) or make pronouncements/decisions based on
> a theory that's flawed in some way I can't see for myself.
> 
> I've reworked the system so that incoming messages are delivered by
> gen_server:call which keeps the queues on the mnesia-using processes to
> a minimum and so far testing has looked pretty good.
> 
> So I guess my question is, does my theory stack up to what people
> familiar with the internals know?
> 
> (By the way - I'm aware that the system as it's described here is kind
> of terrible.  I've recently rewritten the bits in question to avoid
> mnesia entirely and the load is way down.  Unfortunately I'm in that
> position that every developer hates where I have to support an
> old-and-busted system for a while before the awesome new one has been
> through QA and into production.  Also I just want to make sure I have
> the best possible understanding of things that I can :))
> 
> Thanks very much if you read this far :)
> 
> Cheers,
> 
> Bernard

I think your analysis is correct (and, as others have said, the selective receives are done by gen_servers). The issue was nicely explained (with an easily reproducible test, if you're running Linux) by Pascal Brisset 5 years ago:

http://www.erlang.org/cgi-bin/ezmlm-cgi/4/17758
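
To make the O(N) cost concrete, here is a toy model in Python (not Erlang, and not how the VM actually implements mailboxes). It assumes the mailbox is a simple list and that a selective receive inspects messages from the head until one matches. The names selective_receive and scan_cost are made up for this illustration.

```python
def selective_receive(mailbox, predicate):
    """Remove and return the first matching message, plus how many
    messages had to be inspected to find it (toy model only)."""
    for inspected, msg in enumerate(mailbox, start=1):
        if predicate(msg):
            del mailbox[inspected - 1]
            return msg, inspected
    return None, len(mailbox)


def scan_cost(backlog):
    """Messages inspected to find a reply that arrived behind
    `backlog` unrelated messages already waiting in the queue."""
    mailbox = [("work", i) for i in range(backlog)]
    reply = ("reply", "ref-1")      # hypothetical reply to an internal call
    mailbox.append(reply)           # new messages go to the back
    _, inspected = selective_receive(mailbox, lambda m: m == reply)
    return inspected


if __name__ == "__main__":
    for backlog in (10, 1000, 100000):
        print(backlog, scan_cost(backlog))
```

In this model every selective receive walks the whole backlog before it reaches its reply, so the work done per handled message grows with the queue length. If handling one message involves several such receives, a process that falls behind gets slower the further behind it is, which matches what you observed.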

Best regards,

Dominic Williams
http://dominicwilliams.net

---