[erlang-questions] Performance with large queues

Sat Mar 20 02:38:10 CET 2010

I can't speak to mnesia but gen_servers do

On Mar 19, 2010, at 9:05 PM, Bernard Duggan <bernie@REDACTED> wrote:

> Hi list.
>
> I have a bit of an involved issue, and I'm not even entirely sure what
> the question I need to ask is, so I'll explain our setup, the problem,
> my theory and why I'm not 100% convinced I'm right :)
>
> We have a couple of what I'll call "high load" processes.  At peak  
> times
> these processes deal with a reasonably high number of messages  
> (hundreds
> per second at least - I haven't measured exactly) and do a non-trivial
> amount of work on each one - most notably several mnesia operations,  
> all
> contained within a single transaction.  They are implemented as
> gen_servers.  Most of the time, although they chew up a fair bit of  
> CPU
> (maybe 50% of one core on a 4 core box at peak time), they chug along
> quite happily and keep up with what's being fed to them.   Twice in  
> the
> last month, however, one of them has gotten into a state that has  
> ended
> with a queue getting so big that it's exhausted the memory and crashed
> the VM.  Now that in itself wouldn't be a mystery - we've  
> encountered it
> before with processes that simply can't service their queue fast  
> enough
> and had that been the root of the issue then I'm quite happy that I  
> know
> how to go about fixing it.
>
> What's different in this case, however, is that once the queue  
> passes a
> certain length (I can't say how long exactly - I've inferred most of
> this from crash dumps, CPU and memory use graphs and so on) the
> performance of the process drops drastically to the point that it's
> serving well under one message per second and even over the course  
> of a
> night, where load drops to near-negligible levels, it doesn't even  
> come
> close to catching up and clearing the queue.  (In the most recent case
> CPU and memory use started climbing at 2pm one day, memory use  
> levelled
> out overnight (with the CPU still maxed out), then continued to climb
> the next day before crashing the VM at about 1:30pm).  From the  
> logs, it
> appears that some messages are served quite quickly, but those  
> resulting
> in mnesia operations are, by midnight, taking anything from 1 up to  
> ~30
> seconds /each/.  The mnesia tables in question are rarely contended
> (there's one other process that uses them once a minute), so it's not
> that we're waiting for a contended lock.
>
> So, my theory: I realised that, even though our code doesn't  
> explicitly
> do a selective receive (and so can always just grab the first  
> message on
> the queue) mnesia probably /does/ do one to get locks and that, in all
> likelihood, the cost of a selective receive goes O(N) with the  
> length of
> the queue.  I imagine that once the queue has passed a certain point
> those selective receives increase the load on our process to the point
> that it can't keep up.  By the time the input load has died back  
> down at
> night, the queue is so long (~1M messages) that it's taking a serious
> amount of time to traverse it, meaning that even over many hours the
> queue isn't significantly shrunk down (a queue of one million messages
> being processed at one message per second is still going to take 277
> hours to clear).
>
> So why do I think I might be wrong?  Actually, having written all this
> down, I'm now less convinced that I am :)  It's more that I don't want
> to have missed some other possibility (garbage collection? Something  
> in
> the internals of gen_server?) or make pronouncements/decisions based  
> on
> a theory that's flawed in some way I can't see for myself.
>
> I've reworked the system so that incoming messages are delivered by
> gen_server:call which keeps the queues on the mnesia-using processes  
> to
> a minimum and so far testing has looked pretty good.
>
> So I guess my question is, does my theory stack up to what people
> familiar with the internals know?
>
> (By the way - I'm aware that the system as it's described here is kind
> of terrible.  I've recently rewritten the bits in question to avoid
> mnesia entirely and the load is way down.  Unfortunately I'm in that
> position that every developer hates where I have to support an
> old-and-busted system for a while before the awesome new one has been
> through QA and into production.  Also I just want to make sure I have
> the best possible understanding of things that I can :))
>
> Thanks very much if you read this far :)
>
> Cheers,
>
> Bernard
>
> ________________________________________________________________
> erlang-questions (at) erlang.org mailing list.
> See http://www.erlang.org/faq.html
> To unsubscribe; mailto:erlang-questions-unsubscribe@REDACTED
>