[erlang-questions] Performance with large queues

Sat Mar 20 02:56:26 CET 2010

Selective receive isn't super-slow if you are picking up messages at the top of the queue.  Remember that the mailbox is ordered, though.  It's possible that the process mailbox is collecting messages that never, ever get consumed.  In that case, memory usage would climb slowly, and receives would gradually slow down.  You'd probably see a tipping point like you describe.

It so happens, that you can use erlang:process_info(Pid,messages) should give you a copy of the mailbox.  Although, on a loaded system, perhaps you should just do message_queue_len, as the actual mailbox could be huge.

Alternatively, does this behavior happen if you restart the service every, say, four hours?  That would clearly indicate a problem related to accumulating one resource, such as messages or GC problems.  This might be particularly useful because, assuming you just restart that process, it would also rule out any accumulation problem in Mnesia itself.

Good luck!

On Mar 19, 2010, at 6:05 PM, Bernard Duggan wrote:

> Hi list.
> 
> I have a bit of an involved issue, and I'm not even entirely sure what
> the question I need to ask is, so I'll explain our setup, the problem,
> my theory and why I'm not 100% convinced I'm right :)
> 
> We have a couple of what I'll call "high load" processes.  At peak times
> these processes deal with a reasonably high number of messages (hundreds
> per second at least - I haven't measured exactly) and do a non-trivial
> amount of work on each one - most notably several mnesia operations, all
> contained within a single transaction.  They are implemented as
> gen_servers.  Most of the time, although they chew up a fair bit of CPU
> (maybe 50% of one core on a 4 core box at peak time), they chug along
> quite happily and keep up with what's being fed to them.   Twice in the
> last month, however, one of them has gotten into a state that has ended
> with a queue getting so big that it's exhausted the memory and crashed
> the VM.  Now that in itself wouldn't be a mystery - we've encountered it
> before with processes that simply can't service their queue fast enough
> and had that been the root of the issue then I'm quite happy that I know
> how to go about fixing it.
> 
> What's different in this case, however, is that once the queue passes a
> certain length (I can't say how long exactly - I've inferred most of
> this from crash dumps, CPU and memory use graphs and so on) the
> performance of the process drops drastically to the point that it's
> serving well under one message per second and even over the course of a
> night, where load drops to near-negligible levels, it doesn't even come
> close to catching up and clearing the queue.  (In the most recent case
> CPU and memory use started climbing at 2pm one day, memory use levelled
> out overnight (with the CPU still maxed out), then continued to climb
> the next day before crashing the VM at about 1:30pm).  From the logs, it
> appears that some messages are served quite quickly, but those resulting
> in mnesia operations are, by midnight, taking anything from 1 up to ~30
> seconds /each/.  The mnesia tables in question are rarely contended
> (there's one other process that uses them once a minute), so it's not
> that we're waiting for a contended lock.
> 
> So, my theory: I realised that, even though our code doesn't explicitly
> do a selective receive (and so can always just grab the first message on
> the queue) mnesia probably /does/ do one to get locks and that, in all
> likelihood, the cost of a selective receive goes O(N) with the length of
> the queue.  I imagine that once the queue has passed a certain point
> those selective receives increase the load on our process to the point
> that it can't keep up.  By the time the input load has died back down at
> night, the queue is so long (~1M messages) that it's taking a serious
> amount of time to traverse it, meaning that even over many hours the
> queue isn't significantly shrunk down (a queue of one million messages
> being processed at one message per second is still going to take 277
> hours to clear).
> 
> So why do I think I might be wrong?  Actually, having written all this
> down, I'm now less convinced that I am :)  It's more that I don't want
> to have missed some other possibility (garbage collection?  Something in
> the internals of gen_server?) or make pronouncements/decisions based on
> a theory that's flawed in some way I can't see for myself.
> 
> I've reworked the system so that incoming messages are delivered by
> gen_server:call which keeps the queues on the mnesia-using processes to
> a minimum and so far testing has looked pretty good.
> 
> So I guess my question is, does my theory stack up to what people
> familiar with the internals know?
> 
> (By the way - I'm aware that the system as it's described here is kind
> of terrible.  I've recently rewritten the bits in question to avoid
> mnesia entirely and the load is way down.  Unfortunately I'm in that
> position that every developer hates where I have to support an
> old-and-busted system for a while before the awesome new one has been
> through QA and into production.  Also I just want to make sure I have
> the best possible understanding of things that I can :))
> 
> Thanks very much if you read this far :)
> 
> Cheers,
> 
> Bernard
> 
> ________________________________________________________________
> erlang-questions (at) erlang.org mailing list.
> See http://www.erlang.org/faq.html
> To unsubscribe; mailto:erlang-questions-unsubscribe@REDACTED
> 

-- 
Jayson Vantuyl
kagato@REDACTED