[erlang-questions] Performance with large queues

Sat Mar 20 02:44:17 CET 2010

Oopsie on the phone...

Gen_servers do a lot of selective receives, to the point that someone
wrote an optimized version called gen_server2. From what I understand,
it basically drains the process's mailbox into its own queue and
processes messages from there. I've heard of some pretty decent
performance improvements with it, but I haven't measured it myself.
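
As a rough illustration of why draining helps, here is a toy model in
Python (not Erlang, and not gen_server2's actual code). It assumes a
selective receive scans the mailbox front to back until a message
matches. The names selective_receive, scan_cost and drained_scan_cost
are made up for this sketch.

```python
from collections import deque


def selective_receive(mailbox, wanted):
    """Scan the mailbox front to back and remove the first match.

    Returns (message, messages_inspected). If nothing matches, the
    message is None and every entry was inspected.
    """
    for index, msg in enumerate(mailbox):
        if wanted(msg):
            break
    else:
        return None, len(mailbox)
    del mailbox[index]
    return msg, index + 1


def is_reply(msg):
    return msg[0] == "reply"


def scan_cost(backlog):
    """Inspections needed to find a reply that sits behind `backlog` messages."""
    mailbox = [("work", i) for i in range(backlog)]
    mailbox.append(("reply", "lock_granted"))
    _, inspected = selective_receive(mailbox, is_reply)
    return inspected


def drained_scan_cost(backlog):
    """Same scenario, but the backlog is first moved into a private queue."""
    mailbox = [("work", i) for i in range(backlog)]
    pending = deque(mailbox)  # the process's own queue of unhandled work
    mailbox.clear()           # mailbox is now empty
    mailbox.append(("reply", "lock_granted"))
    _, inspected = selective_receive(mailbox, is_reply)
    return inspected, len(pending)
```

In this model the scan cost grows linearly with the backlog, while the
drained variant inspects only the reply itself because the backlog now
lives in the private queue. Real mailboxes keep receiving messages
while the process works, so treat this as a picture of the idea only.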

I don't have a project URL off the top of my head, but I know dynomite
was using it at one point, as were a couple of other projects.

That said, you might also want to do some light measuring of your
mnesia transaction times, just in case. The first thing that popped
into my head was message reception as well, but it might not hurt to
check.
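
A minimal sketch of that kind of measurement, again in Python as a
stand-in. The idea is to record each transaction's wall-clock time
together with the queue length at that moment, so the two can be
compared afterwards. The helper names (timed, record_sample, slowest)
are hypothetical. In Erlang the equivalent building blocks would be
timer:tc and erlang:process_info(self(), message_queue_len).

```python
import time


def timed(fn, *args):
    """Run fn(*args) and return (elapsed_seconds, result)."""
    start = time.perf_counter()
    result = fn(*args)
    return time.perf_counter() - start, result


def record_sample(samples, queue_len, fn, *args):
    """Time one call and store (queue_len, elapsed_seconds) in samples."""
    elapsed, result = timed(fn, *args)
    samples.append((queue_len, elapsed))
    return result


def slowest(samples, n=3):
    """Return the n samples with the largest elapsed time, slowest first."""
    return sorted(samples, key=lambda s: s[1], reverse=True)[:n]
```

If the slowest samples all coincide with long queues, that points
toward the mailbox rather than mnesia itself. If slow transactions show
up with short queues too, the transaction is worth a closer look.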

HTH,
Paul Davis

On Mar 19, 2010, at 9:05 PM, Bernard Duggan <bernie@REDACTED> wrote:

> Hi list.
>
> I have a bit of an involved issue, and I'm not even entirely sure what
> the question I need to ask is, so I'll explain our setup, the problem,
> my theory and why I'm not 100% convinced I'm right :)
>
> We have a couple of what I'll call "high load" processes.  At peak times
> these processes deal with a reasonably high number of messages (hundreds
> per second at least - I haven't measured exactly) and do a non-trivial
> amount of work on each one - most notably several mnesia operations, all
> contained within a single transaction.  They are implemented as
> gen_servers.  Most of the time, although they chew up a fair bit of CPU
> (maybe 50% of one core on a 4 core box at peak time), they chug along
> quite happily and keep up with what's being fed to them.  Twice in the
> last month, however, one of them has gotten into a state that has ended
> with a queue getting so big that it's exhausted the memory and crashed
> the VM.  Now that in itself wouldn't be a mystery - we've encountered it
> before with processes that simply can't service their queue fast enough
> and had that been the root of the issue then I'm quite happy that I know
> how to go about fixing it.
>
> What's different in this case, however, is that once the queue passes a
> certain length (I can't say how long exactly - I've inferred most of
> this from crash dumps, CPU and memory use graphs and so on) the
> performance of the process drops drastically to the point that it's
> serving well under one message per second and even over the course of a
> night, where load drops to near-negligible levels, it doesn't even come
> close to catching up and clearing the queue.  (In the most recent case
> CPU and memory use started climbing at 2pm one day, memory use levelled
> out overnight (with the CPU still maxed out), then continued to climb
> the next day before crashing the VM at about 1:30pm).  From the logs, it
> appears that some messages are served quite quickly, but those resulting
> in mnesia operations are, by midnight, taking anything from 1 up to ~30
> seconds /each/.  The mnesia tables in question are rarely contended
> (there's one other process that uses them once a minute), so it's not
> that we're waiting for a contended lock.
>
> So, my theory: I realised that, even though our code doesn't explicitly
> do a selective receive (and so can always just grab the first message on
> the queue), mnesia probably /does/ do one to get locks and that, in all
> likelihood, the cost of a selective receive grows as O(N) with the
> length of the queue.  I imagine that once the queue has passed a certain
> point those selective receives increase the load on our process to the
> point that it can't keep up.  By the time the input load has died back
> down at night, the queue is so long (~1M messages) that it's taking a
> serious amount of time to traverse it, meaning that even over many hours
> the queue isn't significantly shrunk down (a queue of one million
> messages being processed at one message per second is still going to
> take over 277 hours to clear).
>
> So why do I think I might be wrong?  Actually, having written all this
> down, I'm now less convinced that I am :)  It's more that I don't want
> to have missed some other possibility (garbage collection? Something in
> the internals of gen_server?) or make pronouncements/decisions based on
> a theory that's flawed in some way I can't see for myself.
>
> I've reworked the system so that incoming messages are delivered by
> gen_server:call which keeps the queues on the mnesia-using processes to
> a minimum and so far testing has looked pretty good.
>
> So I guess my question is, does my theory stack up to what people
> familiar with the internals know?
>
> (By the way - I'm aware that the system as it's described here is kind
> of terrible.  I've recently rewritten the bits in question to avoid
> mnesia entirely and the load is way down.  Unfortunately I'm in that
> position that every developer hates where I have to support an
> old-and-busted system for a while before the awesome new one has been
> through QA and into production.  Also I just want to make sure I have
> the best possible understanding of things that I can :))
>
> Thanks very much if you read this far :)
>
> Cheers,
>
> Bernard
>