Megaco Avalanche! (was: ets:update_counter/3 taking a looooong time!)

Thu May 1 09:33:16 CEST 2003

Hi Scott,

Well, I did a bit of hacking and then realised I should use my brain too!

First I replaced megaco_config:incr_trans_id_counter/1 with something I 
rolled myself, a very simple tail-recursive counter server that didn't 
use ets. Not because of any scheme or strategy I had, just because I 
can! The result was a noticeable but minor improvement. The fprof 
trace now shows that blocking occurs while waiting for a reply, but the 
performance still sucks. The media gateway end now held up the gateway 
controller beyond 18 concurrent processes instead of 16.
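
For anyone curious what such a counter server might look like, here is 
a minimal sketch. It is not the code I actually dropped into the megaco 
stack; the module name trans_counter and its function names are made up 
for illustration. The count lives in the argument of a tail-recursive 
loop, so no ets table is involved.

```erlang
-module(trans_counter).
-export([start/0, next/1, stop/1]).

%% Hypothetical stand-in for a transaction ID counter.
%% This is an illustration only, not the megaco_config code.

%% Spawn the counter process, starting from 0.
start() ->
    spawn(fun() -> loop(0) end).

%% Ask the counter for the next value and block until it replies.
%% The unique reference ties the reply to this particular request.
next(Pid) ->
    Ref = make_ref(),
    Pid ! {next, self(), Ref},
    receive
        {Ref, N} -> N
    end.

%% Tell the counter process to terminate.
stop(Pid) ->
    Pid ! stop,
    ok.

%% Tail-recursive loop: the current count is carried in the argument.
loop(N) ->
    receive
        {next, From, Ref} ->
            N1 = N + 1,
            From ! {Ref, N1},
            loop(N1);
        stop ->
            ok
    end.
```

Note that every caller still queues up behind a single process and 
blocks in the receive until its reply arrives, which would be 
consistent with fprof showing the time being spent waiting for a reply.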

Since I haven't managed to conclusively fprof the media gateway 
(possibly due to lack of experience I'm sorry to say), I decided to see 
if things improved by adding a 20ms pause between launching each 
process. Maybe all 2000 processes were trying to access ets at once, 
effectively an avalanche of processes. The performance sucked a bit less 
beyond 18 processes, but the call rate was a constant 36 cps all the way 
up to 2000 concurrent processes maintaining 2000 open calls. So the 
avalanche hypothesis seemed correct.

This figure was a lot less than the 400 cps I get doing a "churning" 
call cycle running on a dozen threads. Each thread repeatedly does a 
Megaco add followed by modify and subtract as fast as possible. So I 
know that this rate is achievable and will remain stable and constant 
for hours on end without anything going pop!

I then tweaked the launching code to introduce a 20ms pause after 
starting seven processes at a time, seven being a trial-and-error 
figure. This backs off process launching just enough to prevent the 
avalanche effect and now I can open up 2000 calls at a rate of 330 cps. 
Not quite 400 cps, but sufficient and an order of magnitude better!
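
The batched launch described above can be sketched roughly as follows. 
This is an illustration under assumptions, not my actual launching 
code: the module name batch_launch and the launch/4 function are 
hypothetical, and the batch size and pause are passed in as parameters 
rather than hard-coded.

```erlang
-module(batch_launch).
-export([launch/4]).

%% Spawn Fun(Arg) once per element of Args, sleeping PauseMs
%% milliseconds after every BatchSize spawns.
%% Returns the spawned pids in launch order.
launch(Fun, Args, BatchSize, PauseMs)
  when is_integer(BatchSize), BatchSize > 0 ->
    launch(Fun, Args, BatchSize, PauseMs, 0, []).

%% Nothing left to start: return the pids (no trailing pause).
launch(_Fun, [], _BatchSize, _PauseMs, _InBatch, Acc) ->
    lists:reverse(Acc);
%% A full batch has been started: back off before continuing.
launch(Fun, Args, BatchSize, PauseMs, BatchSize, Acc) ->
    timer:sleep(PauseMs),
    launch(Fun, Args, BatchSize, PauseMs, 0, Acc);
%% Start one more process in the current batch.
launch(Fun, [Arg | Rest], BatchSize, PauseMs, InBatch, Acc) ->
    Pid = spawn(fun() -> Fun(Arg) end),
    launch(Fun, Rest, BatchSize, PauseMs, InBatch + 1, [Pid | Acc]).
```

With a batch size of 7 and a pause of 20 the launcher cannot start more 
than 7 / 0.020 s = 350 processes per second, which sits just above the 
330 cps observed. By the same arithmetic, a 20ms pause after every 
single process caps the launch rate at 50 per second.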

So, not exactly an ets problem (I'm using Linux, not FreeBSD), but I 
haven't reversed my hacks to the megaco stack to see if there are any 
significant speed gains from avoiding ets in this situation. Probably 
not, I've been assured that ets is perfectly fast enough.
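
For anyone who wants to check the raw cost of ets:update_counter/3 on 
their own machine, a rough sketch follows. The module name 
counter_bench and the key trans_id are made up for illustration. It 
only times sequential updates from a single process, so it says nothing 
about what happens when thousands of processes hit the table at once.

```erlang
-module(counter_bench).
-export([ets_micros/1]).

%% Time N sequential ets:update_counter/3 calls from one process.
%% Returns the elapsed wall-clock time in microseconds.
ets_micros(N) when is_integer(N), N >= 0 ->
    Tab = ets:new(counter_bench, [public, set]),
    true = ets:insert(Tab, {trans_id, 0}),
    {Micros, ok} = timer:tc(fun() -> bump(Tab, N) end),
    %% Sanity check: the counter must have been incremented N times.
    N = ets:lookup_element(Tab, trans_id, 2),
    true = ets:delete(Tab),
    Micros.

%% Increment the counter N times, one update per call.
bump(_Tab, 0) ->
    ok;
bump(Tab, N) ->
    ets:update_counter(Tab, trans_id, 1),
    bump(Tab, N - 1).
```

Dividing the result of counter_bench:ets_micros(100000) by 100000 gives 
an approximate per-update cost to compare against a process-based 
counter.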

I hope this little tale helps someone out there. It's not always 
obvious what's wrong with your code when a process avalanche situation 
occurs. Ah, the joys of massively concurrent systems (-:

Pete.

Scott Lystig Fritchie wrote:
>>>>>>"pm" == Peter-Henry Mander <erlang@REDACTED> writes:
>>>>>
> 
> pm> The attached gives the output of fprof, and the last line
> pm> indicates that all the time is spent waiting for
> pm> ets:update_counter/3 to return a new transaction ID to
> pm> megaco_config:incr_trans_id_counter/1.
> 
> I've got two theories.
> 
> 1. Has your Erlang VM's size grown so large that your OS has started
>    paging memory to disk to make room?  Or has some other OS process
>    started hogging CPU cycles?
> 
>    Er, well, those are easy guess, and surprisingly easy to forget
>    about if you're distracted by other things.
> 
> 2. Is your OS platform FreeBSD (or perhaps one of the other *BSDs)?
> 
>    I've been doing some simple ETS benchmarks lately, and I've noticed
>    really weird behavior of ets:delete() (deleting lots of items in a
>    table or deleting an entire table at once) with FreeBSD 4.7-RELEASE
>    and 5.0-RELEASE and Erlang R9B-1.  If the table is large (tens of
>    thousands to millions of items), the delete operation can take up
>    to 40 times (!) longer than running on the exact same hardware
>    under a "modern" Linux (Mandrake 9.1 distribution).
> 
>    This was so surprising to me that I tried it on two different
>    machines, a Pentium III laptop and an AMD XP+ desktop.  Same thing:
>    FreeBSD was horrible in the delete case, Linux was not.
> 
>    I haven't chased down the final answer (hopefully I'll get back to
>    finding the answer and propose a fix) ... but "gprof" analysis
>    strongly suggests that libc's free() is the culprit.  Bummer.
> 
> -Scott
> 
>