Megaco Avalanche! (was: ets:update_counter/3 taking a looooong time!)

Micael Karlberg micael.karlberg@REDACTED
Mon May 5 12:08:09 CEST 2003

Hi Peter,

I have been trying to reproduce your problem without success. 
I have some questions and suggestions.

  q1: Have you changed the max counter value (max_trans_id)?
      A small value would lead to reset_trans_id_counter being
      called more often (although I admit that, from the fprof
      output, that does not seem to be the case).

  q2: The fprof results included here, were they produced
      after the system had been "warmed up"?

  s1: When fprof'ing, use fprof:apply(Func, Args, OptionList)
      and include the megaco system processes megaco_config
      and megaco_monitor ({procs, [self(),megaco_config,megaco_monitor]}).

  s2: In order to get more reliable results, run more than one
      set of call-setups: N*(add, modify & subtract).

  s3: Try setting the priority of the megaco_config process to high.

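The fprof invocation in s1 might look roughly like this. This is a sketch only: send_calls/1 is a placeholder for your own benchmark entry point, while megaco_config and megaco_monitor are the registered names of the megaco system processes.

```erlang
%% Sketch: profile the benchmark function together with the megaco
%% system processes.  send_calls/1 is a hypothetical user function.
fprof:apply(fun send_calls/1, [1000],
            [{procs, [self(), megaco_config, megaco_monitor]}]),
fprof:profile(),                            %% process the trace file
fprof:analyse([{dest, "fprof.analysis"}]).  %% write the analysis to disk
```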

Peter-Henry Mander writes:
 > Hi Scott,
 > Well, I did a bit of hacking and then realised I should use my brain too!
 > First I replaced megaco_config:incr_trans_id_counter/1 with something I 
 > rolled myself, a very simple tail-recursive counter server that didn't 
 > use ets. Not because of any scheme or strategy I had, just because I 
 > can! And the result was a noticeable but minor improvement. The fprof 
 > trace now shows that blocking occurs while waiting for a reply, but the 
 > performance still sucks. The media gateway end now held up the gateway 
 > controller beyond 18 concurrent processes instead of 16.
 > Since I haven't managed to conclusively fprof the media gateway 
 > (possibly due to lack of experience I'm sorry to say), I decided to see 
 > if things improved by adding a 20ms pause between launching each 
 > process. Maybe all 2000 processes were trying to access ets at once, 
 > effectively an avalanche of processes. The performance sucked a bit less 
 > beyond 18 processes, but the call rate was a constant 36 cps all the way 
 > up to 2000 concurrent processes maintaining 2000 open calls. So the 
 > avalanche hypothesis seemed correct.
 > This figure was a lot less than the 400 cps I get doing a "churning" 
 > call cycle running on a dozen threads. Each thread repeatedly does a 
 > Megaco add followed by modify and subtract as fast as possible. So I 
 > know that this rate is achievable and will remain stable and constant 
 > for hours on end without anything going pop!
 > I then tweaked the launching code to introduce a 20ms pause after 
 > starting seven processes at a time, seven being a trial-and-error 
 > figure. This backs off process launching just enough to prevent the 
 > avalanche effect and now I can open up 2000 calls at a rate of 330 cps. 
 > Not quite 400 cps, but sufficient and an order of magnitude better!
 > So, not exactly an ets problem (I'm using Linux, not FreeBSD), but I 
 > haven't reversed my hacks to the megaco stack to see if there is any 
 > significant speed gain from avoiding ets in this situation. Probably 
 > not; I've been assured that ets is perfectly fast enough.
 > I hope this little tale helps someone out there; it's not always 
 > obvious what's wrong with your code when a process avalanche 
 > occurs. Ah, the joys of massively concurrent systems (-:
 > Pete.
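Peter's replacement code isn't shown, but a tail-recursive counter server of the kind he describes might look roughly like this. The module name and API are invented for illustration; the real megaco code keeps the counter in an ets table updated via ets:update_counter/3.

```erlang
%% Sketch of a minimal counter server that avoids ets: the state lives
%% in the loop arguments of a single process.
-module(trans_id_counter).
-export([start/1, next/1]).

%% Start a counter process with an upper bound; the id wraps back to 1.
start(Max) ->
    spawn(fun() -> loop(1, Max) end).

%% Synchronously fetch the next transaction id.
next(Pid) ->
    Ref = make_ref(),
    Pid ! {next, self(), Ref},
    receive
        {Ref, Id} -> Id
    end.

%% Tail-recursive server loop: no shared table, no locking.
loop(N, Max) ->
    receive
        {next, From, Ref} ->
            From ! {Ref, N},
            Next = case N of Max -> 1; _ -> N + 1 end,
            loop(Next, Max)
    end.
```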
 > Scott Lystig Fritchie wrote:
 > >>>>>> "pm" == Peter-Henry Mander <erlang@REDACTED> writes:
 > > 
 > > pm> The attached gives the output of fprof, and the last line
 > > pm> indicates that all the time is spent waiting for
 > > pm> ets:update_counter/3 to return a new transaction ID to
 > > pm> megaco_config:incr_trans_id_counter/1.
 > > 
 > > I've got two theories.
 > > 
 > > 1. Has your Erlang VM's size grown so large that your OS has started
 > >    paging memory to disk to make room?  Or has some other OS process
 > >    started hogging CPU cycles?
 > > 
 > >    Er, well, those are easy guesses, and surprisingly easy to forget
 > >    about if you're distracted by other things.
 > > 
 > > 2. Is your OS platform FreeBSD (or perhaps one of the other *BSDs)?
 > > 
 > >    I've been doing some simple ETS benchmarks lately, and I've noticed
 > >    really weird behavior of ets:delete() (deleting lots of items in a
 > >    table or deleting an entire table at once) with FreeBSD 4.7-RELEASE
 > >    and 5.0-RELEASE and Erlang R9B-1.  If the table is large (tens of
 > >    thousands to millions of items), the delete operation can take up
 > >    to 40 times (!) longer than running on the exact same hardware
 > >    under a "modern" Linux (Mandrake 9.1 distribution).
 > > 
 > >    This was so surprising to me that I tried it on two different
 > >    machines, a Pentium III laptop and an AMD XP+ desktop.  Same thing:
 > >    FreeBSD was horrible in the delete case, Linux was not.
 > > 
 > >    I haven't chased down the final answer (hopefully I'll get back to
 > >    finding the answer and propose a fix) ... but "gprof" analysis
 > >    strongly suggests that libc's free() is the culprit.  Bummer.
 > > 
 > > -Scott

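The batched launch Peter describes (start seven processes, pause 20 ms, repeat) could be sketched like this; the module and function names are invented, and the batch size and pause are the trial-and-error figures from his mail.

```erlang
%% Sketch of back-off launching to avoid a process avalanche: spawn
%% workers in batches of seven with a 20 ms pause between batches.
-module(staggered_launch).
-export([launch/2]).

launch(0, _Fun) ->
    ok;
launch(N, Fun) when N > 0 ->
    Batch = erlang:min(7, N),
    [spawn(Fun) || _ <- lists:seq(1, Batch)],
    timer:sleep(20),                  %% back off before the next batch
    launch(N - Batch, Fun).
```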
Micael Karlberg          Ericsson AB, Älvsjö Sweden
Tel:  +46 8 727 5668     EAB/UHK/KD - OTP Product Development
ECN:  851 5668           Mail: micael.karlberg@REDACTED
Fax:  +46 8 727 5775    
