[erlang-questions] Performance analysis advice

Tue Feb 11 01:45:56 CET 2014

Hi,

I've recently been trying to find and eliminate bottlenecks in a web service.  I eliminated the few cases I knew were bottlenecks, and I've reached the point where I'm not sure what to look into next.

The server is basically

- webmachine
- 3 backend services fronted by dispcount
- each request is an Erlang process which calls either 2 or 3 of the backend services in sequence; when a resource is unavailable, the process calls erlang:yield/0 and tries again the next time it is scheduled.
- very little is done per request (or at least fprof doesn't show any hotspots).
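
To make the retry behaviour in the list above concrete, here is a minimal sketch of the loop. The module and function names are hypothetical, and the checkout is passed in as a fun so the sketch does not depend on the exact dispcount API; the fun is assumed to return {ok, Resource} or {error, busy}.

```erlang
-module(yield_retry).
-export([with_resource/2, demo/0]).

%% Checkout is a hypothetical fun returning {ok, Resource} | {error, busy}.
%% Work is applied to the resource once one has been obtained.
with_resource(Checkout, Work) ->
    case Checkout() of
        {ok, Resource} ->
            Work(Resource);
        {error, busy} ->
            %% Give up the scheduler slot and retry. The process stays
            %% runnable, so from the scheduler's view this is a busy-wait.
            erlang:yield(),
            with_resource(Checkout, Work)
    end.

%% Simulated backend that reports busy on the first two attempts.
%% The attempt counter lives in the caller's process dictionary.
demo() ->
    put(attempts, 0),
    Checkout = fun() ->
        N = get(attempts) + 1,
        put(attempts, N),
        case N > 2 of
            true  -> {ok, backend};
            false -> {error, busy}
        end
    end,
    with_resource(Checkout, fun(R) -> {R, get(attempts)} end).
```

Calling yield_retry:demo() returns {backend, 3}: two busy replies, then a successful checkout on the third attempt.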

The current issue seems to be a performance knee of some sort.  As I add traffic to my test node, CPU utilization spikes at around 200 concurrent requests.  It's a pretty dramatic knee; here's a sample of the data points (columns are timestamp, number of established connections from 'netstat -n', and %cpu from 'top -b').

2014/02/11-00:34:02 0 8.0
2014/02/11-00:34:04 0 12.0
2014/02/11-00:34:06 14 95.8
2014/02/11-00:34:08 30 153.8
2014/02/11-00:34:10 34 169.7
2014/02/11-00:34:12 64 269.6
2014/02/11-00:34:14 58 311.5
2014/02/11-00:34:16 81 385.4
2014/02/11-00:34:18 94 463.3
2014/02/11-00:34:21 114 523.4
2014/02/11-00:34:23 114 595.0
2014/02/11-00:34:25 153 670.9
2014/02/11-00:34:27 149 777.2
2014/02/11-00:34:29 165 1359.5
2014/02/11-00:34:31 198 1529.3
2014/02/11-00:34:33 350 1517.4
2014/02/11-00:34:36 369 1523.2
2014/02/11-00:34:38 355 1525.3

It's a 16-core machine, so 1500% is pretty much maxed out.  But notice the jump between 149 connections and 198 connections: the CPU almost doubles.
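
To put numbers on that jump, here is a small sketch comparing the 149-connection and 198-connection samples from the table above. The module and function names are only for illustration.

```erlang
-module(knee).
-export([jump/0]).

%% Returns {ConnectionRatio, CpuRatio} between the two samples
%% {149 connections, 777.2 %cpu} and {198 connections, 1529.3 %cpu}.
jump() ->
    {Conns1, Cpu1} = {149, 777.2},
    {Conns2, Cpu2} = {198, 1529.3},
    {Conns2 / Conns1, Cpu2 / Cpu1}.
```

Connections grow by a factor of about 1.33 while CPU grows by a factor of about 1.97.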

I ran the lock-counting profiler (lcnt) while the node was under load, and the output looks like this:

            lock     id  #tries  #collisions  collisions [%]  time [us]  duration [%]
           -----    --- ------- ------------ --------------- ---------- -------------
         pollset      1 2369355        71578          3.0210     553249        0.7896
       proc_main 154649 6315238        13325          0.2110     442908        0.6322
    drv_ev_state     16 2311870        14609          0.6319     140019        0.1998
       run_queue     16 9167097        55212          0.6023     117993        0.1684
       proc_link 154649 2173804         3693          0.1699     111909        0.1597
     proc_status 154649 5816882         3037          0.0522     106664        0.1522
       proc_msgq 154649 3972636         3305          0.0832      56049        0.0800
   process_table      1  444931         4711          1.0588      56042        0.0800
       timeofday      1 1323351         6874          0.5194      22078        0.0315
        atom_tab      1 1902423           22          0.0012       2187        0.0031
     timer_wheel      1  582235         1186          0.2037       1322        0.0019
 pollset_rm_list      1 1042209         2035          0.1953        942        0.0013
        make_ref      1  373475          592          0.1585        458        0.0007
    db_hash_slot    576  646402          203          0.0314        244        0.0003
  alcu_allocator      9   18806           16          0.0851        212        0.0003
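
For anyone who wants to collect the same kind of table, here is a minimal sketch using the lcnt module from the tools application. It assumes the emulator was built with lock counting enabled; the module name and the sampling window are arbitrary choices for illustration, not necessarily how the table above was produced.

```erlang
-module(lock_profile).
-export([run/1]).

%% Collect lock statistics for Millis milliseconds while the node is
%% under load, then print the conflicts table.
%% Assumes an emulator built with lock counting enabled.
run(Millis) ->
    lcnt:start(),          % start the lcnt server
    lcnt:clear(),          % reset the runtime's lock counters
    timer:sleep(Millis),   % let traffic run for the sampling window
    lcnt:collect(),        % pull counters from the runtime system
    lcnt:conflicts().      % print locks ordered by contention
```

For example, lock_profile:run(10000) samples for ten seconds.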

However, I'm not entirely sure how to interpret it.  The pollset lock seems to have to do with scheduling, and I'm sure my use of yield is not recommended, but I don't think that would explain the knee I seem to be seeing.

Anyway, I'm looking for advice from anyone with optimization experience who can point out what I'm doing wrong, or point me to libraries, talks, tools, documentation, or other things I might have missed.

Thanks,

-Anthony