[erlang-questions] Performance analysis advice

Tue Feb 11 02:04:57 CET 2014

Hi Anthony,

Glad to see someone using dispcount!

Can I ask what your pool sizes are? Does the increase correlate with
the point where you start getting busy responses or retrying a lot?

One thing you could be seeing is lock contention in dispcount itself. To
allow fair redistribution of retries, dispcount makes use of
erlang:now/0, which could easily be the problem you have there.

See https://github.com/ferd/dispcount/blob/master/src/dispcount_watcher.erl#L148-149

I had never seen it be a problem before so I never optimized it away,
but you could try subbing it with:

    dispatch_id(Num) ->
        erlang:phash2(local_unique(), Num) + 1.

    local_unique() ->
        Ct = case get('$dispcount_ct') of
            undefined -> put('$dispcount_ct', 1), 1;
            N -> put('$dispcount_ct', N+1), N+1
        end,
        {Ct,self()}.

That might be enough to keep giving an even distribution without needing
external data (now()): the pid and counter together could provide enough
uniqueness overall to work fine.
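
If you want a rough sanity check of the distribution before trying it under
load, something like the sketch below might help. It copies the two functions
above into a throwaway module and counts how often each id comes up. The
module name dispatch_check and the function histogram/2 are made up for
illustration and are not part of dispcount. Note that it only calls from a
single process, so it exercises the counter part of the key; with many
processes the pid would vary as well.

```erlang
-module(dispatch_check).
-export([histogram/2]).

%% Same logic as the suggested replacement above.
dispatch_id(Num) ->
    erlang:phash2(local_unique(), Num) + 1.

local_unique() ->
    Ct = case get('$dispcount_ct') of
        undefined -> put('$dispcount_ct', 1), 1;
        N -> put('$dispcount_ct', N+1), N+1
    end,
    {Ct, self()}.

%% Hypothetical helper: call dispatch_id/1 Calls times from the
%% current process and return [{Id, Count}] sorted by Id.
histogram(Num, Calls) ->
    Counts = lists:foldl(
        fun(_, Acc) ->
            Id = dispatch_id(Num),
            orddict:update_counter(Id, 1, Acc)
        end,
        orddict:new(),
        lists:seq(1, Calls)),
    orddict:to_list(Counts).
```

Calling dispatch_check:histogram(10, 10000) from a shell should then show
whether the counts per id are roughly level for a pool of 10.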

If it goes well, let me know and I'll patch dispcount for this, and ask
people I know who use it if they feel like giving it a try.

Regards,
Fred.

On 02/10, ANTHONY MOLINARO wrote:
> Hi,
> 
> I've been recently trying to find and eliminate bottlenecks in a web service.  I had a few cases I knew were bottlenecks which I eliminated, and I've reached the point where I'm not sure what to look into next.
> 
> The server is basically
> 
> - webmachine
> - 3 backend services fronted by dispcount
> - each request is an erlang process which calls either 2 or 3 of the backend services in sequence; when a resource is unavailable, the process will erlang:yield/0 and try again next time it is scheduled.
> - very little is done per request (or at least fprof doesn't show any hotspots).
> 
> The current issue seems to be a performance knee of some sort.  As I add traffic to my test node I see CPU utilization spike at around 200 concurrent requests.  It's a pretty dramatic knee; here's a sample of the data points (columns are timestamp, number of established connections from 'netstat -n', and %cpu from 'top -b').
> 
> 2014/02/11-00:34:02 0 8.0
> 2014/02/11-00:34:04 0 12.0
> 2014/02/11-00:34:06 14 95.8
> 2014/02/11-00:34:08 30 153.8
> 2014/02/11-00:34:10 34 169.7
> 2014/02/11-00:34:12 64 269.6
> 2014/02/11-00:34:14 58 311.5
> 2014/02/11-00:34:16 81 385.4
> 2014/02/11-00:34:18 94 463.3
> 2014/02/11-00:34:21 114 523.4
> 2014/02/11-00:34:23 114 595.0
> 2014/02/11-00:34:25 153 670.9
> 2014/02/11-00:34:27 149 777.2
> 2014/02/11-00:34:29 165 1359.5
> 2014/02/11-00:34:31 198 1529.3
> 2014/02/11-00:34:33 350 1517.4
> 2014/02/11-00:34:36 369 1523.2
> 2014/02/11-00:34:38 355 1525.3
> 
> It's a 16-core machine, so 1500% is pretty much maxed out.  But notice the jump between 149 connections and 198 connections: the CPU almost doubles.
> 
> I tried running the lock-counting profiler while the test was running, and the output looks like this:
> 
>             lock     id  #tries  #collisions  collisions [%]  time [us]  duration [%]
>            -----    --- ------- ------------ --------------- ---------- -------------
>          pollset      1 2369355        71578          3.0210     553249        0.7896
>        proc_main 154649 6315238        13325          0.2110     442908        0.6322
>     drv_ev_state     16 2311870        14609          0.6319     140019        0.1998
>        run_queue     16 9167097        55212          0.6023     117993        0.1684
>        proc_link 154649 2173804         3693          0.1699     111909        0.1597
>      proc_status 154649 5816882         3037          0.0522     106664        0.1522
>        proc_msgq 154649 3972636         3305          0.0832      56049        0.0800
>    process_table      1  444931         4711          1.0588      56042        0.0800
>        timeofday      1 1323351         6874          0.5194      22078        0.0315
>         atom_tab      1 1902423           22          0.0012       2187        0.0031
>      timer_wheel      1  582235         1186          0.2037       1322        0.0019
>  pollset_rm_list      1 1042209         2035          0.1953        942        0.0013
>         make_ref      1  373475          592          0.1585        458        0.0007
>     db_hash_slot    576  646402          203          0.0314        244        0.0003
>   alcu_allocator      9   18806           16          0.0851        212        0.0003
> 
> However, I'm not entirely sure how to interpret it.  The pollset lock seems to have to do with scheduling, and I'm sure my use of yield is not recommended, but I don't think that would explain the knee I seem to be seeing.
> 
> Anyway, I'm looking for any advice from others with optimization experience who might be able to point out what I'm doing wrong, or point me to libraries, talks, tools, documentation, or other things I might have missed.
> 
> Thanks,
> 
> -Anthony