[erlang-questions] Performance analysis advice
Fred Hebert
mononcqc@REDACTED
Tue Feb 11 02:04:57 CET 2014
Hi Anthony,
Glad to see someone using dispcount!
Can I ask what are your pool sizes? Can the time increase correlate with
the time you start getting busy or retrying a lot?
I know that what you could be seeing is locking in dispcount itself. To
allow fair redistribution of retries with dispcount, it makes use of
erlang:now/0 which could easily be the problem you have there.
See https://github.com/ferd/dispcount/blob/master/src/dispcount_watcher.erl#L148-149
I had never seen it be a problem before so I never optimized it away,
but you could try subbing it with:
dispatch_id(Num) ->
erlang:phash2(local_unique(), Num) + 1.
local_unique() ->
Ct = case get('$dispcount_ct') of
undefined -> put('$dispcount_ct', 1), 1;
N -> put('$dispcount_ct', N+1), N+1
end,
{Ct,self()}.
That might be enough to keep giving even distribution, without needing
external data (now()) -- the pid and counter could lead to enough
uniqueness overall to work fine.
If it goes well, let me know and I'll patch dispcount for this, and ask
people I know who use it if they feel like giving it a try.
Regards,
Fred.
On 02/10, ANTHONY MOLINARO wrote:
> Hi,
>
> I've been recently trying to find and eliminate bottlenecks in a web service. I had a few cases I knew were bottlenecks which I eliminated, and I've reached the point where I'm not sure what to look into next.
>
> The server is basically
>
> - webmachine
> - 3 backend services fronted by dispcount
> - each request is an erlang process which calls either 2 or 3 of the backend services in sequence, when a resource is unavailable the process will erlang:yield/0 and try again next time it is scheduled.
> - very little is done per request (or at least fprof doesn't show any hotspots).
>
> The current issue seems to a performance knee of some sort. As I add traffic to my test node I see CPU utilization spike at around 200 concurrent requests. It's a pretty dramatic knee, here's a sample of the data points (columns are timestamp, number of established connections 'netstat -n' and %cpu 'top -b').
>
> 2014/02/11-00:34:02 0 8.0
> 2014/02/11-00:34:04 0 12.0
> 2014/02/11-00:34:06 14 95.8
> 2014/02/11-00:34:08 30 153.8
> 2014/02/11-00:34:10 34 169.7
> 2014/02/11-00:34:12 64 269.6
> 2014/02/11-00:34:14 58 311.5
> 2014/02/11-00:34:16 81 385.4
> 2014/02/11-00:34:18 94 463.3
> 2014/02/11-00:34:21 114 523.4
> 2014/02/11-00:34:23 114 595.0
> 2014/02/11-00:34:25 153 670.9
> 2014/02/11-00:34:27 149 777.2
> 2014/02/11-00:34:29 165 1359.5
> 2014/02/11-00:34:31 198 1529.3
> 2014/02/11-00:34:33 350 1517.4
> 2014/02/11-00:34:36 369 1523.2
> 2014/02/11-00:34:38 355 1525.3
>
> It's a 16 core, so 1500% is pretty much maxed out. But notice the jump between 149 connections and 198 connections, it's almost double the CPU.
>
> I tried running the lock counting profiler during running and it looks like this
>
> lock id #tries #collisions collisions [%] time [us] duration [%]
> ----- --- ------- ------------ --------------- ---------- -------------
> pollset 1 2369355 71578 3.0210 553249 0.7896
> proc_main 154649 6315238 13325 0.2110 442908 0.6322
> drv_ev_state 16 2311870 14609 0.6319 140019 0.1998
> run_queue 16 9167097 55212 0.6023 117993 0.1684
> proc_link 154649 2173804 3693 0.1699 111909 0.1597
> proc_status 154649 5816882 3037 0.0522 106664 0.1522
> proc_msgq 154649 3972636 3305 0.0832 56049 0.0800
> process_table 1 444931 4711 1.0588 56042 0.0800
> timeofday 1 1323351 6874 0.5194 22078 0.0315
> atom_tab 1 1902423 22 0.0012 2187 0.0031
> timer_wheel 1 582235 1186 0.2037 1322 0.0019
> pollset_rm_list 1 1042209 2035 0.1953 942 0.0013
> make_ref 1 373475 592 0.1585 458 0.0007
> db_hash_slot 576 646402 203 0.0314 244 0.0003
> alcu_allocator 9 18806 16 0.0851 212 0.0003
>
> However, I'm not entirely sure how to interpret it, poll set seems have to do with scheduling, and I'm sure my use of yield is not recommended, but I don't think that would explain the knee I seem to be seeing.
>
> Anyway, I'm looking for any advice from others who might have some experience optimizing who might be able to point out what I'm doing wrong or point to libraries, talks, tools, documentation, or other things I might have missed.
>
> Thanks,
>
> -Anthony
> _______________________________________________
> erlang-questions mailing list
> erlang-questions@REDACTED
> http://erlang.org/mailman/listinfo/erlang-questions
More information about the erlang-questions
mailing list