[erlang-questions] Improve performance of IO-bound server written in Erlang by having a pollset for each scheduler and binding port to scheduler together with process

Wei Cao cyg.cao@REDACTED
Wed Jul 11 08:44:49 CEST 2012


Found the reason :). Please compile with

./configure CFLAGS="-DERTS_POLLSET_PER_SCHEDULER -g -O3 -fomit-frame-pointer"

otherwise compiler optimizations are disabled. (Running ./configure with no
CFLAGS set includes "-g -O3 -fomit-frame-pointer" by default.)

2012/7/11 Zabrane Mickael <zabrane3@REDACTED>:
> Hi Wei,
>
> I did some other tests on our 8-Cores 64-bit machine:
>
> before: 157141 rps
> after: 80245 rps
>
> Same behaviour as before ... hope this helps.
>
> Regards,
> Zabrane
>
> On Jul 10, 2012, at 4:30 AM, Wei Cao wrote:
>
>> My tests all used keep-alive/persistent connections, and there was a
>> significant performance gain in those situations. But I just wrote an
>> HTTP server using short connections, and its performance did drop;
>> I'll look into it later :).
>>
>> 2012/7/10 Zabrane Mickael <zabrane3@REDACTED>:
>>> Hi,
>>>
>>> The performance of our HTTP web server drops after applying your patches.
>>>
>>> Box: Linux F17, 4GB of RAM:
>>> $ lscpu
>>> Architecture:          i686
>>> CPU op-mode(s):        32-bit, 64-bit
>>> Byte Order:            Little Endian
>>> CPU(s):                4
>>> On-line CPU(s) list:   0-3
>>> Thread(s) per core:    1
>>> Core(s) per socket:    4
>>> Socket(s):             1
>>> Vendor ID:             GenuineIntel
>>> CPU family:            6
>>> Model:                 23
>>> Stepping:              7
>>> CPU MHz:               2499.772
>>> BogoMIPS:              4999.54
>>> Virtualization:        VT-x
>>> L1d cache:             32K
>>> L1i cache:             32K
>>> L2 cache:              3072K
>>>
>>> Bench:
>>> before: 77787 rps
>>> after:    53056 rps
>>>
>>> Any hint to explain this result, Wei?
>>>
>>> Regards,
>>> Zabrane
>>>
>>>
>>> On Jul 9, 2012, at 11:01 AM, Wei Cao wrote:
>>>
>>> Hi, all
>>>
>>> We wrote a proxy server in Erlang. The proxy's logic is rather simple:
>>> it listens on a TCP port, accepts new connections from user clients,
>>> and after authentication forwards packets back and forth between the
>>> client and the backend server until the connection is closed.
>>>
>>> It's very easy to write such a proxy in Erlang: fork a process for
>>> each new user connection and connect to the backend server from the same
>>> process. The process works like a pipe: sockets on both sides are set
>>> to active-once mode, and whenever a TCP packet is received from one
>>> socket, it is sent on to the other socket. (A simplified version
>>> of the proxy code is attached at the end of the mail.)
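>>>
>>> A condensed sketch of that per-connection process (function and variable
>>> names here are illustrative; see the attached proxy.erl for the real code):
>>>
>>> pipe(Client, Backend) ->
>>>   inet:setopts(Client, [{active, once}]),
>>>   inet:setopts(Backend, [{active, once}]),
>>>   receive
>>>       {tcp, Client, Data} ->
>>>           gen_tcp:send(Backend, Data),
>>>           pipe(Client, Backend);
>>>       {tcp, Backend, Data} ->
>>>           gen_tcp:send(Client, Data),
>>>           pipe(Client, Backend);
>>>       {tcp_closed, _} ->
>>>           gen_tcp:close(Client),
>>>           gen_tcp:close(Backend)
>>>   end.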
>>>
>>> However, the performance is not quite satisfying: the proxy can handle
>>> at most about 40k requests per second on our 16-core machine (Intel Xeon
>>> L5630, 2.13GHz) under heavy stress (100 concurrent clients). We then
>>> analyzed the behavior of beam.smp using tools like tcprstat, mpstat,
>>> perf, and SystemTap.
>>>
>>> tcprstat shows QPS is about 40k, with an average latency of 1.7
>>> milliseconds per request.
>>>
>>> timestamp count max min avg med stddev 95_max 95_avg 95_std 99_max 99_avg 99_std
>>> 1341813326 39416 17873 953 1718 1519 724 2919 1609 340 3813 1674 462
>>> 1341813327 40528 9275 884 1645 1493 491 2777 1559 290 3508 1619 409
>>> 1341813328 40009 18105 925 1694 1507 714 2868 1586 328 3753 1650 450
>>>
>>>
>>> mpstat shows about 30% of the CPU is idle:
>>>
>>> 03:30:19 PM  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest   %idle
>>> 03:30:20 PM  all   38.69    0.00   21.92    0.00    0.06    7.52    0.00    0.00   31.80
>>> 03:30:21 PM  all   37.56    0.00   21.99    0.00    0.00    7.50    0.00    0.00   32.95
>>>
>>> and perf top shows that much time is wasted in scheduler_wait, in spin
>>> wait I guess:
>>>
>>> 9320.00 19.8% scheduler_wait   /home/mingsong.cw/erlangr16b/lib/erlang/erts-5.10/bin/beam.smp
>>> 1813.00  3.9% process_main     /home/mingsong.cw/erlangr16b/lib/erlang/erts-5.10/bin/beam.smp
>>> 1379.00  2.9% _spin_lock       /usr/lib/debug/lib/modules/2.6.32-131.21.1.tb477.el6.x86_64/vmlinux
>>> 1201.00  2.6% schedule         /home/mingsong.cw/erlangr16b/lib/erlang/erts-5.10/bin/beam.smp
>>>
>>> It seems the performance may be associated with scheduler_wait() and
>>> erts_check_io(). With a SystemTap script (attached at the end of this
>>> mail), we can find out how many times beam.smp invokes the epoll_wait
>>> system call and how many revents it gets each time.
>>>
>>> cpu    process     times   revents    min    max    avg   timeouts
>>> all                 1754    128042      -      -     73          3
>>> [14]   beam.smp     151      14127     82     97     93          0
>>> [ 5]   beam.smp     142      13291     83     97     93          0
>>> [13]   beam.smp     127      11948     86     96     94          0
>>> [ 6]   beam.smp     127      11836     81     96     93          0
>>> [ 4]   beam.smp     121      11323     81     96     93          0
>>> [15]   beam.smp     117      10935     83     96     93          0
>>> [12]   beam.smp     486      10128      0     96     20          2
>>> [ 1]   beam.smp      71       6549     71    100     92          0
>>> [ 2]   beam.smp      62       5695     82     96     91          0
>>> [ 7]   beam.smp      55       5102     81     95     92          0
>>> [11]   beam.smp      52       4822     85     95     92          0
>>> [ 9]   beam.smp      52       4799     85     95     92          0
>>> [ 8]   beam.smp      51       4680     78     95     91          0
>>> [10]   beam.smp      49       4508     85     97     92          0
>>> [ 3]   beam.smp      46       4211     81     95     91          0
>>> [ 0]   beam.smp      44       4088     83     95     92          0
>>>
>>> The results show that epoll_wait is invoked 1754 times per second and
>>> gets 73 I/O events on average per call. This is unacceptable for a high
>>> performance server: if epoll_wait is invoked no more than 2k times per
>>> second, the average interval between polls is at least 0.5 ms (500
>>> microseconds), so reading or writing a packet can easily cost an extra
>>> half millisecond, which adds delay and ultimately hurts throughput.
>>>
>>> The problem lies in there being only one global, system-wide pollset,
>>> so at any given time no more than one scheduler can call
>>> erts_check_io() to obtain pending I/O tasks from the underlying pollset,
>>> and no scheduler can call erts_check_io() again before all pending I/O
>>> tasks have been processed. So for an I/O-bound application, it's very
>>> likely that a scheduler finishes its own work but must wait idly for
>>> the other schedulers to complete theirs.
>>>
>>> Hence, we developed a patch to solve this problem by having a pollset
>>> for each scheduler, so that each scheduler can invoke erts_check_io()
>>> on its own pollset concurrently. After a scheduler completes its tasks,
>>> it can invoke erts_check_io() immediately, no matter what state the
>>> other schedulers are in. This patch also handles port migration: all
>>> file descriptors used by each port are recorded, and when a port is
>>> migrated, these fds are removed from the original scheduler's pollset
>>> and added to the new scheduler's.
>>>
>>> Binding a port to the same scheduler as its owning process also helps
>>> performance: it reduces the cost of thread switches and
>>> synchronization, and a bound port won't be migrated between schedulers.
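>>>
>>> For example, the servers below pin one acceptor process per scheduler
>>> with spawn_opt's scheduler option, so each accepted socket is owned by a
>>> process that stays on that scheduler; with the bind_port_to_scheduler
>>> patch the port then stays on the same scheduler as well. A minimal
>>> sketch (condensed from the attached code):
>>>
>>> start_acceptors(Listen) ->
>>>   N = erlang:system_info(schedulers),
>>>   %% one acceptor per scheduler, pinned via the scheduler option
>>>   [spawn_opt(?MODULE, accept, [Listen, I], [{scheduler, I}])
>>>       || I <- lists:seq(1, N)].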
>>>
>>> After applying the two patches, under the same pressure (100 concurrent
>>> clients), epoll_wait is invoked 49332 times per second and gets 3
>>> revents on average each time; that is to say, our server responds
>>> quicker and becomes more realtime.
>>>
>>> cpu    process     times   revents    min    max    avg   timeouts
>>> all                49332    217186      -      -      4          3
>>> [ 2]   beam.smp    3219      16050      2      7      4          0
>>> [11]   beam.smp    4275      16033      1      6      3          0
>>> [ 8]   beam.smp    4240      15992      1      6      3          0
>>> [ 9]   beam.smp    4316      15964      0      6      3          2
>>> [10]   beam.smp    4139      15851      1      6      3          0
>>> [ 3]   beam.smp    4256      15816      1      6      3          0
>>> [ 1]   beam.smp    3107      15800      2      7      5          0
>>> [ 0]   beam.smp    3727      15259      1      6      4          0
>>> [ 7]   beam.smp    2810      14722      3      7      5          0
>>> [13]   beam.smp    1981      11499      4      7      5          0
>>> [15]   beam.smp    2285      10942      3      6      4          0
>>> [14]   beam.smp    2258      10866      3      6      4          0
>>> [ 4]   beam.smp    2246      10849      3      6      4          0
>>> [ 6]   beam.smp    2206      10730      3      6      4          0
>>> [12]   beam.smp    2173      10573      3      6      4          0
>>> [ 5]   beam.smp    2093      10240      3      6      4          0
>>>
>>> scheduler_wait no longer takes so much time now:
>>>
>>>             169.00  6.2% process_main                    beam.smp
>>>              55.00  2.0% _spin_lock                      [kernel]
>>>              45.00  1.7% driver_deliver_term             beam.smp
>>>
>>> and idle CPU time has dropped as well:
>>> 04:30:44 PM  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest   %idle
>>> 04:30:45 PM  all   60.34    0.00   21.44    0.00    0.06   16.45    0.00    0.00    1.71
>>> 04:30:46 PM  all   60.99    0.00   21.22    0.00    0.00   16.26    0.00    0.00    1.52
>>>
>>> and tcprstat shows QPS reaching 100K, with latency under 1 millisecond:
>>>
>>> timestamp count max min avg med stddev 95_max 95_avg 95_std 99_max 99_avg 99_std
>>> 1341822689 96078 11592 314 910 817 311 1447 869 228 1777 897 263
>>> 1341822690 100651 24245 209 914 819 381 1458 870 229 1800 899 263
>>>
>>> I also wrote an extremely simple keep-alive HTTP server (attached at the
>>> end of the mail) to compare performance before and after applying the
>>> patches. Measured with the Apache ab tool (ab -n 1000000 -c 100 -k), a
>>> 30% performance gain can be seen.
>>>
>>> before
>>> Requests per second:    103671.70 [#/sec] (mean)
>>> Time per request:       0.965 [ms] (mean)
>>>
>>> after
>>> Requests per second:    133701.24 [#/sec] (mean)
>>> Time per request:       0.748 [ms] (mean)
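>>>
>>> To reproduce, roughly: compile ehttpd.erl and start it in an Erlang
>>> shell (it listens on port 8888 by default), then point ab at that port
>>> with the options above:
>>>
>>> 1> c(ehttpd).
>>> 2> ehttpd:start().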
>>>
>>> Patches can be found on GitHub; compile with
>>> ./configure CFLAGS=-DERTS_POLLSET_PER_SCHEDULER
>>>
>>> Pollset per scheduler:
>>>
>>> git fetch git://github.com/weicao/otp.git pollset_per_scheduler
>>>
>>> https://github.com/weicao/otp/compare/weicao:master...weicao:pollset_per_scheduler
>>> https://github.com/weicao/otp/compare/weicao:master...weicao:pollset_per_scheduler.patch
>>>
>>>
>>> Bind port to scheduler:
>>>
>>> git fetch git://github.com/weicao/otp.git bind_port_to_scheduler
>>>
>>> https://github.com/weicao/otp/compare/weicao:pollset_per_scheduler...weicao:bind_port_to_scheduler
>>> https://github.com/weicao/otp/compare/weicao:pollset_per_scheduler...weicao:bind_port_to_scheduler.patch
>>>
>>>
>>> Appendix:
>>>
>>> -----------------------------------
>>> proxy.erl
>>> ------------------------------------
>>>
>>> -module(proxy).
>>> -compile(export_all).
>>>
>>> -define(RECBUF_SIZE, 8192).
>>> -define(ACCEPT_TIMEOUT, 2000).
>>>
>>> start([MyPortAtom, DestIpAtom, DestPortAtom]) ->
>>>   {MyPort, []} =  string:to_integer(atom_to_list(MyPortAtom)),
>>>   DestIp = atom_to_list(DestIpAtom),
>>>   {DestPort, []} = string:to_integer(atom_to_list(DestPortAtom)),
>>>   listen(MyPort, DestIp, DestPort),
>>>   receive Any ->
>>>       io:format("recv ~p~n", [Any])
>>>   end.
>>>
>>> listen(MyPort, DestIp, DestPort) ->
>>>   io:format("start proxy on [local] 0.0.0.0:~p -> [remote] ~p:~p~n",
>>> [MyPort, DestIp, DestPort]),
>>>   case gen_tcp:listen(MyPort,
>>>          [inet,
>>>           {ip, {0,0,0,0}},
>>>           binary,
>>>           {reuseaddr, true},
>>>           {recbuf, ?RECBUF_SIZE},
>>>           {active, false},
>>>           {nodelay, true}
>>>          ]) of
>>>       {ok, Listen} ->
>>>           N = erlang:system_info(schedulers),
>>>           lists:foreach(fun(I) -> accept(Listen, DestIp, DestPort,
>>> I) end, lists:seq(1,N));
>>>       {error, Reason} ->
>>>           io:format("error listen ~p~n", [Reason])
>>>   end.
>>>
>>> accept(Listen, DestIp, DestPort, I) ->
>>>   spawn_opt(?MODULE, loop, [Listen, DestIp, DestPort, I], [{scheduler,
>>> I}]).
>>>
>>> loop(Listen, DestIp, DestPort, I) ->
>>>   case gen_tcp:accept(Listen, ?ACCEPT_TIMEOUT) of
>>>       {ok, S1} ->
>>>           accept(Listen, DestIp, DestPort, I),
>>>           case catch gen_tcp:connect(DestIp, DestPort,
>>>                          [inet, binary, {active, false},
>>>                           {reuseaddr, true}, {nodelay, true}]) of
>>>               {ok, S2} ->
>>>                   io:format("new connection~n"),
>>>                   loop1(S1, S2);
>>>               {error, Reason} ->
>>>                   io:format("error connect ~p~n", [Reason])
>>>           end;
>>>       {error, timeout} ->
>>>           loop(Listen, DestIp, DestPort, I);
>>>       Error ->
>>>           io:format("error accept ~p~n", [Error]),
>>>           accept(Listen, DestIp, DestPort, I)
>>>   end.
>>>
>>> loop1(S1, S2) ->
>>>   active(S1, S2),
>>>   receive
>>>       {tcp, S1, Data} ->
>>>           gen_tcp:send(S2, Data),
>>>           loop1(S1, S2);
>>>       {tcp, S2, Data} ->
>>>           gen_tcp:send(S1, Data),
>>>           loop1(S1, S2);
>>>       {tcp_closed, S1} ->
>>>           io:format("S1 close~n"),
>>>           gen_tcp:close(S1),
>>>           gen_tcp:close(S2);
>>>       {tcp_closed, S2} ->
>>>           io:format("S2 close~n"),
>>>           gen_tcp:close(S1),
>>>           gen_tcp:close(S2)
>>>   end.
>>>
>>> active(S1,S2)->
>>>   inet:setopts(S1, [{active, once}]),
>>>   inet:setopts(S2, [{active, once}]).
>>>
>>> -----------------------------------
>>> epollwait.stp
>>> ------------------------------------
>>> #! /usr/bin/env stap
>>> #
>>> #
>>>
>>> global epoll_timeout_flag, epoll_count, epoll_min, epoll_max,
>>> epoll_times, epoll_timeouts
>>>
>>> probe  syscall.epoll_wait {
>>>   if(timeout > 0) {
>>>       epoll_timeout_flag[pid()] = 1
>>>   }
>>> }
>>>
>>> probe  syscall.epoll_wait.return {
>>> c = cpu()
>>> p = execname()
>>>
>>> epoll_times[c,p] ++
>>> epoll_count[c,p] += $return
>>>
>>> if($return == 0 && pid() in epoll_timeout_flag) {
>>>   epoll_timeouts[c,p] ++
>>>   delete epoll_timeout_flag[pid()]
>>> }
>>>
>>> if(!([c, p] in epoll_min)) {
>>>   epoll_min[c,p] = $return
>>> } else if($return < epoll_min[c,p]) {
>>>   epoll_min[c,p] = $return
>>> }
>>>
>>>
>>> if($return > epoll_max[c,p]) {
>>>   epoll_max[c,p] = $return
>>> }
>>> }
>>>
>>> probe timer.s($1) {
>>> printf ("%4s %45s %10s %10s %10s %10s %10s %10s\n", "cpu",
>>> "process", "times", "revents", "min", "max", "avg", "timeouts" )
>>> foreach ([cpu, process] in epoll_count- limit 20) {
>>>     all_epoll_times += epoll_times[cpu,process]
>>>     all_epoll_count += epoll_count[cpu,process]
>>>     all_epoll_timeouts += epoll_timeouts[cpu,process]
>>> }
>>> printf ("%4s %45s %10d %10d %10s %10s %10d %10d\n",
>>>   "all", "", all_epoll_times, all_epoll_count, "-", "-",
>>>   all_epoll_count == 0? 0:all_epoll_count/all_epoll_times,
>>> all_epoll_timeouts)
>>>
>>> foreach ([cpu, process] in epoll_count- limit 20) {
>>>    printf ("[%2d] %45s %10d %10d %10d %10d %10d %10d\n",
>>>       cpu, process, epoll_times[cpu, process], epoll_count[cpu, process],
>>>       epoll_min[cpu, process], epoll_max[cpu, process],
>>>       epoll_count[cpu,process]/epoll_times[cpu,process],
>>>       epoll_timeouts[cpu,process])
>>> }
>>> delete epoll_count
>>> delete epoll_min
>>> delete epoll_max
>>> delete epoll_times
>>> delete epoll_timeouts
>>> printf("--------------------------------------------------------------------------\n\n")
>>> }
>>>
>>> ------------------------------------------------
>>> ehttpd.erl
>>> -------------------------------------------------
>>> -module(ehttpd).
>>> -compile(export_all).
>>>
>>> start() ->
>>>   start(8888).
>>> start(Port) ->
>>>   N = erlang:system_info(schedulers),
>>>   listen(Port, N),
>>>   io:format("ehttpd ready with ~b schedulers on port ~b~n", [N, Port]),
>>>
>>>   register(?MODULE, self()),
>>>   receive Any -> io:format("~p~n", [Any]) end.  %% to stop: ehttpd!stop.
>>>
>>> listen(Port, N) ->
>>>   Opts = [inet,
>>>           binary,
>>>           {active, false},
>>>           {recbuf, 8192},
>>>           {nodelay,true},
>>>           {reuseaddr, true}],
>>>
>>>   {ok, S} = gen_tcp:listen(Port, Opts),
>>>   lists:foreach(fun(I)-> spawn_opt(?MODULE, accept, [S, I],
>>> [{scheduler, I}]) end, lists:seq(1, N)).
>>>
>>> accept(S, I) ->
>>>   case gen_tcp:accept(S) of
>>>       {ok, Socket} ->
>>>           spawn_opt(?MODULE, accept, [S, I],[{scheduler,I}] ),
>>>           io:format("new connection @~p~n", [I]),
>>>           loop(Socket,<<>>);
>>>       Error    -> erlang:error(Error)
>>>   end.
>>>
>>> loop(S,B) ->
>>>   inet:setopts(S, [{active, once}]),
>>>   receive
>>>       {tcp, S, Data} ->
>>>           B1 = <<B/binary, Data/binary>>,
>>>           case binary:part(B1,{byte_size(B1), -4}) of
>>>               <<"\r\n\r\n">> ->
>>>                   Response = <<"HTTP/1.1 200 OK\r\nContent-Length: 12\r\nConnection: Keep-Alive\r\n\r\nhello world!">>,
>>>                   gen_tcp:send(S, Response),
>>>                   loop(S, <<>>);
>>>               _ ->
>>>                   loop(S, B1)
>>>           end;
>>>       {tcp_closed, S} ->
>>>           io:format("connection closed forced~n"),
>>>           gen_tcp:close(S);
>>>       Error ->
>>>           io:format("unexpected message~p~n", [Error]),
>>>           Error
>>>   end.
>>>
>>>
>>> --
>>>
>>> Best,
>>>
>>> Wei Cao
>>> _______________________________________________
>>> erlang-questions mailing list
>>> erlang-questions@REDACTED
>>> http://erlang.org/mailman/listinfo/erlang-questions
>>>
>>>
>>>
>>
>>
>>
>> --
>>
>> Best,
>>
>> Wei Cao
>
>
>



-- 

Best,

Wei Cao


