Hi,

The performance of our HTTP web server drops after applying your patches.

Box: Linux F17, 4GB of RAM:

$ lscpu
Architecture:          i686
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                4
On-line CPU(s) list:   0-3
Thread(s) per core:    1
Core(s) per socket:    4
Socket(s):             1
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 23
Stepping:              7
CPU MHz:               2499.772
BogoMIPS:              4999.54
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              3072K

Bench:
before: 77787 rps
after:  53056 rps

Any hint to explain this result, Wei?

Regards,
Zabrane

On Jul 9, 2012, at 11:01 AM, Wei Cao wrote:

Hi, all

We wrote a proxy server in Erlang. The proxy's logic is rather simple: it
listens on a TCP port, accepts new connections from user clients, and after
authentication forwards packets back and forth between the client and the
backend server until the connection is closed.

It's very easy to write such a proxy in Erlang: fork a process for each new
user connection and connect to the backend server from the same process. The
process works like a pipe; the sockets on both sides are set to {active, once}
mode, and whenever a TCP packet is received from one socket it is sent to the
other socket. (A simplified version of the proxy code is attached at the end
of the mail.)
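The core of that relay loop, with error handling stripped, is sketched below
(the attached proxy.erl is the real version):

relay(Client, Backend) ->
    %% re-arm {active, once} on both sockets, then wait for the next packet
    inet:setopts(Client,  [{active, once}]),
    inet:setopts(Backend, [{active, once}]),
    receive
        {tcp, Client, Data} ->
            gen_tcp:send(Backend, Data),
            relay(Client, Backend);
        {tcp, Backend, Data} ->
            gen_tcp:send(Client, Data),
            relay(Client, Backend);
        {tcp_closed, _} ->
            gen_tcp:close(Client),
            gen_tcp:close(Backend)
    end.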
However, the performance is not quite satisfying: the proxy can handle a
maximum of only 40k requests per second on our 16-core machine (Intel Xeon
L5630, 2.13GHz) under heavy stress (100 concurrent clients). We then analyzed
the behavior of beam.smp using tools like tcprstat, mpstat, perf and
SystemTap.

tcprstat shows QPS is about 40k, with an average latency of 1.7 milliseconds
per request:

timestamp   count  max    min  avg   med   stddev  95_max  95_avg  95_std  99_max  99_avg  99_std
1341813326  39416  17873  953  1718  1519  724     2919    1609    340     3813    1674    462
1341813327  40528  9275   884  1645  1493  491     2777    1559    290     3508    1619    409
1341813328  40009  18105  925  1694  1507  714     2868    1586    328     3753    1650    450
style="white-space:pre">        </span>450<br><br><br>mpstat shows 30% CPU is idle,<br><br>03:30:19 PM  CPU    %usr   %nice    %sys %iowait    %irq   %soft<br>%steal  %guest   %idle<br>03:30:20 PM  all   38.69    0.00   21.92    0.00    0.06    7.52<br>0.00    0.00   31.80<br>03:30:21 PM  all   37.56    0.00   21.99    0.00    0.00    7.50<br>0.00    0.00   32.95<br><br>and perf top shows, much time is wasted in scheduler_wait, in spin wait I guess.<br><br>9320.00 19.8% scheduler_wait<br>/home/mingsong.cw/erlangr16b/lib/erlang/erts-5.10/bin/beam.smp<br>1813.00  3.9% process_main<br>/home/mingsong.cw/erlangr16b/lib/erlang/erts-5.10/bin/beam.smp<br>1379.00  2.9% _spin_lock<br>/usr/lib/debug/lib/modules/2.6.32-131.21.1.tb477.el6.x86_64/vmlinux<br>1201.00  2.6% schedule<br>/home/mingsong.cw/erlangr16b/lib/erlang/erts-5.10/bin/beam.smp<br><br>It seems the performance may be associated with scheduler_wait() and<br>erts_check_io(), with a SystemTap script(attached at the end of this<br>mail), we can find out how many times the system call epoll_wait is<br>invoked by beam.smp and each time, how many revents it gets.<br><br> cpu                                       process      times<br>revents        min        max        avg   timeouts<br> all<br>1754     128042          -          -         73          3<br>[14]                                      beam.smp        151<br>14127         82         97         93          0<br>[ 5]                                      beam.smp        142<br>13291         83         97         93          0<br>[13]                                      beam.smp        127<br>11948         86         96         94          0<br>[ 6]                                      beam.smp        127<br>11836         81         96         93          0<br>[ 4]                                      beam.smp        121<br>11323         81         96         93          0<br>[15]                                      beam.smp        117<br>10935         83         96         93          0<br>[12]                                      beam.smp        486<br>10128          0         96         20          2<br>[ 1]                                      beam.smp         71<br>6549         71        100         92          0<br>[ 2]                                      beam.smp         62<br>5695         82         96         91          0<br>[ 7]                                      beam.smp         55<br>5102         81         95         92          0<br>[11]                                      beam.smp         52<br>4822         85         95         92          0<br>[ 9]                                      beam.smp         52<br>4799         85         95         92          0<br>[ 8]                                      beam.smp         51<br>4680         78         95         91          0<br>[10]                                      beam.smp         49<br>4508         85         97         92          0<br>[ 3]                                      beam.smp         46<br>4211         81         95         91          0<br>[ 0]                                      beam.smp         44<br>4088         83         95         92          0<br><br>The resuls shows, epoll_wait is invoked 1754 times each second, and<br>get 73 io events in average. This is unacceptable for writing high<br>performance server. 
The problem lies in the fact that there is only one global pollset in the
whole system, so at any time at most one scheduler can call erts_check_io() to
fetch pending io tasks from the underlying pollset, and no scheduler can call
erts_check_io() again before all pending io tasks have been processed. For an
io-bound application it is therefore very likely that a scheduler finishes its
own work but has to wait idly for the other schedulers to complete theirs.

Hence, we developed a patch to solve this problem by giving each scheduler its
own pollset, so that every scheduler can invoke erts_check_io() on its own
pollset concurrently. After a scheduler completes its tasks it can invoke
erts_check_io() immediately, no matter what state the other schedulers are in.
The patch also handles port migration: all file descriptors used by each port
are recorded, and when a port is migrated these fds are removed from the
original scheduler's pollset and added to the new scheduler's.

Binding ports to schedulers, together with processes, also helps performance:
it reduces the cost of thread switches and synchronization, and a bound port
won't be migrated between schedulers.
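For reference, the attached proxy.erl and ehttpd.erl already pin one acceptor
process per scheduler with spawn_opt's {scheduler, I} option; a condensed
sketch of that pattern (illustrative names, the full versions are in the
appendix):

start_acceptors(ListenSocket) ->
    N = erlang:system_info(schedulers),
    [spawn_opt(fun() -> accept_loop(ListenSocket) end,
               [{scheduler, I}])            %% pin the acceptor to scheduler I
     || I <- lists:seq(1, N)].

accept_loop(ListenSocket) ->
    {ok, Socket} = gen_tcp:accept(ListenSocket),
    %% ... serve Socket or hand it off, then accept again ...
    accept_loop(ListenSocket).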
After applying the two patches, under the same pressure (100 concurrent
clients), epoll_wait is invoked 49332 times per second and gets about 3
revents per call on average; that is to say, our server responds more quickly
and becomes more real-time.

 cpu    process     times   revents   min   max   avg   timeouts
 all                49332    217186     -     -     4          3
[ 2]    beam.smp     3219     16050     2     7     4          0
[11]    beam.smp     4275     16033     1     6     3          0
[ 8]    beam.smp     4240     15992     1     6     3          0
[ 9]    beam.smp     4316     15964     0     6     3          2
[10]    beam.smp     4139     15851     1     6     3          0
[ 3]    beam.smp     4256     15816     1     6     3          0
[ 1]    beam.smp     3107     15800     2     7     5          0
[ 0]    beam.smp     3727     15259     1     6     4          0
[ 7]    beam.smp     2810     14722     3     7     5          0
[13]    beam.smp     1981     11499     4     7     5          0
[15]    beam.smp     2285     10942     3     6     4          0
[14]    beam.smp     2258     10866     3     6     4          0
[ 4]    beam.smp     2246     10849     3     6     4          0
[ 6]    beam.smp     2206     10730     3     6     4          0
[12]    beam.smp     2173     10573     3     6     4          0
[ 5]    beam.smp     2093     10240     3     6     4          0

scheduler_wait no longer takes so much time:

  169.00  6.2% process_main            beam.smp
   55.00  2.0% _spin_lock              [kernel]
   45.00  1.7% driver_deliver_term     beam.smp

and neither does idle CPU time:

04:30:44 PM  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest   %idle
04:30:45 PM  all   60.34    0.00   21.44    0.00    0.06   16.45    0.00    0.00    1.71
04:30:46 PM  all   60.99    0.00   21.22    0.00    0.00   16.26    0.00    0.00    1.52

and tcprstat shows QPS reaches 100k with latency below 1 millisecond:

timestamp   count   max    min  avg  med  stddev  95_max  95_avg  95_std  99_max  99_avg  99_std
1341822689  96078   11592  314  910  817  311     1447    869     228     1777    897     263
1341822690  100651  24245  209  914  819  381     1458    870     229     1800    899     263

I also wrote an extremely simple keep-alive HTTP server (attached at the end
of the mail) to compare performance before and after applying the patches.
Measured with the apache ab tool (ab -n 1000000 -c 100 -k), a 30% performance
gain is obtained.

before
Requests per second:    103671.70 [#/sec] (mean)
Time per request:       0.965 [ms] (mean)

after
Requests per second:    133701.24 [#/sec] (mean)
Time per request:       0.748 [ms] (mean)
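To reproduce the ab numbers, the attached ehttpd.erl can be driven from an
Erlang shell roughly like this (it listens on port 8888 by default; the host
in the ab URL is an assumption, I simply run ab against the same box):

1> c(ehttpd).
2> spawn(fun ehttpd:start/0).   %% run in its own process so the shell stays usable
%% then, from another terminal:  ab -n 1000000 -c 100 -k http://127.0.0.1:8888/
3> ehttpd ! stop.               %% the registered process leaves its receive and exits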
Patches can be found at github; compile with
./configure CFLAGS=-DERTS_POLLSET_PER_SCHEDULER

Pollset per scheduler:

git fetch git://github.com/weicao/otp.git pollset_per_scheduler

https://github.com/weicao/otp/compare/weicao:master...weicao:pollset_per_scheduler
https://github.com/weicao/otp/compare/weicao:master...weicao:pollset_per_scheduler.patch


Bind port to scheduler:

git fetch git://github.com/weicao/otp.git bind_port_to_scheduler

https://github.com/weicao/otp/compare/weicao:pollset_per_scheduler...weicao:bind_port_to_scheduler
https://github.com/weicao/otp/compare/weicao:pollset_per_scheduler...weicao:bind_port_to_scheduler.patch


Appendix:

-----------------------------------
proxy.erl
------------------------------------

-module(proxy).
-compile(export_all).

-define(RECBUF_SIZE, 8192).
-define(ACCEPT_TIMEOUT, 2000).

start([MyPortAtom, DestIpAtom, DestPortAtom]) ->
    {MyPort, []} = string:to_integer(atom_to_list(MyPortAtom)),
    DestIp = atom_to_list(DestIpAtom),
    {DestPort, []} = string:to_integer(atom_to_list(DestPortAtom)),
    listen(MyPort, DestIp, DestPort),
    receive Any ->
        io:format("recv ~p~n", [Any])
    end.

listen(MyPort, DestIp, DestPort) ->
    io:format("start proxy on [local] 0.0.0.0:~p -> [remote] ~p:~p~n",
              [MyPort, DestIp, DestPort]),
    case gen_tcp:listen(MyPort,
           [inet,
            {ip, {0,0,0,0}},
            binary,
            {reuseaddr, true},
            {recbuf, ?RECBUF_SIZE},
            {active, false},
            {nodelay, true}
           ]) of
        {ok, Listen} ->
            N = erlang:system_info(schedulers),
            lists:foreach(fun(I) -> accept(Listen, DestIp, DestPort, I) end,
                          lists:seq(1, N));
        {error, Reason} ->
            io:format("error listen ~p~n", [Reason])
    end.

accept(Listen, DestIp, DestPort, I) ->
    spawn_opt(?MODULE, loop, [Listen, DestIp, DestPort, I], [{scheduler, I}]).

loop(Listen, DestIp, DestPort, I) ->
    case gen_tcp:accept(Listen, ?ACCEPT_TIMEOUT) of
        {ok, S1} ->
            accept(Listen, DestIp, DestPort, I),
            case catch gen_tcp:connect(DestIp, DestPort,
                           [inet, binary, {active, false},
                            {reuseaddr, true}, {nodelay, true}]) of
                {ok, S2} ->
                    io:format("new connection~n"),
                    loop1(S1, S2);
                {error, Reason} ->
                    io:format("error connect ~p~n", [Reason])
            end;
        {error, timeout} ->
            loop(Listen, DestIp, DestPort, I);
        Error ->
            io:format("error accept ~p~n", [Error]),
            accept(Listen, DestIp, DestPort, I)
    end.
loop1(S1, S2) ->
    active(S1, S2),
    receive
        {tcp, S1, Data} ->
            gen_tcp:send(S2, Data),
            loop1(S1, S2);
        {tcp, S2, Data} ->
            gen_tcp:send(S1, Data),
            loop1(S1, S2);
        {tcp_closed, S1} ->
            io:format("S1 close~n"),
            gen_tcp:close(S1),
            gen_tcp:close(S2);
        {tcp_closed, S2} ->
            io:format("S2 close~n"),
            gen_tcp:close(S1),
            gen_tcp:close(S2)
    end.

active(S1, S2) ->
    inet:setopts(S1, [{active, once}]),
    inet:setopts(S2, [{active, once}]).

-----------------------------------
epollwait.stp
------------------------------------
#! /usr/bin/env stap
#
# Print per-CPU epoll_wait statistics (call count, returned revents,
# min/max/avg, timeouts) every $1 seconds, e.g.:  stap epollwait.stp 1
#

global epoll_timeout_flag, epoll_count, epoll_min, epoll_max,
       epoll_times, epoll_timeouts

probe syscall.epoll_wait {
    if(timeout > 0) {
        epoll_timeout_flag[pid()] = 1
    }
}

probe syscall.epoll_wait.return {
  c = cpu()
  p = execname()

  epoll_times[c,p] ++
  epoll_count[c,p] += $return

  if($return == 0 && pid() in epoll_timeout_flag) {
    epoll_timeouts[c,p] ++
    delete epoll_timeout_flag[pid()]
  }

  if(!([c, p] in epoll_min)) {
    epoll_min[c,p] = $return
  } else if($return < epoll_min[c,p]) {
    epoll_min[c,p] = $return
  }

  if($return > epoll_max[c,p]) {
    epoll_max[c,p] = $return
  }
}

probe timer.s($1) {
  printf ("%4s %45s %10s %10s %10s %10s %10s %10s\n", "cpu", "process",
          "times", "revents", "min", "max", "avg", "timeouts")
  foreach ([cpu, process] in epoll_count- limit 20) {
      all_epoll_times += epoll_times[cpu,process]
      all_epoll_count += epoll_count[cpu,process]
      all_epoll_timeouts += epoll_timeouts[cpu,process]
  }
  printf ("%4s %45s %10d %10d %10s %10s %10d %10d\n",
    "all", "", all_epoll_times, all_epoll_count, "-", "-",
    all_epoll_count == 0 ? 0 : all_epoll_count/all_epoll_times, all_epoll_timeouts)

  foreach ([cpu, process] in epoll_count- limit 20) {
     printf ("[%2d] %45s %10d %10d %10d %10d %10d %10d\n",
        cpu, process, epoll_times[cpu, process], epoll_count[cpu, process],
        epoll_min[cpu, process], epoll_max[cpu, process],
        epoll_count[cpu,process]/epoll_times[cpu,process],
        epoll_timeouts[cpu,process])
  }
  delete epoll_count
  delete epoll_min
  delete epoll_max
  delete epoll_times
  delete epoll_timeouts
  printf ("--------------------------------------------------------------------------\n\n")
}

------------------------------------------------
ehttpd.erl
-------------------------------------------------
-module(ehttpd).
-compile(export_all).

start() ->
    start(8888).
start(Port) ->
    N = erlang:system_info(schedulers),
    listen(Port, N),
    io:format("ehttpd ready with ~b schedulers on port ~b~n", [N, Port]),

    register(?MODULE, self()),
    receive Any -> io:format("~p~n", [Any]) end.  %% to stop: ehttpd ! stop.
listen(Port, N) ->
    Opts = [inet,
            binary,
            {active, false},
            {recbuf, 8192},
            {nodelay, true},
            {reuseaddr, true}],

    {ok, S} = gen_tcp:listen(Port, Opts),
    lists:foreach(fun(I) -> spawn_opt(?MODULE, accept, [S, I],
                                      [{scheduler, I}]) end,
                  lists:seq(1, N)).

accept(S, I) ->
    case gen_tcp:accept(S) of
        {ok, Socket} ->
            spawn_opt(?MODULE, accept, [S, I], [{scheduler, I}]),
            io:format("new connection @~p~n", [I]),
            loop(Socket, <<>>);
        Error -> erlang:error(Error)
    end.

loop(S, B) ->
    inet:setopts(S, [{active, once}]),
    receive
        {tcp, S, Data} ->
            B1 = <<B/binary, Data/binary>>,
            %% a request is complete once the buffer ends with an empty line
            case binary:part(B1, {byte_size(B1), -4}) of
                <<"\r\n\r\n">> ->
                    Response = <<"HTTP/1.1 200 OK\r\nContent-Length: 12\r\nConnection: Keep-Alive\r\n\r\nhello world!">>,
                    gen_tcp:send(S, Response),
                    loop(S, <<>>);
                _ ->
                    loop(S, B1)
            end;
        {tcp_closed, S} ->
            io:format("connection closed forced~n"),
            gen_tcp:close(S);
        Error ->
            io:format("unexpected message~p~n", [Error]),
            Error
    end.


--
Best,

Wei Cao
_______________________________________________
erlang-questions mailing list
erlang-questions@erlang.org
http://erlang.org/mailman/listinfo/erlang-questions
<span class="Apple-style-span" style="border-collapse: separate; color: rgb(0, 0, 0); font-family: Helvetica; font-style: normal; font-variant: normal; font-weight: normal; letter-spacing: normal; line-height: normal; orphans: 2; text-align: auto; text-indent: 0px; text-transform: none; white-space: normal; widows: 2; word-spacing: 0px; -webkit-border-horizontal-spacing: 0px; -webkit-border-vertical-spacing: 0px; -webkit-text-decorations-in-effect: none; -webkit-text-size-adjust: auto; -webkit-text-stroke-width: 0px; font-size: medium; "><div><br></div></span></div></div></body></html>