[erlang-questions] Improve performance of an IO-bound server written in Erlang by having a pollset for each scheduler and binding ports to schedulers together with processes

Zabrane Mickael <>
Tue Jul 10 23:36:56 CEST 2012


I don't think so!

When I tested Wei's code the first time, I suppressed all io:format statements and made
the inner loop a lot simpler. Something like this:

[...]
loop(S,B) ->
    inet:setopts(S, [{active, once}]),
    receive
        {tcp, S, _Data} ->
            Response = <<"HTTP/1.1 200 OK\r\nContent-Length: 12\r\nConnection: Keep-Alive\r\n\r\nhello world!">>,
            gen_tcp:send(S, Response),
            loop(S, <<>>)
[...]

It didn't change anything in my case.

Hope this helps!

Regards,
Zabrane

On Jul 10, 2012, at 11:14 PM, erlang wrote:

> Hi all (first-time poster),
> To start, I'm sorry for my English.
> I think the problem is on all the lines where you use "io:format".
> When I started writing my system in Erlang, I used io:format many times for debugging. It was a big bottleneck.
> 
> 
> JanM
> 
> 
> On 2012-07-09 11:01, Wei Cao wrote:
>> Hi, all
>> 
>> We wrote a proxy server in Erlang. The proxy's logic is rather simple:
>> it listens on some TCP port, accepts new connections from user
>> clients, and, after authentication, forwards packets back and forth
>> between the client and the backend server until the connection is closed.
>> 
>> It's very easy to write such a proxy in Erlang: fork a process for
>> each new user connection and connect to the backend server in the same
>> process. The process works like a pipe: sockets on both sides are set
>> to active-once mode, and whenever a TCP packet is received on one
>> socket, it is sent out on the other. (A simplified version of the
>> proxy code is attached at the end of this mail; a condensed sketch of
>> the relay loop follows.)
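>> 
>> In essence, the per-connection relay is just an active-once receive
>> loop. Here is a condensed sketch of what loop1/2 and active/2 in the
>> attached proxy.erl do (names match the appendix code):
>> 
>>     %% Re-arm {active, once} on both sockets, then forward whatever
>>     %% arrives on one socket to the other until either side closes.
>>     loop1(S1, S2) ->
>>         inet:setopts(S1, [{active, once}]),
>>         inet:setopts(S2, [{active, once}]),
>>         receive
>>             {tcp, S1, Data} -> gen_tcp:send(S2, Data), loop1(S1, S2);
>>             {tcp, S2, Data} -> gen_tcp:send(S1, Data), loop1(S1, S2);
>>             {tcp_closed, _} -> gen_tcp:close(S1), gen_tcp:close(S2)
>>         end.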
>> 
>> However, the performance is not quite satisfying: the proxy can handle
>> a maximum of only about 40k requests per second on our 16-core machine
>> (Intel Xeon L5630, 2.13GHz) under heavy stress (100 concurrent
>> clients). We then analyzed the behavior of beam.smp using tools like
>> tcprstat, mpstat, perf, and SystemTap.
>> 
>> tcprstat shows QPS is about 40k, with an average latency of 1.7
>> milliseconds per request:
>> 
>> timestamp	count	max	min	avg	med	stddev	95_max	95_avg	95_std	99_max	99_avg	99_std
>> 1341813326	39416	17873	953	1718	1519	724	2919	1609	340	3813	1674	462
>> 1341813327	40528	9275	884	1645	1493	491	2777	1559	290	3508	1619	409
>> 1341813328	40009	18105	925	1694	1507	714	2868	1586	328	3753	1650	450
>> 
>> 
>> mpstat shows about 30% of CPU time is idle:
>> 
>> 03:30:19 PM  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest   %idle
>> 03:30:20 PM  all   38.69    0.00   21.92    0.00    0.06    7.52    0.00    0.00   31.80
>> 03:30:21 PM  all   37.56    0.00   21.99    0.00    0.00    7.50    0.00    0.00   32.95
>> 
>> and perf top shows that much time is wasted in scheduler_wait, in spin wait I guess:
>> 
>> 9320.00 19.8% scheduler_wait   /home/mingsong.cw/erlangr16b/lib/erlang/erts-5.10/bin/beam.smp
>> 1813.00  3.9% process_main     /home/mingsong.cw/erlangr16b/lib/erlang/erts-5.10/bin/beam.smp
>> 1379.00  2.9% _spin_lock       /usr/lib/debug/lib/modules/2.6.32-131.21.1.tb477.el6.x86_64/vmlinux
>> 1201.00  2.6% schedule         /home/mingsong.cw/erlangr16b/lib/erlang/erts-5.10/bin/beam.smp
>> 
>> It seems the performance may be associated with scheduler_wait() and
>> erts_check_io(). With a SystemTap script (attached at the end of this
>> mail), we can find out how many times the epoll_wait system call is
>> invoked by beam.smp and how many revents it gets each time.
>> 
>>  cpu       process      times    revents        min        max        avg   timeouts
>>  all                     1754     128042          -          -         73          3
>> [14]      beam.smp        151      14127         82         97         93          0
>> [ 5]      beam.smp        142      13291         83         97         93          0
>> [13]      beam.smp        127      11948         86         96         94          0
>> [ 6]      beam.smp        127      11836         81         96         93          0
>> [ 4]      beam.smp        121      11323         81         96         93          0
>> [15]      beam.smp        117      10935         83         96         93          0
>> [12]      beam.smp        486      10128          0         96         20          2
>> [ 1]      beam.smp         71       6549         71        100         92          0
>> [ 2]      beam.smp         62       5695         82         96         91          0
>> [ 7]      beam.smp         55       5102         81         95         92          0
>> [11]      beam.smp         52       4822         85         95         92          0
>> [ 9]      beam.smp         52       4799         85         95         92          0
>> [ 8]      beam.smp         51       4680         78         95         91          0
>> [10]      beam.smp         49       4508         85         97         92          0
>> [ 3]      beam.smp         46       4211         81         95         91          0
>> [ 0]      beam.smp         44       4088         83         95         92          0
>> 
>> The results show that epoll_wait is invoked 1754 times per second and
>> gets 73 io events on average. This is unacceptable for a
>> high-performance server: if epoll_wait is invoked no more than 2k
>> times per second, each poll cycle takes at least 0.5 ms (1/2000 s), so
>> reading/writing a packet incurs that much extra delay, which hurts
>> latency and ultimately limits throughput.
>> 
>> The problem is that there is only one global pollset system-wide, so
>> at any given time at most one scheduler can call erts_check_io() to
>> obtain pending io tasks from the underlying pollset, and no scheduler
>> can call erts_check_io() again before all pending io tasks are
>> processed. For an IO-bound application it is therefore very likely
>> that a scheduler finishes its own job but must wait idly for the other
>> schedulers to complete theirs.
>> 
>> Hence, we developed a patch to solve this problem by having a pollset
>> for each scheduler, so that each scheduler can invoke erts_check_io()
>> on its own pollset concurrently. After a scheduler completes its
>> tasks, it can invoke erts_check_io() immediately, no matter what state
>> the other schedulers are in. The patch also handles port migration:
>> the file descriptors used by each port are recorded, and when a port
>> is migrated these fds are removed from the original scheduler's
>> pollset and added to the new scheduler's.
>> 
>> Binding a port to a scheduler together with its process also helps
>> performance: it reduces the cost of thread switches and
>> synchronization, and a bound port won't be migrated between schedulers.
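>> 
>> The process half of this binding is already visible in the attached
>> code: both proxy.erl and ehttpd.erl pin one acceptor process per
>> scheduler with spawn_opt/4. A minimal sketch follows; the assumption
>> (not spelled out above) is that, with the bind_port_to_scheduler
>> patch, ports opened by such a bound process then stay on that
>> scheduler's pollset instead of migrating:
>> 
>>     %% Spawn one acceptor per scheduler and pin each to scheduler I.
>>     %% Under the patch, sockets accepted/opened by a bound process are
>>     %% assumed to be polled by that same scheduler.
>>     N = erlang:system_info(schedulers),
>>     [spawn_opt(?MODULE, accept, [Listen, I], [{scheduler, I}]) || I <- lists:seq(1, N)].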
>> 
>> After applying the two patches, under the same pressure (100 concurrent
>> clients), epoll_wait is invoked 49332 times per second and gets 3
>> revents each time on average; that is to say, our server responds more
>> quickly and becomes more real-time.
>> 
>>  cpu       process      times    revents        min        max        avg   timeouts
>>  all                    49332     217186          -          -          4          3
>> [ 2]      beam.smp       3219      16050          2          7          4          0
>> [11]      beam.smp       4275      16033          1          6          3          0
>> [ 8]      beam.smp       4240      15992          1          6          3          0
>> [ 9]      beam.smp       4316      15964          0          6          3          2
>> [10]      beam.smp       4139      15851          1          6          3          0
>> [ 3]      beam.smp       4256      15816          1          6          3          0
>> [ 1]      beam.smp       3107      15800          2          7          5          0
>> [ 0]      beam.smp       3727      15259          1          6          4          0
>> [ 7]      beam.smp       2810      14722          3          7          5          0
>> [13]      beam.smp       1981      11499          4          7          5          0
>> [15]      beam.smp       2285      10942          3          6          4          0
>> [14]      beam.smp       2258      10866          3          6          4          0
>> [ 4]      beam.smp       2246      10849          3          6          4          0
>> [ 6]      beam.smp       2206      10730          3          6          4          0
>> [12]      beam.smp       2173      10573          3          6          4          0
>> [ 5]      beam.smp       2093      10240          3          6          4          0
>> 
>> scheduler_wait no longer takes so much time now:
>> 
>>               169.00  6.2% process_main                    beam.smp
>>                55.00  2.0% _spin_lock                      [kernel]
>>                45.00  1.7% driver_deliver_term             beam.smp
>> 
>> and idle CPU time has dropped as well:
>> 04:30:44 PM  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest   %idle
>> 04:30:45 PM  all   60.34    0.00   21.44    0.00    0.06   16.45    0.00    0.00    1.71
>> 04:30:46 PM  all   60.99    0.00   21.22    0.00    0.00   16.26    0.00    0.00    1.52
>> 
>> and tcprstat shows QPS reaching 100K, with latency below 1 millisecond:
>> 
>> timestamp	count	max	min	avg	med	stddev	95_max	95_avg	95_std	99_max	99_avg	99_std
>> 1341822689	96078	11592	314	910	817	311	1447	869	228	1777	897	263
>> 1341822690	100651	24245	209	914	819	381	1458	870	229	1800	899	263
>> 
>> I also wrote an extremely simple keep-alive HTTP server (attached at
>> the end of this mail) to compare performance before and after applying
>> the patches. Measured with the Apache ab tool (ab -n 1000000 -c 100 -k),
>> a 30% performance gain can be obtained.
>> 
>> before
>> Requests per second:    103671.70 [#/sec] (mean)
>> Time per request:       0.965 [ms] (mean)
>> 
>> after
>> Requests per second:    133701.24 [#/sec] (mean)
>> Time per request:       0.748 [ms] (mean)
>> 
>> Patches can be found on GitHub; compile with
>> ./configure CFLAGS=-DERTS_POLLSET_PER_SCHEDULER
>> 
>> Pollset per scheduler:
>> 
>> git fetch git://github.com/weicao/otp.git pollset_per_scheduler
>> 
>> https://github.com/weicao/otp/compare/weicao:master...weicao:pollset_per_scheduler
>> https://github.com/weicao/otp/compare/weicao:master...weicao:pollset_per_scheduler.patch
>> 
>> 
>> Bind port to scheduler:
>> 
>> git fetch git://github.com/weicao/otp.git bind_port_to_scheduler
>> 
>> https://github.com/weicao/otp/compare/weicao:pollset_per_scheduler...weicao:bind_port_to_scheduler
>> https://github.com/weicao/otp/compare/weicao:pollset_per_scheduler...weicao:bind_port_to_scheduler.patch
>> 
>> 
>> Appendix:
>> 
>> -----------------------------------
>> proxy.erl
>> ------------------------------------
>> 
>> -module(proxy).
>> -compile(export_all).
>> 
>> -define(RECBUF_SIZE, 8192).
>> -define(ACCEPT_TIMEOUT, 2000).
>> 
>> start([MyPortAtom, DestIpAtom, DestPortAtom]) ->
>>     {MyPort, []} =  string:to_integer(atom_to_list(MyPortAtom)),
>>     DestIp = atom_to_list(DestIpAtom),
>>     {DestPort, []} = string:to_integer(atom_to_list(DestPortAtom)),
>>     listen(MyPort, DestIp, DestPort),
>>     receive Any ->
>>         io:format("recv ~p~n", [Any])
>>     end.
>> 
>> listen(MyPort, DestIp, DestPort) ->
>>     io:format("start proxy on [local] 0.0.0.0:~p -> [remote] ~p:~p~n",
>>               [MyPort, DestIp, DestPort]),
>>     case gen_tcp:listen(MyPort,
>>            [inet,
>>             {ip, {0,0,0,0}},
>>             binary,
>>             {reuseaddr, true},
>>             {recbuf, ?RECBUF_SIZE},
>>             {active, false},
>>             {nodelay, true}
>>            ]) of
>>         {ok, Listen} ->
>>             N = erlang:system_info(schedulers),
>>             lists:foreach(fun(I) -> accept(Listen, DestIp, DestPort, I) end,
>>                           lists:seq(1,N));
>>         {error, Reason} ->
>>             io:format("error listen ~p~n", [Reason])
>>     end.
>> 
>> accept(Listen, DestIp, DestPort, I) ->
>>     spawn_opt(?MODULE, loop, [Listen, DestIp, DestPort, I], [{scheduler, I}]).
>> 
>> loop(Listen, DestIp, DestPort, I) ->
>>     case gen_tcp:accept(Listen, ?ACCEPT_TIMEOUT) of
>>         {ok, S1} ->
>>             accept(Listen, DestIp, DestPort, I),
>>             case catch gen_tcp:connect(DestIp, DestPort,
>>                            [inet, binary, {active, false},
>>                             {reuseaddr, true}, {nodelay, true}]) of
>>                 {ok, S2} ->
>>                     io:format("new connection~n"),
>>                     loop1(S1, S2);
>>                 {error, Reason} ->
>>                     io:format("error connect ~p~n", [Reason])
>>             end;
>>         {error, timeout} ->
>>             loop(Listen, DestIp, DestPort, I);
>>         Error ->
>>             io:format("error accept ~p~n", [Error]),
>>             accept(Listen, DestIp, DestPort, I)
>>     end.
>> 
>> loop1(S1, S2) ->
>>     active(S1, S2),
>>     receive
>>         {tcp, S1, Data} ->
>>             gen_tcp:send(S2, Data),
>>             loop1(S1, S2);
>>         {tcp, S2, Data} ->
>>             gen_tcp:send(S1, Data),
>>             loop1(S1, S2);
>>         {tcp_closed, S1} ->
>>             io:format("S1 close~n"),
>>             gen_tcp:close(S1),
>>             gen_tcp:close(S2);
>>         {tcp_closed, S2} ->
>>             io:format("S2 close~n"),
>>             gen_tcp:close(S1),
>>             gen_tcp:close(S2)
>>     end.
>> 
>> active(S1,S2)->
>>     inet:setopts(S1, [{active, once}]),
>>     inet:setopts(S2, [{active, once}]).
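>> 
>> For reference, a hypothetical way to start this proxy from the Erlang
>> shell (the local port and backend address below are made-up examples,
>> not values from the measurements above; start/1 takes atoms because it
>> is meant to be invoked via erl -s):
>> 
>>     %% Listen on local port 8001 and forward to a backend at 127.0.0.1:3306
>>     %% (illustrative values only). The call blocks in start/1's receive.
>>     proxy:start(['8001', '127.0.0.1', '3306']).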
>> 
>> -----------------------------------
>> epollwait.stp
>> ------------------------------------
>> #! /usr/bin/env stap
>> #
>> #
>> 
>> global epoll_timeout_flag, epoll_count, epoll_min, epoll_max, epoll_times, epoll_timeouts
>> 
>> probe  syscall.epoll_wait {
>>     if(timeout > 0) {
>>         epoll_timeout_flag[pid()] = 1
>>     }
>> }
>> 
>> probe  syscall.epoll_wait.return {
>>   c = cpu()
>>   p = execname()
>> 
>>   epoll_times[c,p] ++
>>   epoll_count[c,p] += $return
>> 
>>   if($return == 0 && pid() in epoll_timeout_flag) {
>>     epoll_timeouts[c,p] ++
>>     delete epoll_timeout_flag[pid()]
>>   }
>> 
>>   if(!([c, p] in epoll_min)) {
>>     epoll_min[c,p] = $return
>>   } else if($return < epoll_min[c,p]) {
>>     epoll_min[c,p] = $return
>>   }
>> 
>> 
>>   if($return > epoll_max[c,p]) {
>>     epoll_max[c,p] = $return
>>   }
>> }
>> 
>> probe timer.s($1) {
>>   printf ("%4s %45s %10s %10s %10s %10s %10s %10s\n", "cpu",
>> "process", "times", "revents", "min", "max", "avg", "timeouts" )
>>   foreach ([cpu, process] in epoll_count- limit 20) {
>>       all_epoll_times += epoll_times[cpu,process]
>>       all_epoll_count += epoll_count[cpu,process]
>>       all_epoll_timeouts += epoll_timeouts[cpu,process]
>>   }
>>   printf ("%4s %45s %10d %10d %10s %10s %10d %10d\n",
>>     "all", "", all_epoll_times, all_epoll_count, "-", "-",
>>     all_epoll_count == 0? 0:all_epoll_count/all_epoll_times, all_epoll_timeouts)
>> 
>>   foreach ([cpu, process] in epoll_count- limit 20) {
>>      printf ("[%2d] %45s %10d %10d %10d %10d %10d %10d\n",
>>         cpu, process, epoll_times[cpu, process], epoll_count[cpu, process],
>>         epoll_min[cpu, process], epoll_max[cpu, process],
>>         epoll_count[cpu,process]/epoll_times[cpu,process],
>>         epoll_timeouts[cpu,process])
>>   }
>>   delete epoll_count
>>   delete epoll_min
>>   delete epoll_max
>>   delete epoll_times
>>   delete epoll_timeouts
>>   printf ("--------------------------------------------------------------------------\n\n"
>> )
>> }
>> 
>> ------------------------------------------------
>> ehttpd.erl
>> -------------------------------------------------
>> -module(ehttpd).
>> -compile(export_all).
>> 
>> start() ->
>>     start(8888).
>> start(Port) ->
>>     N = erlang:system_info(schedulers),
>>     listen(Port, N),
>>     io:format("ehttpd ready with ~b schedulers on port ~b~n", [N, Port]),
>> 
>>     register(?MODULE, self()),
>>     receive Any -> io:format("~p~n", [Any]) end.  %% to stop: ehttpd!stop.
>> 
>> listen(Port, N) ->
>>     Opts = [inet,
>>             binary,
>>             {active, false},
>>             {recbuf, 8192},
>>             {nodelay,true},
>>             {reuseaddr, true}],
>> 
>>     {ok, S} = gen_tcp:listen(Port, Opts),
>>     lists:foreach(fun(I) -> spawn_opt(?MODULE, accept, [S, I], [{scheduler, I}]) end,
>>                   lists:seq(1, N)).
>> 
>> accept(S, I) ->
>>     case gen_tcp:accept(S) of
>>         {ok, Socket} ->
>>             spawn_opt(?MODULE, accept, [S, I],[{scheduler,I}] ),
>>             io:format("new connection @~p~n", [I]),
>>             loop(Socket,<<>>);
>>         Error    -> erlang:error(Error)
>>     end.
>> 
>> loop(S,B) ->
>>     inet:setopts(S, [{active, once}]),
>>     receive
>>         {tcp, S, Data} ->
>>             B1 = <<B/binary, Data/binary>>,
>>             case binary:part(B1,{byte_size(B1), -4}) of
>>                 <<"\r\n\r\n">> ->
>>                     Response = <<"HTTP/1.1 200 OK\r\nContent-Length: 12\r\nConnection: Keep-Alive\r\n\r\nhello world!">>,
>>                     gen_tcp:send(S, Response),
>>                     loop(S, <<>>);
>>                 _ ->
>>                     loop(S, B1)
>>             end;
>>         {tcp_closed, S} ->
>>             io:format("connection closed forced~n"),
>>             gen_tcp:close(S);
>>         Error ->
>>             io:format("unexpected message~p~n", [Error]),
>>             Error
>>     end.
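>> 
>> To reproduce the ab numbers above, start the server (it listens on
>> port 8888 by default, one acceptor per scheduler) and point ab at it,
>> e.g. ab -n 1000000 -c 100 -k http://<host>:8888/ (the host was not
>> specified above). From the Erlang shell:
>> 
>>     %% Start ehttpd on the default port 8888. The call blocks until the
>>     %% registered process receives a message, e.g. ehttpd ! stop.
>>     ehttpd:start().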
>> 
>> 
> 
> 


