[erlang-questions] Improve performance of IO bounded server written in Erlang via having pollset for each scheduler and bind port to scheduler together with process
Zabrane Mickael
zabrane3@REDACTED
Tue Jul 10 11:08:13 CEST 2012
On Jul 10, 2012, at 4:30 AM, Wei Cao wrote:
> My tests were all keepalived/persistent connections, and there was
> significant performance gain in these situations.
Of course I enabled keepalive on my test.
> but I wrote a http server with short connection just now, the performance did drop down,
> I'll find it out later :).
Great. Let me know if you make any progress.
Regards,
Zabrane
>
> 2012/7/10 Zabrane Mickael <zabrane3@REDACTED>:
>> Hi,
>>
>> Performance of our HTTP Web Server drops down after applying your patches.
>>
>> Box: Linux F17, 4GB of RAM:
>> $ lscpu
>> Architecture: i686
>> CPU op-mode(s): 32-bit, 64-bit
>> Byte Order: Little Endian
>> CPU(s): 4
>> On-line CPU(s) list: 0-3
>> Thread(s) per core: 1
>> Core(s) per socket: 4
>> Socket(s): 1
>> Vendor ID: GenuineIntel
>> CPU family: 6
>> Model: 23
>> Stepping: 7
>> CPU MHz: 2499.772
>> BogoMIPS: 4999.54
>> Virtualization: VT-x
>> L1d cache: 32K
>> L1i cache: 32K
>> L2 cache: 3072K
>>
>> Bench:
>> before: 77787 rps
>> after: 53056 rps
>>
>> Any hint to explain this result Wei ?
>>
>> Regards,
>> Zabrane
>>
>>
>> On Jul 9, 2012, at 11:01 AM, Wei Cao wrote:
>>
>> Hi, all
>>
>> We wrote a proxy server in Erlang, the proxy's logic is rather simple,
>> it listens on some TCP port, establishes new connection from user
>> client, forward packets back and forth between the client and backend
>> server after authentication until connection is closed.
>>
>> It's very easy to write such a proxy in Erlang, fork a process for
>> each new user connection and connect to the backend server in the same
>> process, the process works like a pipe, sockets from both side is set
>> to the active once mode, whenever a tcp packet is received from one
>> socket, the packet will be sent to other socket. (A simplified version
>> of proxy code is attached at the end of the mail)
>>
>> However, the performance is not quite satisfying, the proxy can handle
>> maximum only 40k requests on our 16 core machine(Intel Xeon L5630,
>> 2.13GHz) with heavy stress(100 concurrent clients). We then analyzed
>> the behavior of beam.smp use tools like tcprstat, mpstat, perf, and
>> SystemTap.
>>
>> tcprstat shows QPS is about 40k, have average 1.7 millisecond latency
>> for each request.
>>
>> timestamp count max min avg med stddev 95_max 95_avg 95_std 99_max 99_avg
>> 99_std
>> 1341813326 39416 17873 953 1718 1519 724 2919 1609 340 3813 1674 462
>> 1341813327 40528 9275 884 1645 1493 491 2777 1559 290 3508 1619 409
>> 1341813328 40009 18105 925 1694 1507 714 2868 1586 328 3753 1650 450
>>
>>
>> mpstat shows 30% CPU is idle,
>>
>> 03:30:19 PM CPU %usr %nice %sys %iowait %irq %soft
>> %steal %guest %idle
>> 03:30:20 PM all 38.69 0.00 21.92 0.00 0.06 7.52
>> 0.00 0.00 31.80
>> 03:30:21 PM all 37.56 0.00 21.99 0.00 0.00 7.50
>> 0.00 0.00 32.95
>>
>> and perf top shows, much time is wasted in scheduler_wait, in spin wait I
>> guess.
>>
>> 9320.00 19.8% scheduler_wait
>> /home/mingsong.cw/erlangr16b/lib/erlang/erts-5.10/bin/beam.smp
>> 1813.00 3.9% process_main
>> /home/mingsong.cw/erlangr16b/lib/erlang/erts-5.10/bin/beam.smp
>> 1379.00 2.9% _spin_lock
>> /usr/lib/debug/lib/modules/2.6.32-131.21.1.tb477.el6.x86_64/vmlinux
>> 1201.00 2.6% schedule
>> /home/mingsong.cw/erlangr16b/lib/erlang/erts-5.10/bin/beam.smp
>>
>> It seems the performance may be associated with scheduler_wait() and
>> erts_check_io(), with a SystemTap script(attached at the end of this
>> mail), we can find out how many times the system call epoll_wait is
>> invoked by beam.smp and each time, how many revents it gets.
>>
>> cpu process times
>> revents min max avg timeouts
>> all
>> 1754 128042 - - 73 3
>> [14] beam.smp 151
>> 14127 82 97 93 0
>> [ 5] beam.smp 142
>> 13291 83 97 93 0
>> [13] beam.smp 127
>> 11948 86 96 94 0
>> [ 6] beam.smp 127
>> 11836 81 96 93 0
>> [ 4] beam.smp 121
>> 11323 81 96 93 0
>> [15] beam.smp 117
>> 10935 83 96 93 0
>> [12] beam.smp 486
>> 10128 0 96 20 2
>> [ 1] beam.smp 71
>> 6549 71 100 92 0
>> [ 2] beam.smp 62
>> 5695 82 96 91 0
>> [ 7] beam.smp 55
>> 5102 81 95 92 0
>> [11] beam.smp 52
>> 4822 85 95 92 0
>> [ 9] beam.smp 52
>> 4799 85 95 92 0
>> [ 8] beam.smp 51
>> 4680 78 95 91 0
>> [10] beam.smp 49
>> 4508 85 97 92 0
>> [ 3] beam.smp 46
>> 4211 81 95 91 0
>> [ 0] beam.smp 44
>> 4088 83 95 92 0
>>
>> The resuls shows, epoll_wait is invoked 1754 times each second, and
>> get 73 io events in average. This is unacceptable for writing high
>> performance server. Because if epoll_wait is invoked no more than 2k
>> times per second, then read/write a packet would cost more than 500ms,
>> which causes long delay and affects the throughput finally.
>>
>> The problem relies on there is only one global pollset in system wide,
>> so at a time there is no more than one scheduler can call
>> erts_check_io() to obtain pending io tasks from underlying pollset,
>> and no scheduler can call erts_check_io() before all pending io
>> tasks're processed, so for IO bounded application, it's very likely
>> that a scheduler finish its own job, but must wait idly for other
>> schedulers to complete theirs.
>>
>> Hence, we develops a patch to slove this problem, by having a pollset
>> for each scheduler, so that each scheduler can invoke erts_check_io()
>> on its own pollset concurrently. After a scheduler complete its tasks,
>> it can invoke erts_check_io() immediately no matter what state other
>> schedulers're in. This patch also handles port migration situation,
>> all used file descriptors in each port're recorded, when a port is
>> migrated, these
>> fd 're removed from original scheduler's pollset, and added to new
>> scheduler's.
>>
>> Bind port to scheduler together with process is also helpful to
>> performance, it reduces the cost of thread switches and
>> synchronization, and bound port won't be migrated between schedulers.
>>
>> After apply the two patches, with the same pressure(100 concurrent
>> clients),epoll_wait is invoked 49332 times per second, and get 3
>> revents each time in average, that is to say, our server responds
>> quicker and become more realtime.
>>
>> cpu process times
>> revents min max avg timeouts
>> all
>> 49332 217186 - - 4 3
>> [ 2] beam.smp 3219
>> 16050 2 7 4 0
>> [11] beam.smp 4275
>> 16033 1 6 3 0
>> [ 8] beam.smp 4240
>> 15992 1 6 3 0
>> [ 9] beam.smp 4316
>> 15964 0 6 3 2
>> [10] beam.smp 4139
>> 15851 1 6 3 0
>> [ 3] beam.smp 4256
>> 15816 1 6 3 0
>> [ 1] beam.smp 3107
>> 15800 2 7 5 0
>> [ 0] beam.smp 3727
>> 15259 1 6 4 0
>> [ 7] beam.smp 2810
>> 14722 3 7 5 0
>> [13] beam.smp 1981
>> 11499 4 7 5 0
>> [15] beam.smp 2285
>> 10942 3 6 4 0
>> [14] beam.smp 2258
>> 10866 3 6 4 0
>> [ 4] beam.smp 2246
>> 10849 3 6 4 0
>> [ 6] beam.smp 2206
>> 10730 3 6 4 0
>> [12] beam.smp 2173
>> 10573 3 6 4 0
>> [ 5] beam.smp 2093
>> 10240 3 6 4 0
>>
>> scheduler_wait no longer take so much time now,
>>
>> 169.00 6.2% process_main beam.smp
>> 55.00 2.0% _spin_lock [kernel]
>> 45.00 1.7% driver_deliver_term beam.smp
>>
>> so is idle CPU time
>> 04:30:44 PM CPU %usr %nice %sys %iowait %irq %soft
>> %steal %guest %idle
>> 04:30:45 PM all 60.34 0.00 21.44 0.00 0.06 16.45
>> 0.00 0.00 1.71
>> 04:30:46 PM all 60.99 0.00 21.22 0.00 0.00 16.26
>> 0.00 0.00 1.52
>>
>> and tcprstat shows, QPS is getting 100K, latency is less than 1 millisecond
>>
>> timestamp count max min avg med stddev 95_max 95_avg 95_std 99_max 99_avg
>> 99_std
>> 1341822689 96078 11592 314 910 817 311 1447 869 228 1777 897 263
>> 1341822690 100651 24245 209 914 819 381 1458 870 229 1800 899 263
>>
>> I also write a extreamly simple keep-alive http server(attached at the
>> end of mail), to compare performance before and after applying the
>> patches, mearused with apache ab tool(ab -n 1000000 -c 100 -k ), a 30%
>> performance gain can be get.
>>
>> before
>> Requests per second: 103671.70 [#/sec] (mean)
>> Time per request: 0.965 [ms] (mean)
>>
>> after
>> Requests per second: 133701.24 [#/sec] (mean)
>> Time per request: 0.748 [ms] (mean)
>>
>> Patches can be found at github, compile with
>> ./configure CFLAGS=-DERTS_POLLSET_PER_SCHEDULER
>>
>> Pollset per scheduler:
>>
>> git fetch git://github.com/weicao/otp.git pollset_per_scheduler
>>
>> https://github.com/weicao/otp/compare/weicao:master...weicao:pollset_per_scheduler
>> https://github.com/weicao/otp/compare/weicao:master...weicao:pollset_per_scheduler.patch
>>
>>
>> Bind port to scheduler:
>>
>> git fetch git://github.com/weicao/otp.git bind_port_to_scheduler
>>
>> https://github.com/weicao/otp/compare/weicao:pollset_per_scheduler...weicao:bind_port_to_scheduler
>> https://github.com/weicao/otp/compare/weicao:pollset_per_scheduler...weicao:bind_port_to_scheduler.patch
>>
>>
>> Appendix:
>>
>> -----------------------------------
>> proxy.erl
>> ------------------------------------
>>
>> -module(proxy).
>> -compile(export_all).
>>
>> -define(RECBUF_SIZE, 8192).
>> -define(ACCEPT_TIMEOUT, 2000).
>>
>> start([MyPortAtom, DestIpAtom, DestPortAtom]) ->
>> {MyPort, []} = string:to_integer(atom_to_list(MyPortAtom)),
>> DestIp = atom_to_list(DestIpAtom),
>> {DestPort, []} = string:to_integer(atom_to_list(DestPortAtom)),
>> listen(MyPort, DestIp, DestPort),
>> receive Any ->
>> io:format("recv ~p~n", [Any])
>> end.
>>
>> listen(MyPort, DestIp, DestPort) ->
>> io:format("start proxy on [local] 0.0.0.0:~p -> [remote] ~p:~p~n",
>> [MyPort, DestIp, DestPort]),
>> case gen_tcp:listen(MyPort,
>> [inet,
>> {ip, {0,0,0,0}},
>> binary,
>> {reuseaddr, true},
>> {recbuf, ?RECBUF_SIZE},
>> {active, false},
>> {nodelay, true}
>> ]) of
>> {ok, Listen} ->
>> N = erlang:system_info(schedulers),
>> lists:foreach(fun(I) -> accept(Listen, DestIp, DestPort,
>> I) end, lists:seq(1,N));
>> {error, Reason} ->
>> io:format("error listen ~p~n", [Reason])
>> end.
>>
>> accept(Listen, DestIp, DestPort, I) ->
>> spawn_opt(?MODULE, loop, [Listen, DestIp, DestPort, I], [{scheduler,
>> I}]).
>>
>> loop(Listen, DestIp, DestPort, I) ->
>> case gen_tcp:accept(Listen, ?ACCEPT_TIMEOUT) of
>> {ok, S1} ->
>> accept(Listen, DestIp, DestPort, I),
>> case catch gen_tcp:connect(DestIp, DestPort,
>> [inet, binary, {active, false},
>> {reuseaddr, true}, {nodelay, true}]) of
>> {ok, S2} ->
>> io:format("new connection~n"),
>> loop1(S1, S2);
>> {error, Reason} ->
>> io:format("error connect ~p~n", [Reason])
>> end;
>> {error, timeout} ->
>> loop(Listen, DestIp, DestPort, I);
>> Error ->
>> io:format("error accept ~p~n", [Error]),
>> accept(Listen, DestIp, DestPort, I)
>> end.
>>
>> loop1(S1, S2) ->
>> active(S1, S2),
>> receive
>> {tcp, S1, Data} ->
>> gen_tcp:send(S2, Data),
>> loop1(S1, S2);
>> {tcp, S2, Data} ->
>> gen_tcp:send(S1, Data),
>> loop1(S1, S2);
>> {tcp_closed, S1} ->
>> io:format("S1 close~n"),
>> gen_tcp:close(S1),
>> gen_tcp:close(S2);
>> {tcp_closed, S2} ->
>> io:format("S2 close~n"),
>> gen_tcp:close(S1),
>> gen_tcp:close(S2)
>> end.
>>
>> active(S1,S2)->
>> inet:setopts(S1, [{active, once}]),
>> inet:setopts(S2, [{active, once}]).
>>
>> -----------------------------------
>> epollwait.stp
>> ------------------------------------
>> #! /usr/bin/env stap
>> #
>> #
>>
>> global epoll_timeout_flag, epoll_count, epoll_min, epoll_max,
>> epoll_times, epoll_timeouts
>>
>> probe syscall.epoll_wait {
>> if(timeout > 0) {
>> epoll_timeout_flag[pid()] = 1
>> }
>> }
>>
>> probe syscall.epoll_wait.return {
>> c = cpu()
>> p = execname()
>>
>> epoll_times[c,p] ++
>> epoll_count[c,p] += $return
>>
>> if($return == 0 && pid() in epoll_timeout_flag) {
>> epoll_timeouts[c,p] ++
>> delete epoll_timeout_flag[pid()]
>> }
>>
>> if(!([c, p] in epoll_min)) {
>> epoll_min[c,p] = $return
>> } else if($return < epoll_min[c,p]) {
>> epoll_min[c,p] = $return
>> }
>>
>>
>> if($return > epoll_max[c,p]) {
>> epoll_max[c,p] = $return
>> }
>> }
>>
>> probe timer.s($1) {
>> printf ("%4s %45s %10s %10s %10s %10s %10s %10s\n", "cpu",
>> "process", "times", "revents", "min", "max", "avg", "timeouts" )
>> foreach ([cpu, process] in epoll_count- limit 20) {
>> all_epoll_times += epoll_times[cpu,process]
>> all_epoll_count += epoll_count[cpu,process]
>> all_epoll_timeouts += epoll_timeouts[cpu,process]
>> }
>> printf ("%4s %45s %10d %10d %10s %10s %10d %10d\n",
>> "all", "", all_epoll_times, all_epoll_count, "-", "-",
>> all_epoll_count == 0? 0:all_epoll_count/all_epoll_times,
>> all_epoll_timeouts)
>>
>> foreach ([cpu, process] in epoll_count- limit 20) {
>> printf ("[%2d] %45s %10d %10d %10d %10d %10d %10d\n",
>> cpu, process, epoll_times[cpu, process], epoll_count[cpu, process],
>> epoll_min[cpu, process], epoll_max[cpu, process],
>> epoll_count[cpu,process]/epoll_times[cpu,process],
>> epoll_timeouts[cpu,process])
>> }
>> delete epoll_count
>> delete epoll_min
>> delete epoll_max
>> delete epoll_times
>> delete epoll_timeouts
>> printf
>> ("--------------------------------------------------------------------------\n\n"
>> )
>> }
>>
>> ------------------------------------------------
>> ehttpd.erl
>> -------------------------------------------------
>> -module(ehttpd).
>> -compile(export_all).
>>
>> start() ->
>> start(8888).
>> start(Port) ->
>> N = erlang:system_info(schedulers),
>> listen(Port, N),
>> io:format("ehttpd ready with ~b schedulers on port ~b~n", [N, Port]),
>>
>> register(?MODULE, self()),
>> receive Any -> io:format("~p~n", [Any]) end. %% to stop: ehttpd!stop.
>>
>> listen(Port, N) ->
>> Opts = [inet,
>> binary,
>> {active, false},
>> {recbuf, 8192},
>> {nodelay,true},
>> {reuseaddr, true}],
>>
>> {ok, S} = gen_tcp:listen(Port, Opts),
>> lists:foreach(fun(I)-> spawn_opt(?MODULE, accept, [S, I],
>> [{scheduler, I}]) end, lists:seq(1, N)).
>>
>> accept(S, I) ->
>> case gen_tcp:accept(S) of
>> {ok, Socket} ->
>> spawn_opt(?MODULE, accept, [S, I],[{scheduler,I}] ),
>> io:format("new connection @~p~n", [I]),
>> loop(Socket,<<>>);
>> Error -> erlang:error(Error)
>> end.
>>
>> loop(S,B) ->
>> inet:setopts(S, [{active, once}]),
>> receive
>> {tcp, S, Data} ->
>> B1 = <<B/binary, Data/binary>>,
>> case binary:part(B1,{byte_size(B1), -4}) of
>> <<"\r\n\r\n">> ->
>> Response = <<"HTTP/1.1 200 OK\r\nContent-Length:
>> 12\r\nConnection: Keep-Alive\r\n\r\nhello world!">>,
>> gen_tcp:send(S, Response),
>> loop(S, <<>>);
>> _ ->
>> loop(S, B1)
>> end;
>> {tcp_closed, S} ->
>> io:format("connection closed forced~n"),
>> gen_tcp:close(S);
>> Error ->
>> io:format("unexpected message~p~n", [Error]),
>> Error
>> end.
>>
>>
>> --
>>
>> Best,
>>
>> Wei Cao
>> _______________________________________________
>> erlang-questions mailing list
>> erlang-questions@REDACTED
>> http://erlang.org/mailman/listinfo/erlang-questions
>>
>>
>>
>
>
>
> --
>
> Best,
>
> Wei Cao
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://erlang.org/pipermail/erlang-questions/attachments/20120710/40b12cf2/attachment.htm>
More information about the erlang-questions
mailing list