[erlang-questions] Improve performance of IO-bound servers written in Erlang by having a pollset for each scheduler and binding ports to schedulers together with processes
Wei Cao
cyg.cao@REDACTED
Mon Jul 9 11:01:12 CEST 2012
Hi, all
We wrote a proxy server in Erlang. The proxy's logic is rather simple:
it listens on a TCP port, accepts new connections from user clients,
and after authentication forwards packets back and forth between each
client and the backend server until the connection is closed.
It's very easy to write such a proxy in Erlang: fork a process for
each new user connection and connect to the backend server from the
same process. The process then works like a pipe; the sockets on both
sides are set to active-once mode, and whenever a TCP packet is
received from one socket, it is sent to the other socket. (A
simplified version of the proxy code is attached at the end of this
mail; a condensed sketch of the relay loop follows below.)
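To make that concrete, the heart of the per-connection relay looks
roughly like this, condensed from the attached proxy.erl (where it is
loop1/2 together with active/2):

%% Condensed from the attached proxy.erl: one process owns both
%% sockets and relays whatever arrives on one side to the other.
relay(Client, Backend) ->
    inet:setopts(Client, [{active, once}]),
    inet:setopts(Backend, [{active, once}]),
    receive
        {tcp, Client, Data} ->
            gen_tcp:send(Backend, Data),
            relay(Client, Backend);
        {tcp, Backend, Data} ->
            gen_tcp:send(Client, Data),
            relay(Client, Backend);
        {tcp_closed, _} ->
            gen_tcp:close(Client),
            gen_tcp:close(Backend)
    end.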
However, the performance is not quite satisfying: the proxy can handle
at most about 40k requests per second on our 16-core machine (Intel
Xeon L5630, 2.13GHz) under heavy stress (100 concurrent clients). We
then analyzed the behavior of beam.smp using tools like tcprstat,
mpstat, perf, and SystemTap.
tcprstat shows that QPS is about 40k, with an average latency of 1.7
milliseconds per request:
timestamp count max min avg med stddev 95_max 95_avg 95_std 99_max 99_avg 99_std
1341813326 39416 17873 953 1718 1519 724 2919 1609 340 3813 1674 462
1341813327 40528 9275 884 1645 1493 491 2777 1559 290 3508 1619 409
1341813328 40009 18105 925 1694 1507 714 2868 1586 328 3753 1650 450
mpstat shows that about 30% of CPU time is idle:
03:30:19 PM  CPU   %usr  %nice   %sys  %iowait  %irq  %soft  %steal  %guest  %idle
03:30:20 PM  all  38.69   0.00  21.92     0.00  0.06   7.52    0.00    0.00  31.80
03:30:21 PM  all  37.56   0.00  21.99     0.00  0.00   7.50    0.00    0.00  32.95
and perf top shows that much time is spent in scheduler_wait, in spin wait I guess:
9320.00 19.8% scheduler_wait  /home/mingsong.cw/erlangr16b/lib/erlang/erts-5.10/bin/beam.smp
1813.00  3.9% process_main    /home/mingsong.cw/erlangr16b/lib/erlang/erts-5.10/bin/beam.smp
1379.00  2.9% _spin_lock      /usr/lib/debug/lib/modules/2.6.32-131.21.1.tb477.el6.x86_64/vmlinux
1201.00  2.6% schedule        /home/mingsong.cw/erlangr16b/lib/erlang/erts-5.10/bin/beam.smp
It seems the performance may be associated with scheduler_wait() and
erts_check_io(). With a SystemTap script (attached at the end of this
mail), we can find out how many times the system call epoll_wait is
invoked by beam.smp and how many revents it gets on each call:
 cpu  process    times  revents  min  max  avg  timeouts
 all              1754   128042    -    -   73         3
[14]  beam.smp     151    14127   82   97   93         0
[ 5]  beam.smp     142    13291   83   97   93         0
[13]  beam.smp     127    11948   86   96   94         0
[ 6]  beam.smp     127    11836   81   96   93         0
[ 4]  beam.smp     121    11323   81   96   93         0
[15]  beam.smp     117    10935   83   96   93         0
[12]  beam.smp     486    10128    0   96   20         2
[ 1]  beam.smp      71     6549   71  100   92         0
[ 2]  beam.smp      62     5695   82   96   91         0
[ 7]  beam.smp      55     5102   81   95   92         0
[11]  beam.smp      52     4822   85   95   92         0
[ 9]  beam.smp      52     4799   85   95   92         0
[ 8]  beam.smp      51     4680   78   95   91         0
[10]  beam.smp      49     4508   85   97   92         0
[ 3]  beam.smp      46     4211   81   95   91         0
[ 0]  beam.smp      44     4088   83   95   92         0
The results show that epoll_wait is invoked 1754 times per second and
gets 73 I/O events on average. This is unacceptable for a
high-performance server: if epoll_wait is invoked no more than 2k
times per second, the average gap between two polls is at least
1/2000 s = 0.5 ms, so it takes on the order of 0.5 ms just to notice
that a packet can be read or written, which adds delay and ultimately
limits throughput.
The problem lies in the fact that there is only one pollset
system-wide, so at any given time at most one scheduler can call
erts_check_io() to obtain pending I/O tasks from the underlying
pollset, and no scheduler can call erts_check_io() again before all
pending I/O tasks have been processed. For an IO-bound application it
is therefore very likely that a scheduler finishes its own work but
must wait idly for the other schedulers to complete theirs.
Hence, we developed a patch to solve this problem by giving each
scheduler its own pollset, so that schedulers can invoke
erts_check_io() on their own pollsets concurrently. After a scheduler
completes its tasks, it can invoke erts_check_io() immediately, no
matter what state the other schedulers are in. The patch also handles
port migration: the file descriptors used by each port are recorded,
and when a port is migrated these fds are removed from the original
scheduler's pollset and added to the new scheduler's.
Binding a port to a scheduler together with its owning process also
helps performance: it reduces the cost of thread switches and
synchronization, and a bound port won't be migrated between
schedulers.
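As a concrete illustration, here is the acceptor pattern both attached
servers follow, written out with illustrative names
(start_acceptors/1, accept_loop/1 and handle_connection/1 are
placeholders, not part of the patch): one acceptor process is spawned
per scheduler and pinned there with the {scheduler, I} spawn_opt
option already used in the attached code, so each accepted socket is
owned by a process that stays on that scheduler, and with the
bind_port_to_scheduler patch the socket's port stays there too.

%% Sketch only (illustrative names): one acceptor process per
%% scheduler, each pinned with {scheduler, I}. The accepted socket's
%% controlling process is the pinned acceptor, so with the
%% bind_port_to_scheduler patch the port stays on the same scheduler.
start_acceptors(ListenSocket) ->
    N = erlang:system_info(schedulers),
    [spawn_opt(fun() -> accept_loop(ListenSocket) end, [{scheduler, I}])
     || I <- lists:seq(1, N)].

accept_loop(ListenSocket) ->
    {ok, Socket} = gen_tcp:accept(ListenSocket),
    %% handle_connection/1 is a placeholder for the per-connection logic
    handle_connection(Socket),
    accept_loop(ListenSocket).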
After applying the two patches, under the same pressure (100
concurrent clients), epoll_wait is invoked 49332 times per second and
gets 3 revents per call on average; that is to say, our server
responds more quickly and behaves closer to realtime:
 cpu  process    times  revents  min  max  avg  timeouts
 all             49332   217186    -    -    4         3
[ 2]  beam.smp    3219    16050    2    7    4         0
[11]  beam.smp    4275    16033    1    6    3         0
[ 8]  beam.smp    4240    15992    1    6    3         0
[ 9]  beam.smp    4316    15964    0    6    3         2
[10]  beam.smp    4139    15851    1    6    3         0
[ 3]  beam.smp    4256    15816    1    6    3         0
[ 1]  beam.smp    3107    15800    2    7    5         0
[ 0]  beam.smp    3727    15259    1    6    4         0
[ 7]  beam.smp    2810    14722    3    7    5         0
[13]  beam.smp    1981    11499    4    7    5         0
[15]  beam.smp    2285    10942    3    6    4         0
[14]  beam.smp    2258    10866    3    6    4         0
[ 4]  beam.smp    2246    10849    3    6    4         0
[ 6]  beam.smp    2206    10730    3    6    4         0
[12]  beam.smp    2173    10573    3    6    4         0
[ 5]  beam.smp    2093    10240    3    6    4         0
scheduler_wait no longer takes so much time now:
169.00 6.2% process_main beam.smp
55.00 2.0% _spin_lock [kernel]
45.00 1.7% driver_deliver_term beam.smp
and neither does idle CPU time:
04:30:44 PM  CPU   %usr  %nice   %sys  %iowait  %irq  %soft  %steal  %guest  %idle
04:30:45 PM  all  60.34   0.00  21.44     0.00  0.06  16.45    0.00    0.00   1.71
04:30:46 PM  all  60.99   0.00  21.22     0.00  0.00  16.26    0.00    0.00   1.52
and tcprstat shows that QPS reaches 100k and latency drops below 1 millisecond:
timestamp count max min avg med stddev 95_max 95_avg 95_std 99_max 99_avg 99_std
1341822689 96078 11592 314 910 817 311 1447 869 228 1777 897 263
1341822690 100651 24245 209 914 819 381 1458 870 229 1800 899 263
I also wrote an extremely simple keep-alive HTTP server (attached at
the end of this mail) to compare performance before and after applying
the patches. Measured with the Apache ab tool (ab -n 1000000 -c 100 -k),
a 30% performance gain can be obtained.
before
Requests per second: 103671.70 [#/sec] (mean)
Time per request: 0.965 [ms] (mean)
after
Requests per second: 133701.24 [#/sec] (mean)
Time per request: 0.748 [ms] (mean)
The patches can be found on GitHub; compile with
./configure CFLAGS=-DERTS_POLLSET_PER_SCHEDULER
Pollset per scheduler:
git fetch git://github.com/weicao/otp.git pollset_per_scheduler
https://github.com/weicao/otp/compare/weicao:master...weicao:pollset_per_scheduler
https://github.com/weicao/otp/compare/weicao:master...weicao:pollset_per_scheduler.patch
Bind port to scheduler:
git fetch git://github.com/weicao/otp.git bind_port_to_scheduler
https://github.com/weicao/otp/compare/weicao:pollset_per_scheduler...weicao:bind_port_to_scheduler
https://github.com/weicao/otp/compare/weicao:pollset_per_scheduler...weicao:bind_port_to_scheduler.patch
Appendix:
-----------------------------------
proxy.erl
------------------------------------
-module(proxy).
-compile(export_all).

-define(RECBUF_SIZE, 8192).
-define(ACCEPT_TIMEOUT, 2000).

start([MyPortAtom, DestIpAtom, DestPortAtom]) ->
    {MyPort, []} = string:to_integer(atom_to_list(MyPortAtom)),
    DestIp = atom_to_list(DestIpAtom),
    {DestPort, []} = string:to_integer(atom_to_list(DestPortAtom)),
    listen(MyPort, DestIp, DestPort),
    receive Any ->
        io:format("recv ~p~n", [Any])
    end.

listen(MyPort, DestIp, DestPort) ->
    io:format("start proxy on [local] 0.0.0.0:~p -> [remote] ~p:~p~n",
              [MyPort, DestIp, DestPort]),
    case gen_tcp:listen(MyPort,
                        [inet,
                         {ip, {0,0,0,0}},
                         binary,
                         {reuseaddr, true},
                         {recbuf, ?RECBUF_SIZE},
                         {active, false},
                         {nodelay, true}
                        ]) of
        {ok, Listen} ->
            %% spawn one acceptor per scheduler, pinned with {scheduler, I}
            N = erlang:system_info(schedulers),
            lists:foreach(fun(I) -> accept(Listen, DestIp, DestPort, I) end,
                          lists:seq(1, N));
        {error, Reason} ->
            io:format("error listen ~p~n", [Reason])
    end.

accept(Listen, DestIp, DestPort, I) ->
    spawn_opt(?MODULE, loop, [Listen, DestIp, DestPort, I], [{scheduler, I}]).

loop(Listen, DestIp, DestPort, I) ->
    case gen_tcp:accept(Listen, ?ACCEPT_TIMEOUT) of
        {ok, S1} ->
            %% spawn a new acceptor for the next connection,
            %% then handle this connection in the current process
            accept(Listen, DestIp, DestPort, I),
            case catch gen_tcp:connect(DestIp, DestPort,
                                       [inet, binary, {active, false},
                                        {reuseaddr, true}, {nodelay, true}]) of
                {ok, S2} ->
                    io:format("new connection~n"),
                    loop1(S1, S2);
                {error, Reason} ->
                    io:format("error connect ~p~n", [Reason])
            end;
        {error, timeout} ->
            loop(Listen, DestIp, DestPort, I);
        Error ->
            io:format("error accept ~p~n", [Error]),
            accept(Listen, DestIp, DestPort, I)
    end.

%% relay loop: both sockets in active-once mode, forward in both directions
loop1(S1, S2) ->
    active(S1, S2),
    receive
        {tcp, S1, Data} ->
            gen_tcp:send(S2, Data),
            loop1(S1, S2);
        {tcp, S2, Data} ->
            gen_tcp:send(S1, Data),
            loop1(S1, S2);
        {tcp_closed, S1} ->
            io:format("S1 close~n"),
            gen_tcp:close(S1),
            gen_tcp:close(S2);
        {tcp_closed, S2} ->
            io:format("S2 close~n"),
            gen_tcp:close(S1),
            gen_tcp:close(S2)
    end.

active(S1, S2) ->
    inet:setopts(S1, [{active, once}]),
    inet:setopts(S2, [{active, once}]).
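
For reference, once proxy.erl is compiled, the proxy can be started
from an Erlang shell as below; start/1 takes atoms because it converts
them with atom_to_list/1 (as arguments arrive when using erl -s), and
the port and backend address here are example values only:

%% Example values only: listen on 3333 and forward to 192.168.1.10:3306.
%% Note that start/1 blocks in a receive, so run it in its own shell/node.
proxy:start(['3333', '192.168.1.10', '3306']).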
-----------------------------------
epollwait.stp
------------------------------------
#! /usr/bin/env stap
#
# Count epoll_wait activity per (cpu, process): number of calls, total
# revents, min/max/avg revents per call, and calls that timed out.
# Prints a report every $1 seconds ($1 is the script's first argument).
#
global epoll_timeout_flag, epoll_count, epoll_min, epoll_max, epoll_times, epoll_timeouts

probe syscall.epoll_wait {
    if (timeout > 0) {
        epoll_timeout_flag[pid()] = 1
    }
}

probe syscall.epoll_wait.return {
    c = cpu()
    p = execname()
    epoll_times[c,p] ++
    epoll_count[c,p] += $return
    if ($return == 0 && pid() in epoll_timeout_flag) {
        epoll_timeouts[c,p] ++
        delete epoll_timeout_flag[pid()]
    }
    if (!([c, p] in epoll_min)) {
        epoll_min[c,p] = $return
    } else if ($return < epoll_min[c,p]) {
        epoll_min[c,p] = $return
    }
    if ($return > epoll_max[c,p]) {
        epoll_max[c,p] = $return
    }
}

probe timer.s($1) {
    printf ("%4s %45s %10s %10s %10s %10s %10s %10s\n", "cpu",
            "process", "times", "revents", "min", "max", "avg", "timeouts")
    foreach ([cpu, process] in epoll_count- limit 20) {
        all_epoll_times += epoll_times[cpu,process]
        all_epoll_count += epoll_count[cpu,process]
        all_epoll_timeouts += epoll_timeouts[cpu,process]
    }
    printf ("%4s %45s %10d %10d %10s %10s %10d %10d\n",
            "all", "", all_epoll_times, all_epoll_count, "-", "-",
            all_epoll_count == 0 ? 0 : all_epoll_count/all_epoll_times,
            all_epoll_timeouts)
    foreach ([cpu, process] in epoll_count- limit 20) {
        printf ("[%2d] %45s %10d %10d %10d %10d %10d %10d\n",
                cpu, process, epoll_times[cpu, process], epoll_count[cpu, process],
                epoll_min[cpu, process], epoll_max[cpu, process],
                epoll_count[cpu,process]/epoll_times[cpu,process],
                epoll_timeouts[cpu,process])
    }
    delete epoll_count
    delete epoll_min
    delete epoll_max
    delete epoll_times
    delete epoll_timeouts
    printf ("--------------------------------------------------------------------------\n\n")
}
------------------------------------------------
ehttpd.erl
-------------------------------------------------
-module(ehttpd).
-compile(export_all).

start() ->
    start(8888).
start(Port) ->
    N = erlang:system_info(schedulers),
    listen(Port, N),
    io:format("ehttpd ready with ~b schedulers on port ~b~n", [N, Port]),
    register(?MODULE, self()),
    receive Any -> io:format("~p~n", [Any]) end.  %% to stop: ehttpd!stop.

listen(Port, N) ->
    Opts = [inet,
            binary,
            {active, false},
            {recbuf, 8192},
            {nodelay, true},
            {reuseaddr, true}],
    {ok, S} = gen_tcp:listen(Port, Opts),
    %% one acceptor per scheduler, pinned with {scheduler, I}
    lists:foreach(fun(I) -> spawn_opt(?MODULE, accept, [S, I], [{scheduler, I}]) end,
                  lists:seq(1, N)).

accept(S, I) ->
    case gen_tcp:accept(S) of
        {ok, Socket} ->
            spawn_opt(?MODULE, accept, [S, I], [{scheduler, I}]),
            io:format("new connection @~p~n", [I]),
            loop(Socket, <<>>);
        Error ->
            erlang:error(Error)
    end.

loop(S, B) ->
    inet:setopts(S, [{active, once}]),
    receive
        {tcp, S, Data} ->
            B1 = <<B/binary, Data/binary>>,
            %% a request is complete when the buffer ends with an empty line
            case binary:part(B1, {byte_size(B1), -4}) of
                <<"\r\n\r\n">> ->
                    Response = <<"HTTP/1.1 200 OK\r\nContent-Length: 12\r\nConnection: Keep-Alive\r\n\r\nhello world!">>,
                    gen_tcp:send(S, Response),
                    loop(S, <<>>);
                _ ->
                    loop(S, B1)
            end;
        {tcp_closed, S} ->
            io:format("connection closed forced~n"),
            gen_tcp:close(S);
        Error ->
            io:format("unexpected message~p~n", [Error]),
            Error
    end.
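
For completeness, the HTTP server used in the ab comparison is started
the same way; 8888 is its default port (see start/0 above), and the ab
command line is the one quoted earlier in the mail:

%% Listens on the default port 8888; use ehttpd:start(Port) for another
%% port. Benchmark it from another shell/machine with the ab command
%% given above, pointed at this host and port.
ehttpd:start().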
--
Best,
Wei Cao