[erlang-questions] Improve performance of IO-bound server written in Erlang by having a pollset for each scheduler and binding ports to schedulers together with processes

Wei Cao <>
Mon Jul 9 11:01:12 CEST 2012


Hi, all

We wrote a proxy server in Erlang. The proxy's logic is rather simple:
it listens on a TCP port, accepts new connections from user clients,
and, after authentication, forwards packets back and forth between the
client and the backend server until the connection is closed.

It's very easy to write such a proxy in Erlang: spawn a process for
each new user connection and connect to the backend server from the
same process. The process works like a pipe: sockets on both sides are
set to {active, once} mode, and whenever a TCP packet is received on
one socket, it is sent out on the other. (A simplified version of the
proxy code is attached at the end of this mail.)
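
In essence, the relay is just a few lines; here is a condensed sketch
of the loop1/active functions from the attached proxy.erl (setup and
error handling stripped out):

relay(S1, S2) ->
    %% re-arm both sockets, then wait for the next packet from either side
    inet:setopts(S1, [{active, once}]),
    inet:setopts(S2, [{active, once}]),
    receive
        {tcp, S1, Data} -> gen_tcp:send(S2, Data), relay(S1, S2);
        {tcp, S2, Data} -> gen_tcp:send(S1, Data), relay(S1, S2);
        {tcp_closed, _} -> gen_tcp:close(S1), gen_tcp:close(S2)
    end.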

However, the performance is not quite satisfying: the proxy can handle
at most 40k requests per second on our 16-core machine (Intel Xeon
L5630, 2.13GHz) under heavy stress (100 concurrent clients). We then
analyzed the behavior of beam.smp using tools like tcprstat, mpstat,
perf, and SystemTap.

tcprstat shows QPS is about 40k, with an average latency of 1.7
milliseconds per request.

timestamp	count	max	min	avg	med	stddev	95_max	95_avg	95_std	99_max	99_avg	99_std
1341813326	39416	17873	953	1718	1519	724	2919	1609	340	3813	1674	462
1341813327	40528	9275	884	1645	1493	491	2777	1559	290	3508	1619	409
1341813328	40009	18105	925	1694	1507	714	2868	1586	328	3753	1650	450


mpstat shows about 30% of CPU time is idle:

03:30:19 PM  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest   %idle
03:30:20 PM  all   38.69    0.00   21.92    0.00    0.06    7.52    0.00    0.00   31.80
03:30:21 PM  all   37.56    0.00   21.99    0.00    0.00    7.50    0.00    0.00   32.95

and perf top shows much time is wasted in scheduler_wait, in spin wait I guess:

9320.00  19.8%  scheduler_wait  /home/mingsong.cw/erlangr16b/lib/erlang/erts-5.10/bin/beam.smp
1813.00   3.9%  process_main    /home/mingsong.cw/erlangr16b/lib/erlang/erts-5.10/bin/beam.smp
1379.00   2.9%  _spin_lock      /usr/lib/debug/lib/modules/2.6.32-131.21.1.tb477.el6.x86_64/vmlinux
1201.00   2.6%  schedule        /home/mingsong.cw/erlangr16b/lib/erlang/erts-5.10/bin/beam.smp

It seems the performance is associated with scheduler_wait() and
erts_check_io(). With a SystemTap script (attached at the end of this
mail), we can find out how many times the epoll_wait system call is
invoked by beam.smp, and how many revents each call returns.
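
For reference, the script takes the reporting interval in seconds as
its argument (the timer.s($1) probe), so assuming SystemTap and kernel
debuginfo are installed, it can be run as, e.g.:

  stap epollwait.stp 1

which prints one table like the following per second.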

 cpu    process    times    revents    min    max    avg   timeouts
 all                1754     128042      -      -     73          3
[14]   beam.smp      151      14127     82     97     93          0
[ 5]   beam.smp      142      13291     83     97     93          0
[13]   beam.smp      127      11948     86     96     94          0
[ 6]   beam.smp      127      11836     81     96     93          0
[ 4]   beam.smp      121      11323     81     96     93          0
[15]   beam.smp      117      10935     83     96     93          0
[12]   beam.smp      486      10128      0     96     20          2
[ 1]   beam.smp       71       6549     71    100     92          0
[ 2]   beam.smp       62       5695     82     96     91          0
[ 7]   beam.smp       55       5102     81     95     92          0
[11]   beam.smp       52       4822     85     95     92          0
[ 9]   beam.smp       52       4799     85     95     92          0
[ 8]   beam.smp       51       4680     78     95     91          0
[10]   beam.smp       49       4508     85     97     92          0
[ 3]   beam.smp       46       4211     81     95     91          0
[ 0]   beam.smp       44       4088     83     95     92          0

The result shows that epoll_wait is invoked 1754 times per second and
returns 73 io events on average. This is unacceptable for a high
performance server: if epoll_wait is invoked no more than 2k times per
second, each poll cycle takes at least 0.5 milliseconds (1/2000 s), so
reading or writing a packet can stall for that long in the pollset,
which causes long delays and ultimately hurts throughput.

The problem lies in there being only one global pollset system-wide,
so at any given time at most one scheduler can call erts_check_io() to
obtain pending io tasks from the underlying pollset, and no scheduler
can call erts_check_io() again before all pending io tasks are
processed. For an IO-bound application, it is therefore very likely
that a scheduler finishes its own work but must wait idly for the
other schedulers to complete theirs.
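
As a side note, the emulator's view of its pollset can be inspected
from the Erlang shell (my own suggestion for verifying this, not part
of the patch); on a stock emulator, check_io reports a single set of
counters for the whole VM:

  1> erlang:system_info(check_io).
  %% look for {total_poll_set_size, _}: one pollset shared by all schedulers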

Hence, we developed a patch to solve this problem by having a pollset
for each scheduler, so that each scheduler can invoke erts_check_io()
on its own pollset concurrently. After a scheduler completes its
tasks, it can invoke erts_check_io() immediately, no matter what state
the other schedulers are in. The patch also handles the port migration
situation: all file descriptors used by each port are recorded, and
when a port is migrated, these fds are removed from the original
scheduler's pollset and added to the new scheduler's.

Binding a port to a scheduler together with its process also helps
performance: it reduces the cost of thread switches and
synchronization, and a bound port is never migrated between
schedulers.
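
For reference, the process side of the binding uses the standard
{scheduler, I} option to spawn_opt (the same one the attached code
uses); my reading of the patch is that a port opened by a bound
process then stays on that scheduler as well. A minimal sketch, with
hypothetical host/port values:

bind_example(I, Host, Port) ->
    spawn_opt(fun() ->
                  %% with the bind_port_to_scheduler patch, this socket's
                  %% port should stay on scheduler I with its owner process
                  {ok, S} = gen_tcp:connect(Host, Port, [binary]),
                  ok = gen_tcp:send(S, <<"hello">>),
                  gen_tcp:close(S)
              end,
              [{scheduler, I}]).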

After applying the two patches, under the same pressure (100
concurrent clients), epoll_wait is invoked 49332 times per second and
returns about 3 revents per call on average; that is to say, our
server responds quicker and behaves much closer to real time.

 cpu    process    times    revents    min    max    avg   timeouts
 all               49332     217186      -      -      4          3
[ 2]   beam.smp     3219      16050      2      7      4          0
[11]   beam.smp     4275      16033      1      6      3          0
[ 8]   beam.smp     4240      15992      1      6      3          0
[ 9]   beam.smp     4316      15964      0      6      3          2
[10]   beam.smp     4139      15851      1      6      3          0
[ 3]   beam.smp     4256      15816      1      6      3          0
[ 1]   beam.smp     3107      15800      2      7      5          0
[ 0]   beam.smp     3727      15259      1      6      4          0
[ 7]   beam.smp     2810      14722      3      7      5          0
[13]   beam.smp     1981      11499      4      7      5          0
[15]   beam.smp     2285      10942      3      6      4          0
[14]   beam.smp     2258      10866      3      6      4          0
[ 4]   beam.smp     2246      10849      3      6      4          0
[ 6]   beam.smp     2206      10730      3      6      4          0
[12]   beam.smp     2173      10573      3      6      4          0
[ 5]   beam.smp     2093      10240      3      6      4          0

scheduler_wait no longer takes so much time now:

              169.00  6.2% process_main                    beam.smp
               55.00  2.0% _spin_lock                      [kernel]
               45.00  1.7% driver_deliver_term             beam.smp

and idle CPU time has dropped as well:
04:30:44 PM  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest   %idle
04:30:45 PM  all   60.34    0.00   21.44    0.00    0.06   16.45    0.00    0.00    1.71
04:30:46 PM  all   60.99    0.00   21.22    0.00    0.00   16.26    0.00    0.00    1.52

and tcprstat shows QPS reaching 100k, with latency under 1 millisecond:

timestamp	count	max	min	avg	med	stddev	95_max	95_avg	95_std	99_max	99_avg	99_std
1341822689	96078	11592	314	910	817	311	1447	869	228	1777	897	263
1341822690	100651	24245	209	914	819	381	1458	870	229	1800	899	263

I also wrote an extremely simple keep-alive HTTP server (attached at
the end of this mail) to compare performance before and after applying
the patches. Measured with the Apache ab tool (ab -n 1000000 -c 100 -k),
a 30% performance gain can be observed.
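
For completeness, my invocation reconstructed from the code (ehttpd
listens on port 8888 by default and ignores the request path; start
the server in one terminal after compiling, then run ab in another):

  erl -noshell -s ehttpd start
  ab -n 1000000 -c 100 -k http://127.0.0.1:8888/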

before
Requests per second:    103671.70 [#/sec] (mean)
Time per request:       0.965 [ms] (mean)

after
Requests per second:    133701.24 [#/sec] (mean)
Time per request:       0.748 [ms] (mean)

The patches can be found on GitHub; compile with
./configure CFLAGS=-DERTS_POLLSET_PER_SCHEDULER

Pollset per scheduler:

git fetch git://github.com/weicao/otp.git pollset_per_scheduler

https://github.com/weicao/otp/compare/weicao:master...weicao:pollset_per_scheduler
https://github.com/weicao/otp/compare/weicao:master...weicao:pollset_per_scheduler.patch


Bind port to scheduler:

git fetch git://github.com/weicao/otp.git bind_port_to_scheduler

https://github.com/weicao/otp/compare/weicao:pollset_per_scheduler...weicao:bind_port_to_scheduler
https://github.com/weicao/otp/compare/weicao:pollset_per_scheduler...weicao:bind_port_to_scheduler.patch
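
Putting it together, a possible build sequence (standard OTP-from-git
build steps; the compare URLs above suggest bind_port_to_scheduler is
stacked on top of pollset_per_scheduler):

  git clone git://github.com/weicao/otp.git && cd otp
  git checkout bind_port_to_scheduler
  ./otp_build autoconf
  ./configure CFLAGS=-DERTS_POLLSET_PER_SCHEDULER
  make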


Appendix:

-----------------------------------
proxy.erl
------------------------------------

-module(proxy).
-compile(export_all).

-define(RECBUF_SIZE, 8192).
-define(ACCEPT_TIMEOUT, 2000).

start([MyPortAtom, DestIpAtom, DestPortAtom]) ->
    {MyPort, []} =  string:to_integer(atom_to_list(MyPortAtom)),
    DestIp = atom_to_list(DestIpAtom),
    {DestPort, []} = string:to_integer(atom_to_list(DestPortAtom)),
    listen(MyPort, DestIp, DestPort),
    receive Any ->
        io:format("recv ~p~n", [Any])
    end.

listen(MyPort, DestIp, DestPort) ->
    io:format("start proxy on [local] 0.0.0.0:~p -> [remote] ~p:~p~n",
[MyPort, DestIp, DestPort]),
    case gen_tcp:listen(MyPort,
           [inet,
            {ip, {0,0,0,0}},
            binary,
            {reuseaddr, true},
            {recbuf, ?RECBUF_SIZE},
            {active, false},
            {nodelay, true}
           ]) of
        {ok, Listen} ->
            N = erlang:system_info(schedulers),
            lists:foreach(fun(I) -> accept(Listen, DestIp, DestPort, I) end,
                          lists:seq(1, N));
        {error, Reason} ->
            io:format("error listen ~p~n", [Reason])
    end.

%% one acceptor process per scheduler, bound with {scheduler, I}
accept(Listen, DestIp, DestPort, I) ->
    spawn_opt(?MODULE, loop, [Listen, DestIp, DestPort, I], [{scheduler, I}]).

loop(Listen, DestIp, DestPort, I) ->
    case gen_tcp:accept(Listen, ?ACCEPT_TIMEOUT) of
        {ok, S1} ->
            %% hand the listen socket to a fresh acceptor, then serve
            %% this connection in the current process
            accept(Listen, DestIp, DestPort, I),
            case catch gen_tcp:connect(DestIp, DestPort,
                           [inet, binary, {active, false},
                            {reuseaddr, true}, {nodelay, true}]) of
                {ok, S2} ->
                    io:format("new connection~n"),
                    loop1(S1, S2);
                {error, Reason} ->
                    io:format("error connect ~p~n", [Reason])
            end;
        {error, timeout} ->
            loop(Listen, DestIp, DestPort, I);
        Error ->
            io:format("error accept ~p~n", [Error]),
            accept(Listen, DestIp, DestPort, I)
    end.

loop1(S1, S2) ->
    active(S1, S2),
    receive
        {tcp, S1, Data} ->
            gen_tcp:send(S2, Data),
            loop1(S1, S2);
        {tcp, S2, Data} ->
            gen_tcp:send(S1, Data),
            loop1(S1, S2);
        {tcp_closed, S1} ->
            io:format("S1 close~n"),
            gen_tcp:close(S1),
            gen_tcp:close(S2);
        {tcp_closed, S2} ->
            io:format("S2 close~n"),
            gen_tcp:close(S1),
            gen_tcp:close(S2)
    end.

active(S1,S2)->
    inet:setopts(S1, [{active, once}]),
    inet:setopts(S2, [{active, once}]).
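
A note on running it: the argument shape of start/1 (a list of atoms,
converted back to integers/strings) matches how erl -s passes extra
command-line arguments, so after compiling, the proxy can be started
like this (addresses are example values):

  erl -noshell -s proxy start 8000 127.0.0.1 80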

-----------------------------------
epollwait.stp
------------------------------------
#! /usr/bin/env stap
#
# epollwait.stp -- per-CPU statistics of epoll_wait(): call rate,
# returned events (min/max/avg), and timed-out calls.
# Usage: stap epollwait.stp <interval_seconds>
#

global epoll_timeout_flag, epoll_count, epoll_min, epoll_max, epoll_times, epoll_timeouts

probe  syscall.epoll_wait {
    if(timeout > 0) {
        epoll_timeout_flag[pid()] = 1
    }
}

probe  syscall.epoll_wait.return {
  c = cpu()
  p = execname()

  epoll_times[c,p] ++
  epoll_count[c,p] += $return

  if($return == 0 && pid() in epoll_timeout_flag) {
    epoll_timeouts[c,p] ++
    delete epoll_timeout_flag[pid()]
  }

  if(!([c, p] in epoll_min)) {
    epoll_min[c,p] = $return
  } else if($return < epoll_min[c,p]) {
    epoll_min[c,p] = $return
  }


  if($return > epoll_max[c,p]) {
    epoll_max[c,p] = $return
  }
}

probe timer.s($1) {
  printf ("%4s %45s %10s %10s %10s %10s %10s %10s\n", "cpu",
"process", "times", "revents", "min", "max", "avg", "timeouts" )
  foreach ([cpu, process] in epoll_count- limit 20) {
      all_epoll_times += epoll_times[cpu,process]
      all_epoll_count += epoll_count[cpu,process]
      all_epoll_timeouts += epoll_timeouts[cpu,process]
  }
  printf ("%4s %45s %10d %10d %10s %10s %10d %10d\n",
    "all", "", all_epoll_times, all_epoll_count, "-", "-",
    all_epoll_count == 0? 0:all_epoll_count/all_epoll_times, all_epoll_timeouts)

  foreach ([cpu, process] in epoll_count- limit 20) {
     printf ("[%2d] %45s %10d %10d %10d %10d %10d %10d\n",
        cpu, process, epoll_times[cpu, process], epoll_count[cpu, process],
        epoll_min[cpu, process], epoll_max[cpu, process],
        epoll_count[cpu,process]/epoll_times[cpu,process],
        epoll_timeouts[cpu,process])
  }
  delete epoll_count
  delete epoll_min
  delete epoll_max
  delete epoll_times
  delete epoll_timeouts
  printf ("--------------------------------------------------------------------------\n\n"
)
}

------------------------------------------------
ehttpd.erl
-------------------------------------------------
-module(ehttpd).
-compile(export_all).

start() ->
    start(8888).
start(Port) ->
    N = erlang:system_info(schedulers),
    listen(Port, N),
    io:format("ehttpd ready with ~b schedulers on port ~b~n", [N, Port]),

    register(?MODULE, self()),
    receive Any -> io:format("~p~n", [Any]) end.  %% to stop: ehttpd!stop.

listen(Port, N) ->
    Opts = [inet,
            binary,
            {active, false},
            {recbuf, 8192},
            {nodelay,true},
            {reuseaddr, true}],

    {ok, S} = gen_tcp:listen(Port, Opts),
    lists:foreach(fun(I)-> spawn_opt(?MODULE, accept, [S, I],
[{scheduler, I}]) end, lists:seq(1, N)).

accept(S, I) ->
    case gen_tcp:accept(S) of
        {ok, Socket} ->
            spawn_opt(?MODULE, accept, [S, I],[{scheduler,I}] ),
            io:format("new connection @~p~n", [I]),
            loop(Socket,<<>>);
        Error    -> erlang:error(Error)
    end.

loop(S, B) ->
    inet:setopts(S, [{active, once}]),
    receive
        {tcp, S, Data} ->
            B1 = <<B/binary, Data/binary>>,
            %% a request is complete when the buffer ends with an empty
            %% line; guard against buffers shorter than 4 bytes, which
            %% would make binary:part/2 fail
            case byte_size(B1) >= 4 andalso
                 binary:part(B1, {byte_size(B1), -4}) =:= <<"\r\n\r\n">> of
                true ->
                    Response = <<"HTTP/1.1 200 OK\r\nContent-Length: 12\r\nConnection: Keep-Alive\r\n\r\nhello world!">>,
                    gen_tcp:send(S, Response),
                    loop(S, <<>>);
                false ->
                    loop(S, B1)
            end;
        {tcp_closed, S} ->
            io:format("connection closed by peer~n"),
            gen_tcp:close(S);
        Error ->
            io:format("unexpected message ~p~n", [Error]),
            Error
    end.


-- 

Best,

Wei Cao


