Unhappiness with high TCP accept rates?

Scott Lystig Fritchie fritchie@REDACTED
Mon Mar 6 09:15:05 CET 2006


I've spent a very puzzling weekend trying to figure out what has been
causing a problem with an Erlang application that would get perhaps
100 TCP connection attempts immediately and then new connections every
20-30 milliseconds.  The Erlang VM was CPU-bound (very busy, but I had
some nasty load generators, so "very busy" was normal)...

... and then the problems began.

The weirdest symptom is the very, very strange-looking "tcpdump" packet
captures I was getting.  The capture was taken on the *server* side
(server = 192.168.0.165), so all of the "TCP Retransmission" entries make
no sense: the receiving machine clearly saw every packet that was
allegedly dropped & retransmitted!  Then, making matters worse, the
server sends multiple SYN-ACK packets, spaced several seconds apart.

When this sort of thing happened, I would see worst-case latencies for
this server shoot up from ~150 milliseconds to 10-30 seconds.

When I used "appmon" to look at the supervisor tree, I would see
per-TCP-session processes still alive, even after their TCP socket
was supposed to be closed: "netstat -a" on the client side showed
nothing, and "lsof -p beam-pid" showed that the file descriptors were
closed, but the Erlang processes were stuck waiting in prim_inet:send/2.
Or the server-side socket was still open (confirmed by "netstat" and
"lsof") while the client side was closed, and the Erlang processes were
stuck waiting in prim_inet:recv0/3.  (See below, after the "tethereal"
packet summary.)

I managed to make this problem go away by:

	1. Increasing the {backlog, N} option of the listen socket (set via
	   gen_tcp:listen/2, i.e. the socket that gen_tcp:accept/2 accepts on).
	2. Raising the priority of the Erlang process doing the accept()
	   work from normal to high.

I dunno why this combination works, but it does.  I was using R10B-9 on
a 32-bit Intel box running Red Hat Enterprise Linux 3 Update 4.
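
To make that workaround concrete, here is a rough sketch of the
combination.  The backlog value, socket options, and start_session/1
are placeholders of mine, not the real code; port 7474 is the one from
the trace below.

    %% Listen with a backlog much larger than gen_tcp's default of 5, and
    %% run the accept loop at high priority so it keeps draining the
    %% kernel's accept queue even when the VM is CPU-bound.
    start_acceptor() ->
        {ok, LSock} = gen_tcp:listen(7474, [binary,
                                            {active, false},
                                            {reuseaddr, true},
                                            {backlog, 1024}]),
        Pid = spawn_link(fun() ->
                                 process_flag(priority, high),
                                 accept_loop(LSock)
                         end),
        {ok, Pid}.

    accept_loop(LSock) ->
        case gen_tcp:accept(LSock) of
            {ok, Sock} ->
                %% Hand the socket off immediately so we get back into
                %% accept() as quickly as possible.
                Pid = start_session(Sock),              % hypothetical session starter
                ok = gen_tcp:controlling_process(Sock, Pid),
                accept_loop(LSock);
            {error, Reason} ->
                exit({accept_failed, Reason})
        end.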

Do any of these problems sound even vaguely familiar?  Or did I merely
stumble into a cursed weekend?

-Scott

--- snip --- snip --- snip --- snip --- snip --- snip --- 

I can understand why packet 293 didn't get a response: the server's
accept(2) backlog was exceeded.  But the server does answer later, in
packet 16666, except that it doesn't seem really serious about
communicating until packet 54751.  I've never seen this TCP behavior
before.

293   0.306478 192.168.0.166 -> 192.168.0.165 TCP 57463 > 7474 [SYN] Seq=0 Ack=0 Win=5792 Len=0 MSS=1460 
16665   3.297657 192.168.0.166 -> 192.168.0.165 TCP 57463 > 7474 [SYN] Seq=0 Ack=0 Win=5792 Len=0 MSS=1460 
16666   3.297668 192.168.0.165 -> 192.168.0.166 TCP 7474 > 57463 [SYN, ACK] Seq=0 Ack=1 Win=5792 Len=0 MSS=1460 
16718   3.298198 192.168.0.166 -> 192.168.0.165 TCP 57463 > 7474 [ACK] Seq=1 Ack=1 Win=5792 Len=0 
16754   3.299662 192.168.0.166 -> 192.168.0.165 TCP 57463 > 7474 [PSH, ACK] Seq=1 Ack=1 Win=5792 Len=36 
17961   3.517092 192.168.0.166 -> 192.168.0.165 TCP [TCP Retransmission] 57463 > 7474 [PSH, ACK] Seq=1 Ack=1 Win=5792 Len=36 
20362   3.957024 192.168.0.166 -> 192.168.0.165 TCP [TCP Retransmission] 57463 > 7474 [PSH, ACK] Seq=1 Ack=1 Win=5792 Len=36 
25186   4.836903 192.168.0.166 -> 192.168.0.165 TCP [TCP Retransmission] 57463 > 7474 [PSH, ACK] Seq=1 Ack=1 Win=5792 Len=36 
34206   6.494619 192.168.0.165 -> 192.168.0.166 TCP 7474 > 57463 [SYN, ACK] Seq=0 Ack=1 Win=5792 Len=0 MSS=1460 
34211   6.494736 192.168.0.166 -> 192.168.0.165 TCP 57463 > 7474 [ACK] Seq=37 Ack=1 Win=5792 Len=0 
34784   6.596630 192.168.0.166 -> 192.168.0.165 TCP [TCP Retransmission] 57463 > 7474 [PSH, ACK] Seq=1 Ack=1 Win=5792 Len=36 
54088  10.116193 192.168.0.166 -> 192.168.0.165 TCP [TCP Retransmission] 57463 > 7474 [PSH, ACK] Seq=1 Ack=1 Win=5792 Len=36 
54728  12.493896 192.168.0.165 -> 192.168.0.166 TCP 7474 > 57463 [SYN, ACK] Seq=0 Ack=1 Win=5792 Len=0 MSS=1460 
54729  12.493957 192.168.0.166 -> 192.168.0.165 TCP 57463 > 7474 [ACK] Seq=37 Ack=1 Win=5792 Len=0 
54750  17.154814 192.168.0.166 -> 192.168.0.165 TCP [TCP Retransmission] 57463 > 7474 [PSH, ACK] Seq=1 Ack=1 Win=5792 Len=36 
54751  17.154858 192.168.0.165 -> 192.168.0.166 TCP 7474 > 57463 [ACK] Seq=1 Ack=37 Win=5792 Len=0 
54756  17.156871 192.168.0.165 -> 192.168.0.166 TCP 7474 > 57463 [ACK] Seq=1 Ack=37 Win=5792 Len=1448 
54757  17.156892 192.168.0.165 -> 192.168.0.166 TCP 7474 > 57463 [ACK] Seq=1449 Ack=37 Win=5792 Len=1448 
54758  17.156904 192.168.0.165 -> 192.168.0.166 TCP 7474 > 57463 [PSH, ACK] Seq=2897 Ack=37 Win=5792 Len=1448 
54759  17.156983 192.168.0.166 -> 192.168.0.165 TCP 57463 > 7474 [ACK] Seq=37 Ack=1449 Win=8688 Len=0 
54760  17.157002 192.168.0.165 -> 192.168.0.166 TCP 7474 > 57463 [ACK] Seq=4345 Ack=37 Win=5792 Len=1448 
54761  17.157008 192.168.0.165 -> 192.168.0.166 TCP 7474 > 57463 [PSH, ACK] Seq=5793 Ack=37 Win=5792 Len=892 
54762  17.157025 192.168.0.166 -> 192.168.0.165 TCP 57463 > 7474 [ACK] Seq=37 Ack=2897 Win=11584 Len=0 
54763  17.157051 192.168.0.166 -> 192.168.0.165 TCP 57463 > 7474 [ACK] Seq=37 Ack=4345 Win=14480 Len=0 
54764  17.157097 192.168.0.166 -> 192.168.0.165 TCP 57463 > 7474 [ACK] Seq=37 Ack=5793 Win=17376 Len=0 
54765  17.157114 192.168.0.166 -> 192.168.0.165 TCP 57463 > 7474 [ACK] Seq=37 Ack=6685 Win=20272 Len=0 
54766  17.157142 192.168.0.166 -> 192.168.0.165 TCP 57463 > 7474 [FIN, ACK] Seq=37 Ack=6685 Win=20272 Len=0 
54767  17.157829 192.168.0.165 -> 192.168.0.166 TCP 7474 > 57463 [FIN, ACK] Seq=6685 Ack=38 Win=5792 Len=0 
54768  17.157892 192.168.0.166 -> 192.168.0.165 TCP 57463 > 7474 [ACK] Seq=38 Ack=6686 Win=20272 Len=0 

--- snip --- snip --- snip --- snip --- snip --- snip --- 

[ Just a few of the processes are shown ... they all looked like one
  of the two below.]

Node: tmp99@REDACTED, Process: <0.18114.0>
[{current_function,{prim_inet,send,2}},
 {initial_call,{pss_protocol,new_session,2}},
 {status,waiting},
 {message_queue_len,0},
 {messages,[]},
 {links,[#Port<0.19360>,<0.186.0>]},
 {dictionary,[{random_seed,{1141,15905,3715}},{line_number,0}]},
 {trap_exit,false},
 {error_handler,error_handler},
 {priority,normal},
 {group_leader,<0.44.0>},
 {heap_size,196418},
 {stack_size,10},
 {reductions,161939},
 {garbage_collection,[{fullsweep_after,65535}]}]

Node: tmp99@REDACTED, Process: <0.16327.0>
[{current_function,{prim_inet,recv0,3}},
 {initial_call,{pss_protocol,new_session,2}},
 {status,waiting},
 {message_queue_len,0},
 {messages,[]},
 {links,[<0.186.0>,#Port<0.17416>]},
 {dictionary,[{random_seed,{1141,15647,4245}},{line_number,0}]},
 {trap_exit,false},
 {error_handler,error_handler},
 {priority,normal},
 {group_leader,<0.44.0>},
 {heap_size,233},
 {stack_size,11},
 {reductions,81},
 {garbage_collection,[{fullsweep_after,65535}]}]

[ This was the process info of the gen_server that was overseeing
  all those hung processes.]

Node: tmp99@REDACTED, Process: <0.186.0>
[{registered_name,pss_sess_mgr},
 {current_function,{gen_server,loop,6}},
 {initial_call,{proc_lib,init_p,5}},
 {status,waiting},
 {message_queue_len,0},
 {messages,[]},
 {links,[<0.6083.0>,
         <0.8842.0>,
         <0.13494.0>,
         <0.18113.0>,
         <0.18115.0>,
         <0.18114.0>,
         <0.16327.0>,
         <0.11839.0>,
         <0.13493.0>,
         <0.13492.0>,
         <0.8844.0>,
         <0.11827.0>,
         <0.8843.0>,
         <0.7662.0>,
         <0.8059.0>,
         <0.8451.0>,
         <0.8452.0>,
         <0.8450.0>,
         <0.8057.0>,
         <0.8058.0>,
         <0.7663.0>,
         <0.7266.0>,
         <0.7268.0>,
         <0.7661.0>,
         <0.7267.0>,
         <0.6478.0>,
         <0.6871.0>,
         <0.6872.0>,
         <0.6870.0>,
         <0.6476.0>,
         <0.6477.0>,
         <0.6084.0>,
         <0.2574.0>,
         <0.4140.0>,
         <0.4918.0>,
         <0.5690.0>,
         <0.5692.0>,
         <0.6082.0>,
         <0.5691.0>,
         <0.5303.0>,
         <0.5304.0>,
         <0.5302.0>,
         <0.4529.0>,
         <0.4916.0>,
         <0.4917.0>,
         <0.4530.0>,
         <0.4142.0>,
         <0.4528.0>,
         <0.4141.0>,
         <0.3354.0>,
         <0.3755.0>,
         <0.3756.0>,
         <0.3754.0>,
         <0.3352.0>,
         <0.3353.0>,
         <0.2966.0>,
         <0.2967.0>,
         <0.2965.0>,
         <0.1789.0>,
         <0.2572.0>,
         <0.2573.0>,
         <0.2182.0>,
         <0.2183.0>,
         <0.2181.0>,
         <0.1007.0>,
         <0.1396.0>,
         <0.1787.0>,
         <0.1788.0>,
         <0.1397.0>,
         <0.1009.0>,
         <0.1395.0>,
         <0.1008.0>,
         <0.231.0>,
         <0.615.0>,
         <0.616.0>,
         <0.614.0>,
         <0.229.0>,
         <0.230.0>,
         <0.46.0>]},
 {dictionary,[{'$ancestors',[pss_sup,<0.45.0>]},
              {'$initial_call',{gen,init_it,
                                    [gen_server,
                                     <0.46.0>,
                                     <0.46.0>,
                                     {local,pss_sess_mgr},
                                     pss_sess_mgr,
                                     [pss_protocol],
                                     []]}}]},
 {trap_exit,true},
 {error_handler,error_handler},
 {priority,normal},
 {group_leader,<0.44.0>},
 {heap_size,233},
 {stack_size,12},
 {reductions,330201},
 {garbage_collection,[{fullsweep_after,65535}]}]



