strange packet loss (solved, dtrace is your friend!)

Joel Reymont joelr1@REDACTED
Wed Aug 12 13:43:59 CEST 2009


I solved this issue with the help of dtrace and thought I'd log the  
troubleshooting process for posterity. As I suspected, it was all my  
own fault.

I'm on Mac OSX Snow Leopard but it should work on Leopard just as  
well. Solaris may require a few tweaks.

I used tcpdump at first to make sure "lost" packets were getting to  
the receiver, e.g.


---
17:04:19.009795 IP (tos 0x0, ttl 64, id 40175, offset 0, flags [DF],  
proto TCP (6), length 57, bad cksum 0 (->9fcd)!)
    localhost.us-cli > localhost.47835: Flags [P.], cksum 0xfe2d  
(incorrect -> 0x7df6), seq 1:6, ack 40, win 65535, options [nop,nop,TS  
val 960438746 ecr 960438746], length 5
	0x0000:  4500 0039 9cef 4000 4006 0000 7f00 0001  E..9..@REDACTED@.......
	0x0010:  7f00 0001 1f92 badb 7b0a ad1e fc66 b940  ........{....f.@
	0x0020:  8018 ffff fe2d 0000 0101 080a 393f 21da  .....-......9?!.
	0x0030:  393f 21da 0003 4143 4b                   9?!...ACK
---


I then used dtruss to look at the system calls and noticed that Erlang  
used recvfrom to receive the packets. I proceeded to use dapptrace to  
look at the function calls and discovered tcp_recv. A simple grep  
located tcp_recv in the inet driver (erts/emulator/drivers/common/ 
inet_drv.c).

I whipped out a custom dtrace script to dump the buffer populated by  
recvfrom as well as to look at the i_remain field of the  
tcp_descriptor structure. i_remain is what controls how much data  
there's to read.

Notice in this output that there are 0 bytes remaining to read.


---
   0  -> tcp_recv                              tcp_recv(0x1006E69B0,  
0x0, 0x0), i_remain = 0

   0    -> recvfrom                            recvfrom(0xB,  
0x1006AD470, 0x5B4)

   0    <- recvfrom                            recvfrom = 17

              0  1  2  3  4  5  6  7  8  9  a  b  c  d  e  f   
0123456789abcdef
          0: 00 03 41 43 4b 7f 00 00 00 00 00 00 00 00 00  
00  ..ACK...........
---


There are bytes remaining to read when a "lost" packet gets into Erlang.


---
   0  -> tcp_recv                              tcp_recv(0x1025C42E8,  
0x0, 0x0), i_remain = 16706

   0    -> recvfrom                            recvfrom(0x75B,  
0x1006AD473, 0x4142)

   0    <- recvfrom                            recvfrom = 17

              0  1  2  3  4  5  6  7  8  9  a  b  c  d  e  f   
0123456789abcdef
          0: 00 03 41 43 4b 00 00 00 00 00 00 00 00 e5 73  
c1  ..ACK.........s.
---


Searching dtrace output from receives on the same socket (0x75B), I  
saw an ACK previously received without the 2-byte length prefix.


---
-> tcp_recv tcp_recv(0x1025C42E8, 0x0, 0x0), i_remain = 0

    
1 
                                                                         
                                                                         
                                                                         
                                                                         
                                                                         
                                                                         
                                                                         
                                                                         
                                                                         
                                                                         
                                                                         
                                                                         
                                                                         
                                                                         
                                                                         
                                                                         
                                                                         
                                                                         
                                                                         
                                                                         
                                                                         
                                                                         
                                                                         
                                                                         
                                                                         
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            -> 
  recvfrom recvfrom(0x75B, 0x1006AD470, 0x5B4)

    
1 
                                                                         
                                                                         
                                                                         
                                                                         
                                                                         
                                                                         
                                                                         
                                                                         
                                                                         
                                                                         
                                                                         
                                                                         
                                                                         
                                                                         
                                                                         
                                                                         
                                                                         
                                                                         
                                                                         
                                                                         
                                                                         
                                                                         
                                                                         
                                                                         
                                                                         
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            < 
- recvfrom recvfrom = 17

              0  1  2  3  4  5  6  7  8  9  a  b  c  d  e  f   
0123456789abcdef
          0: 41 43 4b 43 4b 7f 00 00 00 00 00 00 00 00 00 00   
ACKCK...........
---


My server sent ACK without the size prefix, i.e. {packet, 0} and  
Erlang on the client size interpreted "AC" as the 2-byte packet size  
since the client is in {packet, 2} mode. Erlang then proceeded to  
gather network packets to reach the 16707 byte target and this looked  
like packet loss on the client side.


---
<<Size:16>> = <<16#41, 16#43>>.
<<"AC">>

Size.
16707
---


Finally, here's the dtrace script that helped me take my foot out of  
my own mouth.


---
#pragma D option flowindent

typedef struct {
     char  inet[400];
     int   high;
     int   low;
     int   send_timeout;
     int   send_timeout_close;
     int   busy_on_send;
     int   i_bufsz;
     char* i_buf;
     char*         i_ptr;
     char*         i_ptr_start;
     int           i_remain;
     int           tcp_add_flags;
     int           http_state;
     void *multi_first;
     void *multi_last;
     void *mtd;
} tcp_descriptor;

pid$target::recvfrom:entry {
     self->ptr = arg1;
     printf("%s(0x%X, 0x%X, 0x%X)\n", probefunc, arg0, arg1, arg2);
}

pid$target::recvfrom:return {
     printf("%s = %d\n", probefunc, (int)arg0);
     tracemem(copyin(self->ptr, (int)arg0), 16);
}

pid$target::tcp_recv:entry {
     self->desc = copyin(arg0, sizeof(tcp_descriptor));
     printf("%s(0x%X, 0x%X, 0x%X), i_remain = %d\n", probefunc, arg0,  
arg1, arg2, ((tcp_descriptor *)self->desc)->i_remain);
}
---

I cheated by grabbing the size of 'struct inet_descriptor' in gdb (400  
bytes on x86-64) and defining unessential pointers as 'void *'. Note  
my use of 'tracemem' to dump memory, 'copyin' to move memory into the  
local memory space (dtrace runs in the kernel) and, finally, the  
casting to 'tcp_descriptor *' to grab the i_remain field.

Why the mucking about with packet size?

This particular server handles flash connections. On the one hand,  
Flash doesn't know about {packet, 2} and on the other hand the only  
thing that Flash does that's out of programmer control is request the  
policy file.

The server starts with {packet, 0} and looks for <<"<policy-file- 
request/>", 0>>, sent automatically by browser Flash. Once Flash  
control is back to the programmer, <<255, 255, 255, 255>> is sent to  
switch the server into {packet, 2}.

I use a gen_fsm to distinguish between packet length of 65535 (<<255,  
255>>) and the <<255, 255, 255, 255>> switching sequence above.

---
Mac hacker with a performance bent
http://www.linkedin.com/in/joelreymont



More information about the erlang-questions mailing list