[erlang-questions] why is gen_tcp:send slow?

Wed Jun 25 09:44:04 CEST 2008

Hi,
no, unfortunately I don't have an answer for their observed performance. 
Although I haven't really looked at it either. Time, you know... :-)
But keep testing and try to figure it out. You always learn something, and 
hopefully you'll find the answer as well.

	Johnny

Edwin Fine wrote:
> Ok Johnny, I think I get it now. Thanks for the detailed explanation. I
> wonder why the original poster (Sergej/Rapsey) is seeing such poor TCP/IP
> performance? In any case, I am still going to do some more benchmarks to see
> if I can understand how the different components of TCP/IP communication in
> Erlang (inet:setopts() and gen_tcp) affect performance, CPU overhead and so
> on.
> 
> The reason I got into all this is because I was seeing very good performance
> between two systems on a LAN, and terrible performance over a non-local
> overseas link that had an RTT of about 290ms. Through various measurements
> and Wireshark usage I found the link was carrying only 3.4 packets per
> second, with only about 56 data bytes in each packet. When I investigated
> further, I found that a function I thought was running asynchronously was
> actually running synchronously inside a gen_server:call(). When I spawned
> the function, I still only saw 3.4 packets per second (using Wireshark
> timestamps) but each packet was now full of multiple blocks of data, not
> just 56 bytes, so the actual throughput went up hugely. Nothing else
> changed. When I tried to find out where the 3.4 was coming from, I
> calculated 1/3.4 = 0.294 s, or 294 ms, which was (coincidentally?) almost
> exactly the RTT. That's why I thought there was a relationship between RTT
> and the number of packets/second a link could carry.
> 
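The 3.4 figure would be consistent with each send waiting for a full round trip before the next one went out. A minimal sketch of that arithmetic, assuming strictly one 56-byte message per 290 ms round trip (the variable names are illustrative and not taken from the benchmark code):

```python
# Assumption: one message is sent, then the sender waits a full round
# trip (as a synchronous gen_server:call would) before sending the next.
rtt_seconds = 0.290      # measured RTT of the overseas link
payload_bytes = 56       # data bytes seen per packet in Wireshark

messages_per_second = 1 / rtt_seconds
bytes_per_second = messages_per_second * payload_bytes

print(f"{messages_per_second:.2f} messages/s")
print(f"{bytes_per_second:.0f} bytes/s")
```

Under that assumption the packet rate follows the RTT only because the sender chose to wait. Putting more data into each round trip raises throughput without changing the packet rate, which matches what the spawned version showed.
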
> Now I have to go back and try to figure it all out all over again :( unless
> you can explain it to me (he said hopefully).
> 
> Thanks
> Ed
> 
> 
> On Tue, Jun 24, 2008 at 7:15 PM, Johnny Billquist <bqt@REDACTED> wrote:
> 
>> Edwin, always happy to help out...
>>
>> Edwin Fine wrote:
>>
>>> Johnny,
>>>
>>> Thanks for the lesson! I am always happy to learn. Like I said, I am not an
>>> expert in TCP/IP.
>>>
>>> What I was writing about when I said that packets are acknowledged is what I
>>> saw in Wireshark while trying to understand performance issues. I perhaps
>>> should have said "TCP/IP" instead of just "TCP". There were definitely
>>> acknowledgements, but I guess they were at the IP level.
>>>
>> No. IP doesn't have any acknowledgements. IP (as well as UDP) is basically
>> just sending packets without any guarantee that they will ever reach the
>> other end.
>> What you saw were TCP acknowledgements, but you misunderstood how they work.
>>
>> Think of a TCP connection as an infinitely long stream of bytes. Each byte
>> in this stream has a sequence number. TCP sends bytes in this stream,
>> packed into IP packets. Each IP packet will have one or several bytes from
>> that stream.
>> TCP at the other end will acknowledge the highest-ordered byte that it has
>> received. How many packets it took to get to that byte is irrelevant, as are
>> any retransmissions, and so on... The window size tells how many additional
>> bytes from this stream can be sent, counted from the point that the
>> acknowledgement points at.
>>
>> (In reality, the sequence numbers are not infinite, but are actually a
>> 32-bit number, which wraps. But since window sizes normally fit in a 16-bit
>> quantity, there is no chance of getting back to the same sequence number
>> before its previous use has long since passed, so there is no risk of
>> confusion or errors there.)
>>
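A small sketch of the wrap-around arithmetic, assuming a 32-bit sequence space and a window of at most 65535 bytes as described above. The helper names seq_add and in_window are made up for illustration and are not part of any TCP stack's API:

```python
SEQ_MOD = 2 ** 32        # sequence numbers are 32-bit and wrap around
MAX_WINDOW = 65535       # largest window that fits in 16 bits

def seq_add(seq, nbytes):
    """Advance a sequence number, wrapping at 2**32."""
    return (seq + nbytes) % SEQ_MOD

def in_window(seq, ack, window):
    """True if seq lies within `window` bytes after the acknowledged point."""
    return (seq - ack) % SEQ_MOD < window

ack = SEQ_MOD - 100              # acknowledged point just before the wrap
wrapped = seq_add(ack, 500)      # 500 bytes later, past the wrap

print(wrapped)
print(in_window(wrapped, ack, MAX_WINDOW))
print(in_window(seq_add(ack, -1), ack, MAX_WINDOW))
```

Because the window is tiny compared with the sequence space, the modular difference is enough to tell new bytes from old ones even across the wrap.
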
>>> I wonder what the MSS is for loopback? I think it's about 1536 on my eth0
>>> interface, but not sure.
>>>
>> Smart implementations use the MTU - 40 of the interface as the MSS for a
>> loopback connection. Otherwise the rule of thumb is that if it's on the same
>> network, MSS is usually set to 1460, and to 536 for destinations on other
>> networks.
>> This comes from the fact that the local network (usually Ethernet) has an
>> MTU of 1500, and the IP header is normally 20 bytes, and so is the standard
>> TCP header, leaving 1460 bytes of data in an Ethernet frame.
>> For non-local destinations, IP requires that at least 576-byte packets can
>> go through unfragmented. The rest follows. :-)
>>
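The MSS figures above are plain subtraction. A minimal sketch, assuming 20-byte IP and TCP headers with no options (the function name mss_for_mtu is illustrative):

```python
IP_HEADER_BYTES = 20   # IPv4 header without options
TCP_HEADER_BYTES = 20  # TCP header without options

def mss_for_mtu(mtu):
    """Largest TCP payload that fits in one unfragmented packet."""
    return mtu - IP_HEADER_BYTES - TCP_HEADER_BYTES

print(mss_for_mtu(1500))  # Ethernet MTU
print(mss_for_mtu(576))   # size assumed safe for non-local paths
```
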
>>> As for RTT, I sent data over a link that had a very long (290ms) RTT, and
>>> that definitely limited the rate at which packets could be sent. Can RTT be
>>> used to calculate the theoretical maximum traffic that a link can carry?
>>> For example, a satellite link with a 400ms RTT but 2 Mbps bandwidth?
>>>
>> No. RTT cannot be used to calculate anything regarding traffic bandwidth.
>> You can keep sending packets until the window is exhausted, no matter what
>> the RTT says. The RTT is only used to calculate when to do retransmissions
>> if you haven't received an ACK.
>> The only other thing that affects packet rates is the slow start
>> algorithm. That will be affected by the round-trip delay, since it adds a
>> throttling effect on the window, in addition to what the receiver says. The
>> reason for it being affected by the round-trip delay is that the slow start
>> window size is only increased when you get ACK packets back.
>> But, assuming the link can take the load, and you don't lose a lot of
>> packets, the slow start algorithm will pretty quickly stop being a factor.
>>
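To get a feel for "pretty quickly", here is a rough sketch of slow start under simplifying assumptions that are not from this thread: the congestion window starts at one 1460-byte segment, doubles once per round trip, nothing is lost, and the receiver advertises a 65535-byte window. Real stacks differ in the initial window and in how they grow it.

```python
def round_trips_to_fill(window_bytes, mss_bytes):
    """Count doublings until the congestion window covers window_bytes."""
    cwnd = mss_bytes          # assumed initial congestion window: 1 segment
    round_trips = 0
    while cwnd < window_bytes:
        cwnd *= 2             # idealised slow start: double per round trip
        round_trips += 1
    return round_trips

rtt_seconds = 0.290           # the long overseas RTT discussed above
trips = round_trips_to_fill(65535, 1460)
print(trips, "round trips, about", round(trips * rtt_seconds, 2), "seconds")
```
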
>>        Johnny
>>
>>
>>
>>> Ed
>>>
>>> On Tue, Jun 24, 2008 at 6:00 PM, Johnny Billquist <bqt@REDACTED> wrote:
>>>
>>>> No. TCP doesn't acknowledge every packet. In fact, TCP doesn't acknowledge
>>>> packets as such at all. TCP is not packet based. It's just that if you use
>>>> IP as the carrier, IP itself is packet based.
>>>> TCP can in theory generate any number of packets per second. However, the
>>>> amount of unacknowledged data that can be outstanding at any time is
>>>> limited by the transmit window. Each packet carries a window size, which
>>>> is how much more data can be accepted by the receiver. TCP can (is
>>>> allowed to) send that much data and no more.
>>>>
>>>> The RTT calculations are used for figuring out how long to wait before
>>>> doing retransmissions. You also normally have a slow start transmission
>>>> algorithm which prevents the sender from even using the full window size
>>>> from the start, as a way of avoiding congestion. That is used in
>>>> combination with a backoff algorithm when retransmissions are needed to
>>>> further decrease congestion, but all of this only really comes into effect
>>>> if you start losing data, and TCP actually needs to do retransmissions.
>>>>
>>>> Another thing you have is an algorithm called Nagle, which tries to
>>>> collect small amounts of data sent into larger packets before sending
>>>> them, so that you don't flood the net with silly small packets.
>>>>
>>>> One additional detail is that receivers normally, when the receive
>>>> buffers become full, don't announce newly freed space immediately, since
>>>> that is normally a rather small amount, but instead wait a while, until a
>>>> larger part of the receive buffer is free, so that the sender actually can
>>>> send some full-sized packets once it starts sending again.
>>>>
>>>> In addition to all this, you also have a max segment size which is
>>>> negotiated between the TCP ends, which limits the size of a single IP
>>>> packet sent by the TCP protocol. This is done in order to try to avoid
>>>> packet fragmentation.
>>>>
>>>> So the window size is actually a flow control mechanism, and is in reality
>>>> limiting the amount of data that can be sent. And it varies all the time.
>>>> And the number of packets that will be used for sending that much data is
>>>> determined by the MSS (Max Segment Size).
>>>>
>>>> Sorry for the long text on how TCP works. :-)
>>>>
>>>>       Johnny
>>>>
>>>> Edwin Fine wrote:
>>>>
>>>>> David,
>>>>> Thanks for trying out the benchmark.
>>>>>
>>>>> With my limited knowledge of TCP/IP, I believe you are seeing the 300,000
>>>>> limit because TCP/IP requires acknowledgements to each packet, and
>>>>> although it can batch up multiple acknowledgements in one packet, there
>>>>> is a theoretical limit of packets per second beyond which it cannot go
>>>>> due to the laws of physics. I understand that limit is determined by the
>>>>> Round-Trip Time (RTT), which can be shown by ping. On my system, pinging
>>>>> 127.0.0.1 gives a minimum RTT of 0.018 ms (out of 16 pings). That means
>>>>> that the maximum number of packets that can make it to the destination
>>>>> and back per second is 1/0.000018 seconds, or 55555 packets per second.
>>>>> The TCP/IP stack is evidently packing 5 or 6 blocks into each packet to
>>>>> get the 300K blocks/sec you are seeing. Using Wireshark or Ethereal would
>>>>> confirm this. I am guessing that this means that the TCP window is about
>>>>> 6 * 1000 bytes or 6KB.
>>>>>
>>>>> What I neglected to tell this group is that I have modified the Linux
>>>>> sysctl.conf as follows, which might have had an effect (like I said, I am
>>>>> not an expert):
>>>>>
>>>>> # increase Linux autotuning TCP buffer limits
>>>>> # min, default, and max number of bytes to use
>>>>> # set max to at least 4MB, or higher if you use very high BDP paths
>>>>> net.ipv4.tcp_rmem = 4096 87380 16777216
>>>>> net.ipv4.tcp_wmem = 4096 32768 16777216
>>>>>
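The "BDP" in the sysctl comment is the bandwidth-delay product: the number of bytes that must be in flight to keep a path full for one round trip. A quick sketch of the arithmetic, using a hypothetical 2 Mbps link with a 400 ms RTT as the example (the function name is illustrative):

```python
def bandwidth_delay_product(bits_per_second, rtt_ms):
    """Bytes in flight needed to keep the path full for one round trip."""
    return bits_per_second * rtt_ms // 1000 // 8

TCP_MEM_MAX = 16777216  # max value from the tcp_rmem/tcp_wmem lines above

bdp = bandwidth_delay_product(2_000_000, 400)
print(bdp, "bytes in flight")
print("fits within max buffer:", bdp <= TCP_MEM_MAX)
```

A buffer smaller than the BDP would let the window run out before acknowledgements come back, which is why the comment suggests raising the maximum for such paths.
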
>>>>> When I have more time, I will vary a number of different Erlang TCP/IP
>>>>> parameters and get a data set together that gives a broader picture of the
>>>>> effect of the parameters.
>>>>>
>>>>> Thanks again for taking the time.
>>>>>
>>>>> 2008/6/24 David Mercer <dmercer@REDACTED>:
>>>>>
>>>>>   I tried some alternative block sizes (using the blksize option).  I
>>>>>   found that from 1 to somewhere around 1000 bytes (maybe a bit short
>>>>>   of it), the test was able to send about 300,000 blocks in 10 seconds
>>>>>   regardless of size.  (That means, 0.03 MB/sec for block size of 1,
>>>>>   0.3 MB/sec for block size of 10, 3 MB/sec  for block size of 100,
>>>>>   etc.)  I suspect the system was CPU bound at those levels.
>>>>>
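The throughput figures in parentheses follow directly from the block count. A small sketch of that calculation, taking 1 MB as 10**6 bytes (an assumption; the function name is illustrative):

```python
BLOCKS_SENT = 300_000   # roughly constant for block sizes up to ~1000 bytes
TEST_SECONDS = 10       # default run time of the benchmark

def throughput_mb_per_sec(block_size_bytes):
    """MB/sec (1 MB = 10**6 bytes) for a given block size."""
    return BLOCKS_SENT * block_size_bytes / TEST_SECONDS / 1_000_000

for size in (1, 10, 100):
    print(size, "bytes/block ->", throughput_mb_per_sec(size), "MB/sec")
```
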
>>>>>
>>>>>   Above 1000, the number of blocks sent seemed to decrease, though
>>>>>   this was more than offset by the increased size of the blocks.
>>>>>   Above about 10,000 byte blocks (may have been less, I didn't check
>>>>>   any value between 4,000 and 10,000), however, performance peaked and
>>>>>   block size no longer mattered: it always sent between 70 and 80
>>>>>   MB/sec.  My machine is clearly slower than Edwin's…
>>>>>
>>>>>
>>>>>   DBM
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>  ------------------------------------------------------------------------
>>>>>
>>>>>   *From:* erlang-questions-bounces@REDACTED *On Behalf Of* Rapsey
>>>>>   *Sent:* Tuesday, June 24, 2008 14:01
>>>>>   *To:* erlang-questions@REDACTED
>>>>>   *Subject:* Re: [erlang-questions] why is gen_tcp:send slow?
>>>>>
>>>>>
>>>>>   You're using very large packets. I think the results would be much
>>>>>   more telling if the packets would be a few kB at most. That is
>>>>>   closer to most real life situations.
>>>>>
>>>>>
>>>>>   Sergej
>>>>>
>>>>>   On Tue, Jun 24, 2008 at 8:43 PM, Edwin Fine
>>>>>   <erlang-questions_efine@REDACTED> wrote:
>>>>>
>>>>>   I wrote a small benchmark in Erlang to see how fast I could get
>>>>>   socket communications to go. All the benchmark does is pump the same
>>>>>   buffer to a socket for (by default) 10 seconds. It uses {active,
>>>>>   once} each time, just like you do.
>>>>>
>>>>>   Server TCP options:
>>>>>           {active, once},
>>>>>           {reuseaddr, true},
>>>>>           {packet, 0},
>>>>>           {packet_size, 65536},
>>>>>           {recbuf, 1000000}
>>>>>
>>>>>   Client TCP options:
>>>>>           {packet, raw},
>>>>>           {packet_size, 65536},
>>>>>           {sndbuf, 1024 * 1024},
>>>>>           {send_timeout, 3000}
>>>>>
>>>>>   Here are some results using Erlang R12B-3 (erl +K true in the Linux
>>>>>   version):
>>>>>
>>>>>   Linux (Ubuntu 8.10 x86_64, Intel Core 2 Q6600, 8 GB):
>>>>>   - Using localhost (127.0.0.1): 7474.14 MB in 10.01 secs (746.66
>>>>>   MB/sec)
>>>>>   - Using 192.168.x.x IP address: 8064.94 MB in 10.00 secs (806.22
>>>>>   MB/sec) [Don't ask me why it's faster than using loopback, I
>>>>>   repeated the tests and got the same result]
>>>>>
>>>>>   Windows XP SP3 (32 bits), Intel Core 2 Duo E6600:
>>>>>   - Using loopback: 2166.97 MB in 10.02 secs (216.35 MB/sec)
>>>>>   - Using 192.168.x.x IP address: 2140.72 MB in 10.02 secs (213.75
>>>>>   MB/sec)
>>>>>   - On Gigabit Ethernet to the Q6600 Linux box: 1063.61 MB in 10.02
>>>>>   secs (106.17 MB/sec) using non-jumbo frames. I don't think my router
>>>>>   supports jumbo frames.
>>>>>
>>>>>   There's undoubtedly a huge discrepancy between the two systems,
>>>>>   whether because of kernel poll in Linux, or that it's 64 bits, or
>>>>>   unoptimized Windows TCP/IP flags, I don't know. I don't believe it's
>>>>>   the number of CPUs (there's only 1 process sending and one
>>>>>   receiving), or the CPU speed (they are both 2.4 GHz Core 2s).
>>>>>
>>>>>   Maybe some Erlang TCP/IP gurus could comment.
>>>>>
>>>>>   I've attached the code for interest. It's not supposed to be
>>>>>   production quality, so please don't beat me up :) although I am
>>>>>   always open to suggestions for improvement. If you do improve it,
>>>>>   I'd like to see what you've done. Maybe there is another simple
>>>>>   Erlang tcp benchmark program out there (i.e. not Tsung), but I
>>>>>   couldn't find one in a cursory Google search.
>>>>>
>>>>>   To run:
>>>>>
>>>>>   VM1:
>>>>>
>>>>>   tb_server:start(Port, Opts).
>>>>>   tb_server:stop() to stop.
>>>>>
>>>>>   Port = integer()
>>>>>   Opts = []|[opt()]
>>>>>   opt() = {atom(), term()} (Accepts inet setopts options, too)
>>>>>
>>>>>   The server prints out the transfer rate (for simplicity).
>>>>>
>>>>>   VM2:
>>>>>   tb_client(Host, Port, Opts).
>>>>>
>>>>>   Host = atom()|string() hostname or IP address
>>>>>   Port, Opts as in tb_server
>>>>>
>>>>>   Runs for 10 seconds, sending a 64K buffer as fast as possible to
>>>>>   Host/Port.
>>>>>   You can change this to 20 seconds (e.g.) by adding the tuple
>>>>>   {time_limit, 20000} to Opts.
>>>>>   You can change buffer size by adding the tuple {blksize, Bytes} to
>>>>>   Opts.
>>>>>
>>>>>   2008/6/20 Rapsey <rapsey@REDACTED>:
>>>>>
>>>>>   All data goes through nginx which acts as a proxy. Its CPU
>>>>>   consumption is never over 1%.
>>>>>
>>>>>
>>>>>   Sergej
>>>>>
>>>>>
>>>>>   On Thu, Jun 19, 2008 at 9:35 PM, Javier París Fernández
>>>>>   <javierparis@REDACTED> wrote:
>>>>>
>>>>>
>>>>>   On 19/06/2008, at 20:06, Rapsey wrote:
>>>>>
>>>>>
>>>>>       It loops from another module, that way I can update the code at
>>>>>       any time without disrupting anything.
>>>>>       The packets are generally a few hundred bytes big, except
>>>>>       keyframes which tend to be in the kB range. I haven't tried
>>>>>       looking with wireshark.  Still it seems a bit odd that a large
>>>>>       CPU consumption would be the symptom. The traffic is strictly
>>>>>       one way. Either someone is sending the stream or receiving it.
>>>>>       The transmit could of course be written with a passive receive,
>>>>>       but the code would be significantly uglier. I'm sure someone
>>>>>       here knows if setting {active, once} every packet is CPU
>>>>>       intensive or not.
>>>>>       It seems the workings of gen_tcp are quite platform dependent. If
>>>>>       I run the code on Windows, sending more than 128 bytes per
>>>>>       gen_tcp call significantly decreases network output.
>>>>>       Oh and I forgot to mention I use R12B-3.
>>>>>
>>>>>
>>>>>   Hi,
>>>>>
>>>>>   Without being an expert.
>>>>>
>>>>>   200-300 mb/s  in small (hundreds of bytes) packets means a *lot* of
>>>>>   system calls if you are doing a gen_tcp:send for each one. If you
>>>>>   buffer 3 packets, you are reducing that by a factor of 3 :). I'd try
>>>>>   to do a small test doing the same thing in C and compare the
>>>>>   results. I think it will also eat a lot of CPU.
>>>>>
>>>>>   About the proxy CPU... I'm a bit lost about it, but speculating
>>>>>   wildly it is possible that the time spent doing the system calls
>>>>>   that gen_tcp is doing is added to the proxy CPU process.
>>>>>
>>>>>   Regards.
>>>>>
>>>>>
>>>>>
>>>>>   _______________________________________________
>>>>>   erlang-questions mailing list
>>>>>   erlang-questions@REDACTED
>>>>>   http://www.erlang.org/mailman/listinfo/erlang-questions
>>>>
>> --
>> Johnny Billquist                  || "I'm on a bus
>>                                  ||  on a psychedelic trip
>> email: bqt@REDACTED             ||  Reading murder books
>> pdp is alive!                     ||  tryin' to stay hip" - B. Idol
>>
>>
> 

-- 
Johnny Billquist                  || "I'm on a bus
                                   ||  on a psychedelic trip
email: bqt@REDACTED             ||  Reading murder books
pdp is alive!                     ||  tryin' to stay hip" - B. Idol