[erlang-questions] why is gen_tcp:send slow?

Edwin Fine erlang-questions_efine@REDACTED
Wed Jun 25 02:38:15 CEST 2008


Ok Johnny, I think I get it now. Thanks for the detailed explanation. I
wonder why the original poster (Sergej/Rapsey) is seeing such poor TCP/IP
performance? In any case, I am still going to do some more benchmarks to see
if I can understand how the different components of TCP/IP communication in
Erlang (inet:setopts() and gen_tcp) affect performance, CPU overhead and so
on.

The reason I got into all this is that I was seeing very good performance
between two systems on a LAN, and terrible performance over a non-local
overseas link that had an RTT of about 290 ms. Through various measurements
and Wireshark usage I found the link was carrying only 3.4 packets per
second, with only about 56 data bytes in each packet. When I investigated
further, I found that a function I thought was running asynchronously was
actually running synchronously inside a gen_server:call(). When I spawned
the function, I still only saw 3.4 packets per second (using Wireshark
timestamps), but each packet was now full of multiple blocks of data, not
just 56 bytes, so the actual throughput went up hugely. Nothing else
changed. When I tried to find out where the 3.4 was coming from, I
calculated 1/3.4 = 0.294 s = 294 ms, which was (coincidentally?) almost
exactly the RTT. That's why I thought there was a relationship between RTT
and the number of packets/second a link could carry.
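For anyone who hits the same thing, here is a minimal sketch (hypothetical
state record and message names, not my actual code) of the difference.
Sending on a gen_tcp socket is permitted from any process, so spawning the
send off is safe when you don't need its result:

    %% Synchronous: every caller waits for gen_tcp:send/2 to finish,
    %% so callers are serialized behind the socket round trip.
    handle_call({send, Data}, _From, #state{sock = Sock} = State) ->
        ok = gen_tcp:send(Sock, Data),
        {reply, ok, State};

    %% Asynchronous: reply immediately and let a spawned process
    %% do the send.
    handle_call({send_async, Data}, _From, #state{sock = Sock} = State) ->
        spawn(fun() -> gen_tcp:send(Sock, Data) end),
        {reply, ok, State}.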

Now I have to go back and try to figure it all out all over again :( unless
you can explain it to me (he said hopefully).

Thanks
Ed


On Tue, Jun 24, 2008 at 7:15 PM, Johnny Billquist <bqt@REDACTED> wrote:

> Edwin, always happy to help out...
>
> Edwin Fine skrev:
>
>> Johnny,
>>
>> Thanks for the lesson! I am always happy to learn. Like I said, I am not
>> an expert in TCP/IP.
>>
>> What I was writing about when I said that packets are acknowledged is
>> what I saw in Wireshark while trying to understand performance issues. I
>> perhaps should have said "TCP/IP" instead of just "TCP". There were
>> definitely acknowledgements, but I guess they were at the IP level.
>>
>
> No. IP doesn't have any acknowledgements. IP (as well as UDP) basically
> just sends packets without any guarantee that they will ever reach the
> other end.
> What you saw was TCP acknowledgements, but you misunderstood how they work.
>
> Think of a TCP connection as a stream of bytes of unbounded length. Each
> byte in this stream has a sequence number. TCP sends bytes from this
> stream, packed into IP packets. Each IP packet will carry one or several
> bytes of that stream.
> TCP at the other end will acknowledge the highest-numbered byte that it
> has received. How many packets it took to get to that byte is irrelevant,
> as are any retransmissions, and so on... The window size tells how many
> additional bytes from this stream can be sent, counted from the point
> that the acknowledgement refers to.
>
> (In reality, the sequence numbers are not infinite, but are actually a
> 32-bit number, which wraps. But since window sizes normally fit in a
> 16-bit quantity, there is no chance of reaching the same sequence number
> again before its previous use is long past, so no risk of confusion or
> errors there.)
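> As a worked example of that accounting (my own notation, not part of the
> protocol): if LastAck is the highest byte the receiver has acknowledged,
> Window the advertised window, and NextSeq the next byte to be sent, then
> the sender may still emit
>
>     %% Bytes the sender may still emit before exhausting the window.
>     usable_window(LastAck, Window, NextSeq)
>       when LastAck + Window >= NextSeq ->
>         (LastAck + Window) - NextSeq;
>     usable_window(_LastAck, _Window, _NextSeq) ->
>         0.
>
> bytes. For example, usable_window(1000, 65535, 20000) = 46535, no matter
> how many packets carried the bytes sent so far.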
>
>> I wonder what the MSS is for loopback? I think it's about 1536 on my
>> eth0 interface, but not sure.
>>
>
> Smart implementations use the interface MTU minus 40 as the MSS for a
> loopback connection. Otherwise the rule of thumb is that if the
> destination is on the same network, the MSS is usually set to 1460, and
> to 536 for destinations on other networks.
> This comes from the fact that the local network (usually ethernet) has an
> MTU of 1500, and the IP header is normally 20 bytes, and so is the
> standard TCP header, leaving 1460 bytes of data in an ethernet frame.
> For non-local destinations, IP requires that at least 576-byte packets
> can go through unfragmented. The rest follows. :-)
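> The arithmetic as a one-liner (assuming 20-byte IP and TCP headers, i.e.
> no options in either header):
>
>     mss(Mtu) -> Mtu - 40.
>     %% mss(1500) -> 1460  (local ethernet)
>     %% mss(576)  -> 536   (minimum unfragmented IP datagram)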
>
>> As for RTT, I sent data over a link that had a very long (290 ms) RTT,
>> and that definitely limited the rate at which packets could be sent. Can
>> RTT be used to calculate the theoretical maximum traffic that a link can
>> carry? For example, a satellite link with a 400 ms RTT but 2 Mbps
>> bandwidth?
>>
>
> No. RTT cannot be used to calculate anything regarding traffic bandwidth.
> You can keep sending packets until the window is exhausted, no matter what
> the RTT says. The RTT is only used to calculate when to do retransmissions
> if you haven't received an ACK.
> The only other thing that affects packet rates is the slow start
> algorithm. That will be affected by the round trip delay, since it adds a
> throttling effect on the window, in addition to what the receiver says.
> The reason it is affected by the round trip delay is that the slow start
> window is only increased when you get ACK packets back.
> But assuming the link can take the load, and you don't lose a lot of
> packets, the slow start algorithm will pretty quickly stop being a factor.
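> (To connect this to Edwin's satellite example: the RTT alone determines
> nothing, but the window size together with the RTT does bound throughput,
> so the window must cover the bandwidth-delay product to keep a link full.
> A back-of-the-envelope sketch:
>
>     %% Window (in bytes) needed to keep a link of the given bit rate
>     %% busy across the given round trip time.
>     bdp_bytes(BitsPerSec, RttSecs) ->
>         round(BitsPerSec / 8 * RttSecs).
>     %% bdp_bytes(2000000, 0.4) -> 100000
>
> So a 2 Mbps link with a 400 ms RTT needs roughly a 100 KB window to stay
> full; with a smaller window, throughput stays below the link rate.)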
>
>        Johnny
>
>
>
>> Ed
>>
>> On Tue, Jun 24, 2008 at 6:00 PM, Johnny Billquist <bqt@REDACTED> wrote:
>>
>>> No. TCP doesn't acknowledge every packet. In fact, TCP doesn't
>>> acknowledge packets as such at all. TCP is not packet based. It's just
>>> that if you use IP as the carrier, IP itself is packet based.
>>> TCP can in theory generate any number of packets per second. However,
>>> the amount of unacknowledged data that can be outstanding at any time
>>> is limited by the transmit window. Each packet carries a window size,
>>> which is how much more data can be accepted by the receiver. TCP can
>>> (is allowed to) send that much data and no more.
>>>
>>> The RTT calculations are used for figuring out how long to wait before
>>> doing retransmissions. You also normally have a slow start transmission
>>> algorithm, which prevents the sender from even using the full window
>>> size from the start, as a way of avoiding congestion. That is used in
>>> combination with a backoff algorithm, when retransmissions are needed,
>>> to further reduce congestion, but all of this only really comes into
>>> effect if you start losing data and TCP actually needs to do
>>> retransmissions.
>>>
>>> Another thing you have is an algorithm called Nagle, which tries to
>>> collect small amounts of data into larger packets before sending, so
>>> that you don't flood the net with silly small packets.
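>>> (In Erlang, for what it's worth, Nagle can be disabled per socket with
>>> the nodelay option; a minimal sketch, where Host and Port are
>>> placeholders:
>>>
>>>     %% Disable Nagle when connecting...
>>>     {ok, Sock} = gen_tcp:connect(Host, Port, [binary, {nodelay, true}]),
>>>     %% ...or on an already open socket:
>>>     ok = inet:setopts(Sock, [{nodelay, true}]).
>>>
>>> That trades more, smaller packets for lower latency.)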
>>>
>>> One additional detail is that when their receive buffers become full,
>>> receivers normally don't announce newly freed space immediately, since
>>> that is normally a rather small amount, but instead wait a while, until
>>> a larger part of the receive buffer is free, so that the sender can
>>> actually send some full-sized packets once it starts sending again.
>>>
>>> In addition to all this, you also have a maximum segment size which is
>>> negotiated between the TCP ends, and which limits the size of a single
>>> IP packet sent by the TCP protocol. This is done in order to try to
>>> avoid packet fragmentation.
>>>
>>> So the window size is actually a flow control mechanism, and it is in
>>> reality what limits the amount of data that can be sent. And it varies
>>> all the time. And the number of packets that will be used for sending
>>> that much data is determined by the MSS (Max Segment Size).
>>>
>>> Sorry for the long text on how TCP works. :-)
>>>
>>>       Johnny
>>>
>>> Edwin Fine wrote:
>>>
>>>  David,
>>>>
>>>> Thanks for trying out the benchmark.
>>>>
>>>> With my limited knowledge of TCP/IP, I believe you are seeing the
>>>> 300,000 limit because TCP/IP requires acknowledgements to each packet,
>>>> and although it can batch up multiple acknowledgements in one packet,
>>>> there is a theoretical limit of packets per second beyond which it
>>>> cannot go due to the laws of physics. I understand that limit is
>>>> determined by the Round-Trip Time (RTT), which can be shown by ping.
>>>> On my system, pinging 127.0.0.1 gives a minimum RTT of 0.018 ms (out
>>>> of 16 pings). That means that the maximum number of packets that can
>>>> make it to the destination and back per second is 1/0.000018, or about
>>>> 55,555 packets per second. The TCP/IP stack is evidently packing 5 or
>>>> 6 blocks into each packet to get the 300K blocks/sec you are seeing.
>>>> Using Wireshark or Ethereal would confirm this. I am guessing that
>>>> this means that the TCP window is about 6 * 1000 bytes, or 6 KB.
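>>>> (The RTT-to-packet-rate conversion in the Erlang shell, for anyone
>>>> checking my numbers:
>>>>
>>>>     Rtt = 0.018 / 1000.0,     %% 0.018 ms expressed in seconds
>>>>     MaxRoundTrips = 1 / Rtt.  %% ~55555.6 round trips per second
>>>> )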
>>>>
>>>> What I neglected to tell this group is that I have modified the Linux
>>>> sysctl.conf as follows, which might have had an effect (like I said, I
>>>> am
>>>> not an expert):
>>>>
>>>> # increase Linux autotuning TCP buffer limits
>>>> # min, default, and max number of bytes to use
>>>> # set max to at least 4MB, or higher if you use very high BDP paths
>>>> net.ipv4.tcp_rmem = 4096 87380 16777216
>>>> net.ipv4.tcp_wmem = 4096 32768 16777216
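>>>> (On the Erlang side you can at least verify what buffer sizes a socket
>>>> really got; a small sketch using inet:getopts/2, with a placeholder
>>>> host and port:
>>>>
>>>>     {ok, Sock} = gen_tcp:connect("localhost", 5001, [binary]),
>>>>     {ok, Opts} = inet:getopts(Sock, [sndbuf, recbuf, buffer]),
>>>>     io:format("effective buffers: ~p~n", [Opts]).
>>>> )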
>>>>
>>>> When I have more time, I will vary a number of different Erlang TCP/IP
>>>> parameters and get a data set together that gives a broader picture of
>>>> the
>>>> effect of the parameters.
>>>>
>>>> Thanks again for taking the time.
>>>>
>>>> 2008/6/24 David Mercer <dmercer@REDACTED>:
>>>>
>>>>   I tried some alternative block sizes (using the blksize option). I
>>>>   found that from 1 up to somewhere around (maybe a bit short of) 1000
>>>>   bytes, the test was able to send about 300,000 blocks in 10 seconds
>>>>   regardless of size. (That means 0.03 MB/sec for a block size of 1,
>>>>   0.3 MB/sec for a block size of 10, 3 MB/sec for a block size of 100,
>>>>   etc.) I suspect the system was CPU bound at those levels.
>>>>
>>>>
>>>>   Above 1000, the number of blocks sent seemed to decrease, though
>>>>   this was more than offset by the increased size of the blocks. Above
>>>>   about 10,000-byte blocks (it may have been less; I didn't check any
>>>>   value between 4,000 and 10,000), however, performance peaked and
>>>>   block size no longer mattered: it always sent between 70 and 80
>>>>   MB/sec. My machine is clearly slower than Edwin's…
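>>>>   (Those MB/sec figures are just blocks * block size / time; e.g.:
>>>>
>>>>       throughput_mb(Blocks, BlkSize, Secs) ->
>>>>           Blocks * BlkSize / Secs / 1000000.
>>>>       %% throughput_mb(300000, 100, 10) -> 3.0
>>>>   )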
>>>>
>>>>
>>>>   DBM
>>>>
>>>>
>>>>
>>>>
>>>>  ------------------------------------------------------------------------
>>>>
>>>>   *From:* erlang-questions-bounces@REDACTED
>>>>   [mailto:erlang-questions-bounces@REDACTED] *On Behalf Of* Rapsey
>>>>   *Sent:* Tuesday, June 24, 2008 14:01
>>>>   *To:* erlang-questions@REDACTED
>>>>   *Subject:* Re: [erlang-questions] why is gen_tcp:send slow?
>>>>
>>>>
>>>>   You're using very large packets. I think the results would be much
>>>>   more telling if the packets were a few kB at most. That is closer to
>>>>   most real-life situations.
>>>>
>>>>
>>>>   Sergej
>>>>
>>>>   On Tue, Jun 24, 2008 at 8:43 PM, Edwin Fine
>>>>   <erlang-questions_efine@REDACTED> wrote:
>>>>
>>>>   I wrote a small benchmark in Erlang to see how fast I could get
>>>>   socket communications to go. All the benchmark does is pump the same
>>>>   buffer to a socket for (by default) 10 seconds. It uses {active,
>>>>   once} each time, just like you do.
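>>>>   (For reference, the {active, once} pattern I mean is the usual
>>>>   one-message-at-a-time loop; a minimal sketch, assuming a binary-mode
>>>>   socket and hypothetical function names:
>>>>
>>>>       loop(Sock, Bytes) ->
>>>>           receive
>>>>               {tcp, Sock, Data} ->
>>>>                   %% re-arm the socket for exactly one more message
>>>>                   ok = inet:setopts(Sock, [{active, once}]),
>>>>                   loop(Sock, Bytes + byte_size(Data));
>>>>               {tcp_closed, Sock} ->
>>>>                   {ok, Bytes}
>>>>           end.
>>>>   )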
>>>>
>>>>   Server TCP options:
>>>>           {active, once},
>>>>           {reuseaddr, true},
>>>>           {packet, 0},
>>>>           {packet_size, 65536},
>>>>           {recbuf, 1000000}
>>>>
>>>>   Client TCP options:
>>>>           {packet, raw},
>>>>           {packet_size, 65536},
>>>>           {sndbuf, 1024 * 1024},
>>>>           {send_timeout, 3000}
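>>>>   (Options like these are simply passed in the option lists; a sketch
>>>>   of the socket setup only, not the full benchmark, with Host and Port
>>>>   as placeholders:
>>>>
>>>>       %% server side
>>>>       {ok, LSock} = gen_tcp:listen(Port, [binary, {active, once},
>>>>                                           {reuseaddr, true},
>>>>                                           {packet, 0},
>>>>                                           {packet_size, 65536},
>>>>                                           {recbuf, 1000000}]),
>>>>       {ok, Sock} = gen_tcp:accept(LSock),
>>>>
>>>>       %% client side
>>>>       {ok, CSock} = gen_tcp:connect(Host, Port,
>>>>                                     [binary, {packet, raw},
>>>>                                      {packet_size, 65536},
>>>>                                      {sndbuf, 1024 * 1024},
>>>>                                      {send_timeout, 3000}]).
>>>>   )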
>>>>
>>>>   Here are some results using Erlang R12B-3 (erl +K true in the Linux
>>>>   version):
>>>>
>>>>   Linux (Ubuntu 8.10 x86_64, Intel Core 2 Q6600, 8 GB):
>>>>   - Using localhost (127.0.0.1): 7474.14 MB in 10.01 secs (746.66
>>>>   MB/sec)
>>>>   - Using a 192.168.x.x IP address: 8064.94 MB in 10.00 secs (806.22
>>>>   MB/sec) [Don't ask me why it's faster than using loopback; I
>>>>   repeated the tests and got the same result]
>>>>
>>>>   Windows XP SP3 (32 bits), Intel Core 2 Duo E6600:
>>>>   - Using loopback: 2166.97 MB in 10.02 secs (216.35 MB/sec)
>>>>   - Using 192.168.x.x IP address: 2140.72 MB in 10.02 secs (213.75
>>>> MB/sec)
>>>>   - On Gigabit Ethernet to the Q6600 Linux box: 1063.61 MB in 10.02
>>>>   secs (106.17 MB/sec) using non-jumbo frames. I don't think my router
>>>>   supports jumbo frames.
>>>>
>>>>   There's undoubtedly a huge discrepancy between the two systems;
>>>>   whether it's because of kernel poll in Linux, or because the Linux
>>>>   box is 64 bits, or because of unoptimized Windows TCP/IP settings, I
>>>>   don't know. I don't believe it's the number of CPUs (there's only
>>>>   one process sending and one receiving), or the CPU speed (they are
>>>>   both 2.4 GHz Core 2s).
>>>>
>>>>   Maybe some Erlang TCP/IP gurus could comment.
>>>>
>>>>   I've attached the code for interest. It's not supposed to be
>>>>   production quality, so please don't beat me up :) although I am
>>>>   always open to suggestions for improvement. If you do improve it,
>>>>   I'd like to see what you've done. Maybe there is another simple
>>>>   Erlang tcp benchmark program out there (i.e. not Tsung), but I
>>>>   couldn't find one in a cursory Google search.
>>>>
>>>>   To run:
>>>>
>>>>   VM1:
>>>>
>>>>   tb_server:start(Port, Opts).
>>>>   tb_server:stop() to stop.
>>>>
>>>>   Port = integer()
>>>>   Opts = []|[opt()]
>>>>   opt() = {atom(), term()} (Accepts inet setopts options, too)
>>>>
>>>>   The server prints out the transfer rate (for simplicity).
>>>>
>>>>   VM2:
>>>>   tb_client(Host, Port, Opts).
>>>>
>>>>   Host = atom()|string() hostname or IP address
>>>>   Port, Opts as in tb_server
>>>>
>>>>   Runs for 10 seconds, sending a 64K buffer as fast as possible to
>>>>   Host/Port.
>>>>   You can change this to 20 seconds (for example) by adding the tuple
>>>>   {time_limit, 20000} to Opts.
>>>>   You can change the buffer size by adding the tuple {blksize, Bytes}
>>>>   to Opts.
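>>>>   (So a typical session might look like this, with a made-up port
>>>>   number and assuming the client entry point is tb_client:start/3:
>>>>
>>>>       %% VM1:
>>>>       tb_server:start(5001, []).
>>>>       %% VM2: a 20-second run with 1 KB blocks
>>>>       tb_client:start("localhost", 5001,
>>>>                       [{time_limit, 20000}, {blksize, 1024}]).
>>>>   )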
>>>>
>>>>   2008/6/20 Rapsey <rapsey@REDACTED>:
>>>>
>>>>   All data goes through nginx which acts as a proxy. Its CPU
>>>>   consumption is never over 1%.
>>>>
>>>>
>>>>   Sergej
>>>>
>>>>
>>>>   On Thu, Jun 19, 2008 at 9:35 PM, Javier París Fernández
>>>>   <javierparis@REDACTED> wrote:
>>>>
>>>>
>>>>   On 19/06/2008, at 20:06, Rapsey wrote:
>>>>
>>>>
>>>>       It loops from another module; that way I can update the code at
>>>>       any time without disrupting anything.
>>>>       The packets are generally a few hundred bytes big, except
>>>>       keyframes, which tend to be in the kB range. I haven't tried
>>>>       looking with Wireshark. Still, it seems a bit odd that large
>>>>       CPU consumption would be the symptom. The traffic is strictly
>>>>       one way: either someone is sending the stream or receiving it.
>>>>       The transmit could of course be written with a passive receive,
>>>>       but the code would be significantly uglier. I'm sure someone
>>>>       here knows whether setting {active, once} for every packet is
>>>>       CPU intensive or not.
>>>>       It seems the workings of gen_tcp are quite platform dependent.
>>>>       If I run the code on Windows, sending more than 128 bytes per
>>>>       gen_tcp call significantly decreases network output.
>>>>       Oh, and I forgot to mention I use R12B-3.
>>>>
>>>>
>>>>   Hi,
>>>>
>>>>   Without being an expert:
>>>>
>>>>   200-300 Mb/s in small (hundreds of bytes) packets means a *lot* of
>>>>   system calls if you are doing a gen_tcp:send for each one. If you
>>>>   buffer 3 packets, you are reducing that by a factor of 3 :). I'd try
>>>>   doing a small test that does the same thing in C and comparing the
>>>>   results. I think it will also eat a lot of CPU.
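>>>>   (In Erlang the buffering can be as simple as collecting the small
>>>>   binaries into a list and handing the whole thing to a single
>>>>   gen_tcp:send/2 call, since send accepts iodata; a sketch:
>>>>
>>>>       %% one system call for N packets instead of N calls
>>>>       send_batch(Sock, Packets) when is_list(Packets) ->
>>>>           gen_tcp:send(Sock, Packets).
>>>>
>>>>   e.g. send_batch(Sock, [Pkt1, Pkt2, Pkt3]).)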
>>>>
>>>>   About the proxy CPU... I'm a bit lost about it, but speculating
>>>>   wildly, it is possible that the time spent in the system calls that
>>>>   gen_tcp makes is being charged to the proxy process.
>>>>
>>>>   Regards.
>>>>
>>>>
>>>>
>>>>
>>>
>>>
>>
>
> --
> Johnny Billquist                  || "I'm on a bus
>                                  ||  on a psychedelic trip
> email: bqt@REDACTED             ||  Reading murder books
> pdp is alive!                     ||  tryin' to stay hip" - B. Idol
>
>