[erlang-questions] why is gen_tcp:send slow?
Johnny Billquist
bqt@REDACTED
Wed Jun 25 09:44:04 CEST 2008
Hi,
no, unfortunately I don't have an answer for their observed performance.
Although I haven't really looked at it either. Time, you know... :-)
But keep testing and try to figure it out. You always learn something, and
hopefully you'll find the answer as well.
Johnny
Edwin Fine wrote:
> Ok Johnny, I think I get it now. Thanks for the detailed explanation. I
> wonder why the original poster (Sergej/Rapsey) is seeing such poor TCP/IP
> performance? In any case, I am still going to do some more benchmarks to see
> if I can understand how the different components of TCP/IP communication in
> Erlang (inet:setopts() and gen_tcp) affect performance, CPU overhead and so
> on.
>
> The reason I got into all this is because I was seeing very good performance
> between two systems on a LAN, and terrible performance over a non-local
> overseas link that had an RTT of about 290ms. Through various measurements
> and Wireshark usage I found the link was carrying only 3.4 packets per
> second, with only about 56 data bytes in each packet. When I investigated
> further, I found that a function I thought was running asynchronously was
> actually running synchronously inside a gen_server:call(). When I spawned
> the function, I still only saw 3.4 packets per second (using Wireshark
> timestamps) but each packet was now full of multiple blocks of data, not
> just 56 bytes, so the actual throughput went up hugely. Nothing else
> changed. When I tried to find out where the 3.4 was coming from, I
> calculated 1/3.4 = 0.294 s (294 ms), which was (coincidentally?) the exact RTT. That's
> why I thought there was a relationship between RTT and the number of
> packets/second a link could carry.
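>
> To make the difference concrete, here is a minimal sketch (hypothetical names,
> not my actual code) of the two calling patterns:
>
> %% Hypothetical gen_server front end; only the calling pattern matters.
> send_sync(Server, Block) ->
>     %% The caller blocks until the server has handled the request, so
>     %% sends are strictly serialized, one block at a time.
>     gen_server:call(Server, {send, Block}).
>
> send_async(Server, Block) ->
>     %% The caller does not wait, so many small blocks can accumulate
>     %% in the server's mailbox and be coalesced into the same TCP segment.
>     gen_server:cast(Server, {send, Block}).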
>
> Now I have to go back and try to figure it all out all over again :( unless you
> can explain it to me (he said hopefully).
>
> Thanks
> Ed
>
>
> On Tue, Jun 24, 2008 at 7:15 PM, Johnny Billquist <bqt@REDACTED> wrote:
>
>> Edwin, always happy to help out...
>>
>> Edwin Fine wrote:
>>
>>> Johnny,
>>>
>>> Thanks for the lesson! I am always happy to learn. Like I said, I am not an
>>> expert in TCP/IP.
>>>
>>> What I was writing about when I said that packets are acknowledged is what I
>>> saw in Wireshark while trying to understand performance issues. I perhaps
>>> should have said "TCP/IP" instead of just "TCP". There were definitely
>>> acknowledgements, but I guess they were at the IP level.
>>>
>> No. IP doesn't have any acknowledgements. IP (as well as UDP) basically just
>> sends packets without any guarantee that they will ever reach the other end.
>> What you saw were TCP acknowledgements, but you misunderstood how they work.
>>
>> Think of a TCP connection as a stream of bytes of unbounded length. Each byte
>> in this stream has a sequence number. TCP sends bytes from this stream,
>> packed into IP packets. Each IP packet will carry one or several bytes from
>> that stream.
>> TCP at the other end will acknowledge the highest ordered byte that it has
>> received. How many packets it took to get to that byte is irrelevant, as are
>> any retransmissions, and so on... The window size tells how many additional
>> bytes from this stream can be sent, counted onward from the point that the
>> acknowledgement refers to.
>>
>> (In reality, the sequence numbers are not infinite, but are actually a
>> 32-bit number, which wraps. But since window sizes normally fit in a 16-bit
>> quantity, there is no chance of ever getting back to the same sequence
>> number again before it has long been passed, so no risk of confusion or
>> errors there.)
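>>
>> As a toy illustration of that bookkeeping (not real TCP code): if AckedSeq is
>> the highest byte the receiver has acknowledged, NextSeq is the next byte the
>> sender wants to transmit, and Window is the advertised window size, then the
>> sender may put at most this much more on the wire:
>>
>> usable_window(AckedSeq, NextSeq, Window) ->
>>     %% NextSeq - AckedSeq bytes are already in flight; the window
>>     %% allows Window bytes beyond the acknowledged point.
>>     Window - (NextSeq - AckedSeq).
>>
>> %% e.g. usable_window(10000, 14000, 8192) -> 4192 more bytes may be sent.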
>>
>>> I wonder what the MSS is for loopback? I think it's about 1536 on my eth0
>>> interface, but not sure.
>>>
>> Smart implementations use the interface MTU minus 40 as the MSS for a
>> loopback connection. Otherwise the rule of thumb is that if the destination
>> is on the same network, MSS is usually set to 1460, and to 536 for
>> destinations on other networks.
>> This comes from the fact that the local network (usually Ethernet) has an
>> MTU of 1500, and the IP header is normally 20 bytes, and so is the standard
>> TCP header, leaving 1460 bytes of data in an Ethernet frame.
>> For non-local destinations, IP requires that at least 576-byte packets can
>> go through unfragmented. The rest follows. :-)
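>>
>> Written out as a tiny helper, that rule of thumb is just:
>>
>> mss_for_mtu(MTU) ->
>>     MTU - 40.   %% 20-byte IP header + 20-byte TCP header
>>
>> %% mss_for_mtu(1500) -> 1460   (local Ethernet)
>> %% mss_for_mtu(576)  -> 536    (minimum unfragmented IP datagram)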
>>
>>> As for RTT, I sent data over a link that had a very long (290ms) RTT, and
>>> that definitely limited the rate at which packets could be sent. Can RTT be
>>> used to calculate the theoretical maximum traffic that a link can carry?
>>> For example, a satellite link with a 400ms RTT but 2 Mbps bandwidth?
>>>
>> No. RTT cannot be used to calculate anything regarding traffic bandwidth.
>> You can keep sending packets until the window is exhausted, no matter what
>> the RTT says. The RTT is only used to calculate when to do retransmissions
>> if you haven't received an ACK.
>> The only other thing that affects packet rates is the slow start
>> algorithm. That will be affected by the round trip delays, since it adds a
>> throttling effect on the window, in addition to what the receiver says. The
>> reason for it being affected by the round trip delay is that the slow start
>> window size is only increased when you get ACK packets back.
>> But, assuming the link can take the load, and you don't lose a lot of
>> packets, the slow start algorithm will pretty quickly stop being a factor.
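>>
>> (If the window does get exhausted and the sender then sits idle for a full
>> round trip waiting for the ACK, the ceiling is simply the window divided by
>> the RTT; a back-of-the-envelope helper, assuming exactly that worst case:)
>>
>> window_limited_rate(WindowBytes, RttSeconds) ->
>>     WindowBytes / RttSeconds.
>>
>> %% window_limited_rate(65536, 0.29) -> ~226000 bytes/sec, i.e. a 64K window
>> %% on a 290ms link caps out well below the raw bandwidth of the link.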
>>
>> Johnny
>>
>>
>>
>>> Ed
>>>
>>> On Tue, Jun 24, 2008 at 6:00 PM, Johnny Billquist <bqt@REDACTED> wrote:
>>>
>>>> No. TCP doesn't acknowledge every packet. In fact, TCP doesn't acknowledge
>>>> packets as such at all. TCP is not packet based. It's just that if you use
>>>> IP as the carrier, IP itself is packet based.
>>>> TCP can in theory generate any number of packets per second. However, the
>>>> amount of unacknowledged data that can be outstanding at any time is limited
>>>> by the transmit window. Each packet carries a window size, which is how much
>>>> more data can be accepted by the receiver. TCP can (is allowed to) send
>>>> that much data and no more.
>>>>
>>>> The RTT calculations are used for figuring out how long to wait before
>>>> doing retransmissions. You also normally have a slow start transmission
>>>> algorithm which prevents the sender from even using the full window size
>>>> from the start, as a way of avoiding congestion. That is used in
>>>> combination with a backoff algorithm when retransmissions are needed to
>>>> further decrease congestion, but all of this only really comes into effect
>>>> if you start losing data, and TCP actually needs to do retransmissions.
>>>>
>>>> Another thing you have is an algorithm called Nagle, which tries to collect
>>>> small amounts of data into larger packets before sending them, so that you
>>>> don't flood the net with silly small packets.
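>>>>
>>>> (In Erlang, Nagle can be switched off per socket with the nodelay option,
>>>> if the latency of small writes matters more than the packet count; for
>>>> reference:)
>>>>
>>>> %% Disable Nagle on an existing gen_tcp socket:
>>>> ok = inet:setopts(Socket, [{nodelay, true}]).
>>>> %% or pass {nodelay, true} in the option list given to
>>>> %% gen_tcp:connect/3 or gen_tcp:listen/2.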
>>>>
>>>> One additional detail is that when the receive buffer becomes full,
>>>> receivers normally don't announce newly freed space immediately, since that
>>>> is normally a rather small amount, but instead wait a while, until a larger
>>>> part of the receive buffer is free, so that the sender can actually send
>>>> some full sized packets once it starts sending again.
>>>>
>>>> In addition to all this, you also have a max segment size which is
>>>> negotiated between the TCP ends, and which limits the size of a single IP
>>>> packet sent by the TCP protocol. This is done in order to try to avoid
>>>> packet fragmentation.
>>>>
>>>> So the window size is actually a flow control mechanism, and is in reality
>>>> limiting the amount of data that can be sent. And it varies all the time.
>>>> And the number of packets that will be used for sending that much data is
>>>> determined by the MSS (Max Segment Size).
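>>>>
>>>> (Or as a one-liner: the number of packets needed for a given amount of data
>>>> is just the data size divided by the MSS, rounded up:)
>>>>
>>>> segments_needed(Bytes, MSS) ->
>>>>     (Bytes + MSS - 1) div MSS.
>>>>
>>>> %% segments_needed(1000000, 1460) -> 685 packets for a megabyte.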
>>>>
>>>> Sorry for the long text on how TCP works. :-)
>>>>
>>>> Johnny
>>>>
>>>> Edwin Fine wrote:
>>>>
>>>>> David,
>>>>> Thanks for trying out the benchmark.
>>>>>
>>>>> With my limited knowledge of TCP/IP, I believe you are seeing the 300,000
>>>>> limit because TCP/IP requires acknowledgements to each packet, and although
>>>>> it can batch up multiple acknowledgements in one packet, there is a
>>>>> theoretical limit of packets per second beyond which it cannot go due to
>>>>> the laws of physics. I understand that limit is determined by the Round-Trip
>>>>> Time (RTT), which can be shown by ping. On my system, pinging 127.0.0.1
>>>>> gives a minimum RTT of 0.018 ms (out of 16 pings). That means that the
>>>>> maximum number of packets that can make it to the destination and back per
>>>>> second is 1/0.000018, or about 55555 packets per second. The TCP/IP stack
>>>>> is evidently packing 5 or 6 blocks into each packet to get the 300K
>>>>> blocks/sec you are seeing. Using Wireshark or Ethereal would confirm this.
>>>>> I am guessing that this means that the TCP window is about 6 * 1000 bytes
>>>>> or 6KB.
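>>>>>
>>>>> Put as code, the arithmetic above (under my assumption of one acknowledged
>>>>> round trip per packet) is just:
>>>>>
>>>>> packets_per_sec(RttSeconds) ->
>>>>>     1 / RttSeconds.
>>>>>
>>>>> blocks_per_packet(BlocksPerSec, RttSeconds) ->
>>>>>     BlocksPerSec / packets_per_sec(RttSeconds).
>>>>>
>>>>> %% packets_per_sec(0.000018)           -> ~55555
>>>>> %% blocks_per_packet(300000, 0.000018) -> ~5.4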
>>>>>
>>>>> What I neglected to tell this group is that I have modified the Linux
>>>>> sysctl.conf as follows, which might have had an effect (like I said, I
>>>>> am
>>>>> not an expert):
>>>>>
>>>>> # increase Linux autotuning TCP buffer limits
>>>>> # min, default, and max number of bytes to use
>>>>> # set max to at least 4MB, or higher if you use very high BDP paths
>>>>> net.ipv4.tcp_rmem = 4096 87380 16777216
>>>>> net.ipv4.tcp_wmem = 4096 32768 16777216
>>>>>
>>>>> When I have more time, I will vary a number of different Erlang TCP/IP
>>>>> parameters and get a data set together that gives a broader picture of
>>>>> the
>>>>> effect of the parameters.
>>>>>
>>>>> Thanks again for taking the time.
>>>>>
>>>>> 2008/6/24 David Mercer <dmercer@REDACTED <mailto:dmercer@REDACTED>>:
>>>>>
>>>>> I tried some alternative block sizes (using the blksize option). I
>>>>> found that from 1 to somewhere around (maybe a bit short of) 1000
>>>>> bytes, the test was able to send about 300,000 blocks in 10 seconds
>>>>> regardless of size. (That means, 0.03 MB/sec for block size of 1,
>>>>> 0.3 MB/sec for block size of 10, 3 MB/sec for block size of 100,
>>>>> etc.) I suspect the system was CPU bound at those levels.
>>>>>
>>>>>
>>>>> Above 1000, the number of blocks sent seemed to decrease, though
>>>>> this was more than offset by the increased size of the blocks. Above
>>>>> about 10,000 byte blocks (may have been less, I didn't check
>>>>> any value between 4,000 and 10,000), however, performance peaked and
>>>>> block size no longer mattered: it always sent between 70 and 80
>>>>> MB/sec. My machine is clearly slower than Edwin's…
>>>>>
>>>>>
>>>>> DBM
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> ------------------------------------------------------------------------
>>>>>
>>>>> *From:* erlang-questions-bounces@REDACTED
>>>>> <mailto:erlang-questions-bounces@REDACTED>
>>>>> [mailto:erlang-questions-bounces@REDACTED
>>>>> <mailto:erlang-questions-bounces@REDACTED>] *On Behalf Of *Rapsey
>>>>> *Sent:* Tuesday, June 24, 2008 14:01
>>>>> *To:* erlang-questions@REDACTED <mailto:erlang-questions@REDACTED
>>>>> *Subject:* Re: [erlang-questions] why is gen_tcp:send slow?
>>>>>
>>>>>
>>>>> You're using very large packets. I think the results would be much
>>>>> more telling if the packets were a few kB at most. That is
>>>>> closer to most real life situations.
>>>>>
>>>>>
>>>>> Sergej
>>>>>
>>>>> On Tue, Jun 24, 2008 at 8:43 PM, Edwin Fine
>>>>> <erlang-questions_efine@REDACTED
>>>>> <mailto:erlang-questions_efine@REDACTED>> wrote:
>>>>>
>>>>> I wrote a small benchmark in Erlang to see how fast I could get
>>>>> socket communications to go. All the benchmark does is pump the same
>>>>> buffer to a socket for (by default) 10 seconds. It uses {active,
>>>>> once} each time, just like you do.
>>>>>
>>>>> Server TCP options:
>>>>> {active, once},
>>>>> {reuseaddr, true},
>>>>> {packet, 0},
>>>>> {packet_size, 65536},
>>>>> {recbuf, 1000000}
>>>>>
>>>>> Client TCP options:
>>>>> {packet, raw},
>>>>> {packet_size, 65536},
>>>>> {sndbuf, 1024 * 1024},
>>>>> {send_timeout, 3000}
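>>>>>
>>>>> Roughly speaking, those option lists are handed straight to gen_tcp; a
>>>>> sketch with the mode option and error handling left out:
>>>>>
>>>>> server(Port) ->
>>>>>     {ok, LSock} = gen_tcp:listen(Port, [{active, once}, {reuseaddr, true},
>>>>>                                         {packet, 0}, {packet_size, 65536},
>>>>>                                         {recbuf, 1000000}]),
>>>>>     gen_tcp:accept(LSock).
>>>>>
>>>>> client(Host, Port, Buffer) ->
>>>>>     {ok, Sock} = gen_tcp:connect(Host, Port, [{packet, raw},
>>>>>                                               {packet_size, 65536},
>>>>>                                               {sndbuf, 1024 * 1024},
>>>>>                                               {send_timeout, 3000}]),
>>>>>     gen_tcp:send(Sock, Buffer).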
>>>>>
>>>>> Here are some results using Erlang R12B-3 (erl +K true in the Linux
>>>>> version):
>>>>>
>>>>> Linux (Ubuntu 8.10 x86_64, Intel Core 2 Q6600, 8 GB):
>>>>> - Using localhost (127.0.0.1): 7474.14 MB in 10.01 secs (746.66 MB/sec)
>>>>> - Using 192.168.x.x IP address: 8064.94 MB in 10.00 secs (806.22
>>>>> MB/sec) [Don't ask me why it's faster than using loopback, I
>>>>> repeated the tests and got the same result]
>>>>>
>>>>> Windows XP SP3 (32 bits), Intel Core 2 Duo E6600:
>>>>> - Using loopback: 2166.97 MB in 10.02 secs (216.35 MB/sec)
>>>>> - Using 192.168.x.x IP address: 2140.72 MB in 10.02 secs (213.75
>>>>> MB/sec)
>>>>> - On Gigabit Ethernet to the Q6600 Linux box: 1063.61 MB in 10.02
>>>>> secs (106.17 MB/sec) using non-jumbo frames. I don't think my router
>>>>> supports jumbo frames.
>>>>>
>>>>> There's undoubtedly a huge discrepancy between the two systems,
>>>>> whether because of kernel poll in Linux, or that it's 64 bits, or
>>>>> unoptimized Windows TCP/IP flags, I don't know. I don't believe it's
>>>>> the number of CPUs (there's only 1 process sending and one
>>>>> receiving), or the CPU speed (they are both 2.4 GHz Core 2s).
>>>>>
>>>>> Maybe some Erlang TCP/IP gurus could comment.
>>>>>
>>>>> I've attached the code for interest. It's not supposed to be
>>>>> production quality, so please don't beat me up :) although I am
>>>>> always open to suggestions for improvement. If you do improve it,
>>>>> I'd like to see what you've done. Maybe there is another simple
>>>>> Erlang tcp benchmark program out there (i.e. not Tsung), but I
>>>>> couldn't find one in a cursory Google search.
>>>>>
>>>>> To run:
>>>>>
>>>>> VM1:
>>>>>
>>>>> tb_server:start(Port, Opts).
>>>>> tb_server:stop() to stop.
>>>>>
>>>>> Port = integer()
>>>>> Opts = []|[opt()]
>>>>> opt() = {atom(), term()} (Accepts inet setopts options, too)
>>>>>
>>>>> The server prints out the transfer rate (for simplicity).
>>>>>
>>>>> VM2:
>>>>> tb_client(Host, Port, Opts).
>>>>>
>>>>> Host = atom()|string() hostname or IP address
>>>>> Port, Opts as in tb_server
>>>>>
>>>>> Runs for 10 seconds, sending a 64K buffer as fast as possible to
>>>>> Host/Port.
>>>>> You can change this to 20 seconds (e.g.) by adding the tuple
>>>>> {time_limit, 20000} to Opts.
>>>>> You can change buffer size by adding the tuple {blksize, Bytes} to
>>>>> Opts.
>>>>>
>>>>> 2008/6/20 Rapsey <rapsey@REDACTED <mailto:rapsey@REDACTED>>:
>>>>>
>>>>> All data goes through nginx which acts as a proxy. Its CPU
>>>>> consumption is never over 1%.
>>>>>
>>>>>
>>>>> Sergej
>>>>>
>>>>>
>>>>> On Thu, Jun 19, 2008 at 9:35 PM, Javier París Fernández
>>>>> <javierparis@REDACTED <mailto:javierparis@REDACTED>> wrote:
>>>>>
>>>>>
>>>>> On 19/06/2008, at 20:06, Rapsey wrote:
>>>>>
>>>>>
>>>>> It loops from another module, that way I can update the code at
>>>>> any time without disrupting anything.
>>>>> The packets are generally a few hundred bytes big, except
>>>>> keyframes which tend to be in the kB range. I haven't tried
>>>>> looking with wireshark. Still it seems a bit odd that a large
>>>>> CPU consumption would be the symptom. The traffic is strictly
>>>>> one way. Either someone is sending the stream or receiving it.
>>>>> The transmit could of course be written with a passive receive,
>>>>> but the code would be significantly uglier. I'm sure someone
>>>>> here knows if setting {active, once} every packet is CPU
>>>>> intensive or not.
>>>>> It seems the workings of gen_tcp are quite platform dependent. If
>>>>> I run the code in windows, sending more than 128 bytes per
>>>>> gen_tcp call significantly decreases network output.
>>>>> Oh and I forgot to mention I use R12B-3.
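>>>>>
>>>>> For reference, the receive pattern I mean is the usual one (a sketch;
>>>>> handle/1 is a hypothetical callback):
>>>>>
>>>>> recv_loop(Socket) ->
>>>>>     receive
>>>>>         {tcp, Socket, Data} ->
>>>>>             handle(Data),
>>>>>             %% re-arm the socket: one setopts call per message
>>>>>             ok = inet:setopts(Socket, [{active, once}]),
>>>>>             recv_loop(Socket);
>>>>>         {tcp_closed, Socket} ->
>>>>>             ok
>>>>>     end.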
>>>>>
>>>>>
>>>>> Hi,
>>>>>
>>>>> Without being an expert.
>>>>>
>>>>> 200-300 mb/s in small (hundreds of bytes) packets means a *lot* of
>>>>> system calls if you are doing a gen_tcp:send for each one. If you
>>>>> buffer 3 packets, you are reducing that by a factor of 3 :). I'd try
>>>>> to do a small test doing the same thing in C and compare the
>>>>> results. I think it will also eat a lot of CPU.
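>>>>>
>>>>> A rough sketch of what I mean by buffering (hypothetical code, untested):
>>>>> collect a few small binaries and hand them to gen_tcp:send/2 as a single
>>>>> iolist, so three packets cost one call instead of three:
>>>>>
>>>>> buffer_loop(Socket, Acc, N) when N >= 3 ->
>>>>>     %% gen_tcp:send/2 accepts an iolist, so no concatenation is needed
>>>>>     ok = gen_tcp:send(Socket, lists:reverse(Acc)),
>>>>>     buffer_loop(Socket, [], 0);
>>>>> buffer_loop(Socket, Acc, N) ->
>>>>>     receive
>>>>>         {packet, Bin} ->
>>>>>             buffer_loop(Socket, [Bin | Acc], N + 1)
>>>>>     end.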
>>>>>
>>>>> About the proxy CPU... I'm a bit lost about it, but speculating
>>>>> wildly, it is possible that the time spent in the system calls
>>>>> that gen_tcp makes is being accounted to the proxy process's CPU.
>>>>>
>>>>> Regards.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> ------------------------------------------------------------------------
>>>>>
>>>>>
>>>>>
>>>>
>> --
>> Johnny Billquist || "I'm on a bus
>> || on a psychedelic trip
>> email: bqt@REDACTED || Reading murder books
>> pdp is alive! || tryin' to stay hip" - B. Idol
>>
>>
>
--
Johnny Billquist || "I'm on a bus
|| on a psychedelic trip
email: bqt@REDACTED || Reading murder books
pdp is alive! || tryin' to stay hip" - B. Idol