[erlang-questions] Erlang TCP throughput slowdown

Fri Mar 22 01:42:55 CET 2019

Thanks for the great reply, Jesper! Please see some updated comments below.

> On 19 Mar 2019, at 16:04, Jesper Louis Andersen <jesper.louis.andersen@REDACTED> wrote:
> 
> Some napkin math: Suppose you have 1 connection. You have a lower RTT of 10ms. At most, this is 100 req/s on that connection. Suppose we have 500 of those connections. Then the maximal req/s is 500*100 = 50,000. 

Right, I had the intuition that with a larger latency, I’d need more clients to reach the same throughput. In the numbers I shared in the previous email, you can see that I’ve almost reached the maximum number of req/s with 4,000 connections (with your calculations, it should be 400,000 req/s, and I get 340,000). In this benchmark, the clients can either route requests to a node on another site (10ms RTT), or stay within a site (0.2 ms RTT), so the upper bound will be larger, depending on the distribution.
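
To make the napkin math concrete, here is a small Python sketch of the upper bound. It assumes each connection keeps exactly one request in flight, and the function names and the `remote_share` parameter are my own, for illustration only. The 10 ms and 0.2 ms RTTs are the ones mentioned above.

```python
def max_throughput(connections: int, rtt_ms: float) -> float:
    """Upper bound in req/s when every connection has one request in flight."""
    return connections * 1000.0 / rtt_ms


def mixed_max_throughput(connections: int, remote_share: float,
                         remote_rtt_ms: float = 10.0,
                         local_rtt_ms: float = 0.2) -> float:
    """Same bound when a fraction of requests crosses sites.

    remote_share is the (assumed) fraction of requests routed to the
    remote site; the rest stay within the local site.
    """
    mean_rtt_ms = remote_share * remote_rtt_ms + (1.0 - remote_share) * local_rtt_ms
    return connections * 1000.0 / mean_rtt_ms


if __name__ == "__main__":
    print(max_throughput(500, 10.0))         # Jesper's example: 50000.0
    print(max_throughput(4000, 10.0))        # all requests remote: 400000.0
    print(mixed_max_throughput(4000, 0.5))   # half local, half remote
```

The more requests stay within a site, the higher the bound gets, so the 400,000 figure is the floor of the upper bound rather than the ceiling.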

But the thing is, 4,000 concurrent connections don’t sound like a lot (especially if they’re not doing much), so I’m surprised that even at 340k req/s I’m seeing a mean latency of 12ms (and a 99th percentile of 230ms, to boot). Why is it so overloaded? I would expect a beefy machine to cope with that load (the actual saturation point seems to be reached at 2,000 concurrent connections, as the throughput drops after that).

> The other problem is that your load generator coordinates which leads to coordinated omission[0]. The load generator only issues the next request once the previous one completes. It is usually better to keep the bandwidth usage constant and then measure latency, counting a late request against the system.
> 
> The astute reader will also observe you measure the mean latency. This is not very useful, and you should look at either a boxplot, kernel density plot, histogram, or the like. If you know the data is normally distributed with the same standard deviation, then your average latency makes sense as a comparable value. But this requires you plot the data, look at it and make sure they have that shape. Otherwise you can be led astray.

Thank you for pointing this out. I was not very clear in my first email about this. The bench stores the results in a histogram and computes statistics over a sliding window of 5 secs. For each client machine, I get the mean and the 50th, 95th, 99th and 99.9th percentiles. However, when aggregating the results across all the client machines, I only report the min, mean and max values, since I can’t meaningfully aggregate the other results (but they are still useful to see if one machine is performing worse than the others). In all these tests I’m forcing a uniform distribution of requests, but for future tests this would change.
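
A small Python sketch of why the per-machine percentiles can’t simply be averaged, and what would work instead. It assumes each client machine could export its raw bucket counts (latency in ms mapped to a count), which is not something the bench does today. The `percentile` helper and the sample numbers are made up for illustration.

```python
import math
from collections import Counter


def percentile(hist: Counter, p: float) -> int:
    """Nearest-rank percentile of a histogram mapping latency_ms -> count."""
    total = sum(hist.values())
    if total == 0:
        raise ValueError("empty histogram")
    rank = max(1, math.ceil(total * p / 100))
    seen = 0
    for value in sorted(hist):
        seen += hist[value]
        if seen >= rank:
            return value
    raise AssertionError("unreachable: rank never exceeds total")


# Two hypothetical client machines with very different tails.
machine_a = Counter({1: 99, 200: 1})
machine_b = Counter({1: 50, 200: 50})

p99_a = percentile(machine_a, 99)               # 1
p99_b = percentile(machine_b, 99)               # 200
averaged = (p99_a + p99_b) / 2                  # 100.5, matches no real sample
merged = percentile(machine_a + machine_b, 99)  # 200, the p99 of all samples

print(p99_a, p99_b, averaged, merged)
```

Adding the bucket counts first and computing the percentile afterwards gives the percentile of the combined population, whereas the average of the two p99 values corresponds to nothing that was actually measured.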

As for the coordinated omission problem, thanks for reminding me! I’ve watched Gil Tene’s talk a few times, but it’s one of those things that you forget if you don’t work on benchmarking very often. I know HdrHistogram can sort of compensate for missing values if you know the expected interval between requests, but afaik folsom (https://github.com/folsom-project/folsom), the library Basho Bench uses, can’t do that.
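
For reference, a rough Python sketch of that kind of compensation as I understand it: when a request takes longer than the expected interval, the requests that should have been issued in the meantime are back-filled with linearly decreasing latencies. The function name and the plain list used as a sample store are my own; this is not folsom’s or HdrHistogram’s actual API.

```python
def record_with_expected_interval(samples: list, value_ms: int,
                                  expected_interval_ms: int) -> None:
    """Record value_ms, plus synthetic samples for the requests that a
    constant-rate load generator would have sent while this one was stuck."""
    samples.append(value_ms)
    if expected_interval_ms <= 0:
        return
    missing = value_ms - expected_interval_ms
    while missing >= expected_interval_ms:
        samples.append(missing)
        missing -= expected_interval_ms


samples = []
# A request that should go out every 10 ms but took 50 ms to complete.
record_with_expected_interval(samples, 50, 10)
print(samples)  # [50, 40, 30, 20, 10]
```

This only patches the recorded data after the fact. Issuing requests at a constant rate, as Jesper suggests, avoids the problem at the source.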

> Now to solutions:
> 
> […]
> 
> Batching is an option. Collect multiple requests and send them all off at the same time. This effectively makes sure you can have multiple requests inflight, which gets you around the delay constant. It also allows a smart server to process multiple requests simultaneously, thus shedding load. Microbatching is alluring here: when the first request arrives, set a cork for 5ms. Upon having read either 500 reqs or the timer triggering, process whatever you have in the batch. Then start all over. This is used by e.g., Kafka and TensorFlow, the latter to avoid the memory bottleneck between DRAM and GPU.

This sounds interesting. I’d have to rethink the architecture of the server, and I’d be worried about introducing a bottleneck in the system if I don’t implement this correctly (as I’d need to push all client requests through a single process acting as a buffer to keep track of them?), but I’ll keep it in mind.
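
To check my understanding of the cork idea, here is a minimal Python sketch of the collection step. The 500 requests and 5 ms limits are the ones from Jesper’s description; the `collect_batch` name and the use of a thread-safe queue as the buffer are assumptions of mine, not how the server works today.

```python
import queue
import time


def collect_batch(requests: queue.Queue, max_size: int = 500,
                  max_wait_s: float = 0.005) -> list:
    """Block until one request arrives, then keep reading until either
    max_size requests are collected or max_wait_s has elapsed."""
    batch = [requests.get()]                  # the first request sets the cork
    deadline = time.monotonic() + max_wait_s
    while len(batch) < max_size:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break                             # timer fired: ship what we have
        try:
            batch.append(requests.get(timeout=remaining))
        except queue.Empty:
            break                             # nothing else arrived in time
    return batch


if __name__ == "__main__":
    q = queue.Queue()
    for i in range(1200):
        q.put(i)
    while not q.empty():
        print(len(collect_batch(q)))
```

Nothing forces a single global buffer here: one such collector per connection, or per shard of connections, would still batch without funnelling every client through the same process, at the cost of smaller batches.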

> Pipelining is another option. Send multiple requests back to back, so they are all in the network. Interlace your receive loop with new requests being sent, so you can keep the work going. You should consider tagging messages with unique identifiers so they can be told apart. This allows out-of-order processing. See plan9's 9p protocol, RabbitMQ/AMQP's correlation IDs, or HTTP/2. Quick implementation: Loic Hoguin's Cowboy/Gun combo works wonders here, and uses the same code base (cowlib). This will avoid the wait time effectively. 

Thanks for the suggestion. I’m not 100% tied to TCP only, and could probably use HTTP so I can use Cowboy and Gun and avoid having to implement pipelining myself.
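
For my own notes, this is roughly the bookkeeping I would need if I did implement pipelining by hand: tag each request with a unique identifier, keep a table of what is in flight, and match responses as they come back in any order. A Python sketch only, with no real socket involved; the `Pipeline` class and its method names are made up.

```python
import itertools


class Pipeline:
    """Tracks in-flight requests by correlation ID."""

    def __init__(self):
        self._ids = itertools.count(1)
        self._pending = {}

    def send(self, payload):
        """Assign an ID and return the (id, payload) frame to write out."""
        req_id = next(self._ids)
        self._pending[req_id] = payload
        return req_id, payload

    def on_response(self, req_id, result):
        """Match a response to its request, regardless of arrival order."""
        payload = self._pending.pop(req_id)   # KeyError on an unknown ID
        return payload, result

    def in_flight(self) -> int:
        return len(self._pending)


if __name__ == "__main__":
    p = Pipeline()
    ids = [p.send(op)[0] for op in ("get a", "get b", "get c")]
    # All three are on the wire before any response has been read.
    print(p.in_flight())
    # Responses may come back out of order.
    for req_id in (ids[1], ids[2], ids[0]):
        print(p.on_response(req_id, "ok"))
```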

Again, thanks for all the help!

Cheers,
Borja