[erlang-questions] Erlang TCP throughput slowdown

Borja de Regil borja.deregil@REDACTED
Tue Mar 19 13:30:59 CET 2019


Hi everyone, this is a long email, so apologies in advance.

I’m currently benchmarking an Erlang service that consists of a set of clients and servers distributed among sites. Clients distribute their requests uniformly among all the servers, irrespective of the site they are in.

The initial results, where all the clients and servers are located in the same site, showed a combined maximum throughput of ~620k requests per second, with a mean latency of ~7.4 ms. This is already past the saturation point of the service, but it illustrates the baseline performance. The mean round-trip latency between clients and servers here is around 0.25 ms.

However, as I distribute the same number of physical machines among sites, the throughput decreases (I’ve only tested this with two sites, but it seems to decrease linearly). With the same configuration, but a 10 ms RTT between the sites, the throughput drops to ~350k requests per second, with a ~12 ms mean latency (again, past the saturation point).

Apart from the increase in latency between sites, no other configuration is changed. My initial expectation was that the throughput would stay the same even if the base latency increased, and that the saturation point would be reached at approximately the same number of requests per second.

To rule out any application-specific configuration, I replicated the experiment with a simple TCP echo server, and the results are almost the same. The experiment is detailed below, in case someone wants to take a closer look.

I’m by no means an expert in Erlang, nor do I know very much about TCP, but I’m puzzled by this behaviour, and I’ve already lost far too many hours trying to figure out where the performance is going. Any advice would be extremely appreciated.

Cheers,
Borja

Experiment details follow:

# Technical setup

All machines are Dell PowerEdge R430 servers (http://i.dell.com/sites/doccontent/shared-content/data-sheets/en/Documents/Dell-PowerEdge-R430-Spec-Sheet.pdf) running Ubuntu 16.04.1 LTS and Erlang/OTP 19 [erts-8.3.5] [source] [64-bit] [smp:32:32], with:

* Two 2.4 GHz 64-bit 8-Core Xeon E5-2630v3 processors, 8.0 GT/s, 20 MB cache
* 64 GB 2133 MT/s DDR4 RAM (8 x 8GB modules)
* 2-4 Intel i350 GbE NICs

# Experiment setup

Servers and clients are partitioned into sites. Within a single site, all server and client machines are attached to a single LAN. To connect the sites, a single physical machine acts as a switch between the LANs of the different sites; artificial latency is added on this switch node using `tc`.

The servers don't know about each other; each one just runs a simple TCP echo server using ranch as the acceptor pool. Each client node (using basho_bench) creates several worker processes that execute a simple ping-pong benchmark. At the beginning of the benchmark, each worker process opens a TCP connection to every server and re-uses those connections for all requests throughout the benchmark.
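
For orientation, this is roughly what the server side looks like (the actual implementation is in the ranch_test repo linked below); the module name is illustrative and a ranch 1.x-style `accept_ack` handshake is assumed:

```
-module(echo_protocol).
-behaviour(ranch_protocol).

-export([start_link/4, init/4]).

start_link(Ref, Socket, Transport, Opts) ->
    Pid = spawn_link(?MODULE, init, [Ref, Socket, Transport, Opts]),
    {ok, Pid}.

init(Ref, Socket, Transport, _Opts) ->
    ok = ranch:accept_ack(Ref),
    %% Framing and flow control as described above: 2-byte length prefix,
    %% one message delivered per {active, once} cycle.
    ok = Transport:setopts(Socket, [{packet, 2}, {active, once}]),
    loop(Socket, Transport).

loop(Socket, Transport) ->
    receive
        {tcp, Socket, Data} ->
            ok = Transport:send(Socket, Data),
            ok = Transport:setopts(Socket, [{active, once}]),
            loop(Socket, Transport);
        {tcp_closed, Socket} ->
            ok;
        {tcp_error, Socket, _Reason} ->
            ok = Transport:close(Socket)
    end.
```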

For each request, a worker chooses a server uniformly at random, sends a random 1 KB payload, and waits synchronously for the same payload to be echoed back.
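
A sketch of the client worker's request path (the real worker is ranch_test_bench.erl, linked below); the function names, port argument and 5-second timeout here are made up for illustration:

```
-module(echo_worker).

-export([connect_all/2, request/1]).

%% Called once per worker at benchmark start: one long-lived connection
%% per server, using the client socket options listed in the next section.
connect_all(Hosts, Port) ->
    [begin
         {ok, Sock} = gen_tcp:connect(Host, Port,
             [binary, {active, false}, {packet, 2}, {nodelay, true}]),
         Sock
     end || Host <- Hosts].

%% One request: pick a server uniformly at random, send a random 1 KB
%% payload, and block until the echoed payload comes back.
request(Sockets) ->
    Sock = lists:nth(rand:uniform(length(Sockets)), Sockets),
    Payload = crypto:strong_rand_bytes(1024),
    ok = gen_tcp:send(Sock, Payload),
    {ok, Payload} = gen_tcp:recv(Sock, 0, 5000),
    ok.
```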

Code is available on GitHub for the server (https://github.com/ergl/ranch_test), the client (https://github.com/ergl/lasp-bench/blob/coord/src/ranch_test_bench.erl) and the benchmark settings (https://github.com/ergl/lasp-bench/blob/coord/examples/ranch_test.config).

# Tuning Options

The relevant socket and acceptor-pool options:

* Ranch acceptor pool options: `[{num_acceptors, erlang:system_info(schedulers_online)}, {max_connections, infinity}]` (32 schedulers in this case; see the listener sketch after this list).
* Server socket options: `[{active, once}, {packet, 2}]`
* Client socket options: `[binary, {active, false}, {packet, 2}, {nodelay, true}]`
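
For reference, a minimal sketch of how the listener might be started with the acceptor-pool options above (this assumes ranch 1.6+ map-style transport options; the listener name and port are made up, and the per-connection packet/active options are set in the protocol handler as sketched earlier):

```
{ok, _} = ranch:start_listener(echo, ranch_tcp,
    #{num_acceptors => erlang:system_info(schedulers_online),
      max_connections => infinity,
      socket_opts => [{port, 7777}]},
    echo_protocol, []).
```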

Erlang options on the server nodes (some of these options are chosen to replicate the original application):

```
+K true                        # enable kernel poll
+A 5                           # async thread pool size
-env ERL_MAX_PORTS 4096        # limit on simultaneously open ports
-env ERL_FULLSWEEP_AFTER 10    # fullsweep GC after 10 generational collections
-env ERL_MAX_ETS_TABLES 50000  # raise the ETS table limit
+zdbbl 32768                   # distribution buffer busy limit (KB)
+C no_time_warp                # no time warp mode
+hmax 12500000                 # default max process heap size (words)
+hmaxk false                   # do not kill processes that exceed +hmax
+stbt tnnps                    # scheduler bind type: thread_no_node_processor_spread
```

# The Problem

For this experiment, there are two sites, each with two server machines and two client machines.

We distinguish two scenarios: one with no latency added between sites (0.25 ms RTT), and another with an extra 10 ms added (10 ms RTT). Several 5-minute runs are performed, each with an increasing number of concurrent workers per client machine (from 50 up to 1,000; since every worker holds a connection to every server, each server ends up holding 200 to 4,000 concurrent connections). The results are in the following tables:

Scenario A, no latency added (0.25 ms RTT)

| Clients (per machine) | Total Clients | Max Throughput (reqs/sec) | Mean Latency (ms) |
|-----------------------|---------------|---------------------------|-------------------|
| 50                    | 200           | 457,279.5                 | 0.4397939         |
| 100                   | 400           | 514,436.5                 | 0.7933541         |
| 500                   | 2,000         | 594,367.7                 | 3.524144          |
| 750                   | 3,000         | 613,786.7                 | 5.600492          |
| 1,000                 | 4,000         | 627,434.9                 | 7.39232           |

Scenario B, 10 ms RTT

| Clients (per machine) | Total Clients | Max Throughput (reqs/sec) | Mean Latency (ms) |
|-----------------------|---------------|---------------------------|-------------------|
| 50                    | 200           | 38,660.02                 | 5.182687          |
| 100                   | 400           | 77,619.95                 | 5.167778          |
| 500                   | 2,000         | 348,516.5                 | 5.901048          |
| 750                   | 3,000         | 344,605.3                 | 8.961493          |
| 1,000                 | 4,000         | 340,299.5                 | 12.05045          |




