[erlang-questions] Erlang distribution links don't fully utilise available resources - OTP 22.0.2 - Why?

Mon Jun 17 16:48:45 CEST 2019

Hi,

We are trying to understand what prevents the Erlang distribution link from
saturating the network. Even though there is plenty of CPU, memory &
network bandwidth, the Erlang distribution doesn't fully utilise available
resources. Can you help us figure out why?

We have a 3-node Erlang 22.0.2 cluster running on Ubuntu 16.04 x86 64bit.

This is the maximum network throughput between node-a & node-b, as measured
by iperf:

iperf -t 30 -c node-b
------------------------------------------------------------
Client connecting to 10.0.1.37, TCP port 5001
TCP window size: 45.0 KByte (default)
------------------------------------------------------------
[  3] local 10.0.1.36 port 43576 connected with 10.0.1.37 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-30.0 sec  78.8 GBytes  22.6 Gbits/sec

We ran this multiple times, in different directions & with different degree
of parallelism, the maximum network throughput is roughly 22 Gbit/s.

We run the following command on node-a:

B = fun F() -> rpc:cast('foo@REDACTED', erlang, is_binary,
[<<0:10000000/unit:8>>]), F() end.
[spawn(fun() -> B() end) || _ <- lists:seq(1, 100)].

This is what the network reports on node-a:

dstat -n 1 10
-net/total-
 recv  send
   0     0
 676k  756M
 643k  767M
 584k  679M
 693k  777M
 648k  745M
 660k  745M
 667k  772M
 651k  709M
 675k  782M
 688k  819M

That roughly translates to 6 Gbit/s. In other words, the Erlang
distribution link between node-a & node-b is maxing out at around ~6
Gbit/s. Erlang distribution is limited to 27% of what we are measuring
consistently and repeatedly outside of Erlang. In other words, iperf is
3.6x faster than an Erlang distribution link. It gets better.

If we start another 100 processes pumping 10Mbyte messages from node-a to
node-c, we see the network throughput double:

dstat -n 1 10
-net/total-
 recv  send
   0     0
1303k 1463M
1248k 1360M
1332k 1458M
1480k 1569M
1339k 1455M
1413k 1494M
1395k 1431M
1359k 1514M
1438k 1564M
1379k 1489M

So 2 distribution links - each to a separate node - utilise 12Gbit/s out of
the 22Gbit/s available on node-a.

What is preventing the distribution links pushing more data through? There
is plenty of CPU & memory available (all nodes have 16 CPUs & 104GB MEM -
n1-highmem-16):

dstat -cm 1 10
----total-cpu-usage---- ------memory-usage-----
usr sys idl wai hiq siq| used  buff  cach  free
 10   6  84   0   0   1|16.3G  118M  284M 85.6G
 20   6  73   0   0   1|16.3G  118M  284M 85.6G
 20   6  74   0   0   0|16.3G  118M  284M 85.6G
 18   6  76   0   0   0|16.4G  118M  284M 85.5G
 19   6  74   0   0   1|16.4G  118M  284M 85.4G
 17   4  78   0   0   0|16.5G  118M  284M 85.4G
 20   6  74   0   0   0|16.5G  118M  284M 85.4G
 19   6  74   0   0   0|16.5G  118M  284M 85.4G
 19   5  76   0   0   1|16.5G  118M  284M 85.4G
 18   6  75   0   0   0|16.5G  118M  284M 85.4G
 18   6  75   0   0   0|16.6G  118M  284M 85.3G

The only smoking gun is the distribution output queue buffer:
https://grafana.gcp.rabbitmq.com/dashboard/snapshot/H329EfN3SFhsveA20ei7jC7JMFHAm8Ru?orgId=1&fullscreen&panelId=62

Speaking of which, we look forward to erlang/otp#2270 being merged:
https://github.com/erlang/otp/pull/2270

All distribution metrics are available here:
https://grafana.gcp.rabbitmq.com/dashboard/snapshot/H329EfN3SFhsveA20ei7jC7JMFHAm8Ru?orgId=1

If you want to see the state of distribution links & dist process state
(they are all green btw), check the point-in-time metrics (they will expire
in 15 days from today):
https://grafana.gcp.rabbitmq.com/d/d-SFCCmZz/erlang-distribution?from=1560775955127&to=1560779424482

How can we tell what is preventing the distribution link from using all
available bandwidth?

Are we missing a configuration flag? These are all the relevant beam.smp
flags that we are using:
https://github.com/erlang/otp/pull/2270#issuecomment-500953352
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://erlang.org/pipermail/erlang-questions/attachments/20190617/6f6f313c/attachment.htm>