[erlang-questions] Erlang distribution links don't fully utilise available resources - OTP 22.0.2 - Why?

Dmytro Lytovchenko dmytro.lytovchenko@REDACTED
Mon Jun 17 17:02:06 CEST 2019


I believe the Erlang distribution is the wrong thing to use if you want to
saturate the network.
There is plenty of overhead for every message that crosses the link: the
data gets copied, then encoded (copied again), then sent, then received
(copied), then decoded (copied again), and finally delivered to the
destination process (copied again). On top of that, the receiving
processes may be slow to fetch the incoming data; they are not running in
hard real time and sometimes go to sleep.
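A quick back-of-the-envelope check (a rough sketch to run in a shell;
numbers will vary by machine) is to time the external-term encode/decode
step alone for one of the 10 MB binaries from the test below. It is only
one part of the per-message cost, but it shows where some of the budget
goes:

    Payload = <<0:10000000/unit:8>>,
    {EncodeUs, Encoded} = timer:tc(erlang, term_to_binary, [Payload]),
    {DecodeUs, _} = timer:tc(erlang, binary_to_term, [Encoded]),
    io:format("encode: ~p us, decode: ~p us~n", [EncodeUs, DecodeUs]).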

There is also material on Linux network tuning worth looking up, for
example:
https://medium.com/@_wmconsulting/tuning-linux-to-reach-maximum-performance-on-10-gbps-network-card-with-http-streaming-8599c9b4389d

I also remember suggestions to use regular TCP connections instead, and to
consider a user-mode network driver (kernel calls have a cost) behind a
low-level NIF, if the goal is to get the highest gigabits out of your
hardware.
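
If someone wants to try the plain TCP route before going as far as a NIF
driver, a minimal comparison could look like the sketch below (placeholder
port 5002, the node-b address from the thread, binary passive sockets, no
framing or error handling):

    %% on node-b: accept one connection and drain it as fast as possible
    {ok, L} = gen_tcp:listen(5002, [binary, {active, false}, {reuseaddr, true}]),
    {ok, S} = gen_tcp:accept(L),
    Drain = fun F() ->
                case gen_tcp:recv(S, 0) of
                    {ok, _Chunk}    -> F();
                    {error, closed} -> ok
                end
            end,
    Drain().

    %% on node-a: push 10 MB chunks in a tight loop
    {ok, C} = gen_tcp:connect("10.0.1.37", 5002, [binary, {active, false}, {nodelay, true}]),
    Chunk = <<0:10000000/unit:8>>,
    Send = fun F() -> ok = gen_tcp:send(C, Chunk), F() end,
    Send().

Comparing what dstat reports for that against the distribution numbers
below should show how much of the gap is Erlang-side overhead rather than
TCP or kernel limits.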

On Mon, 17 Jun 2019 at 16:49, Gerhard Lazu <gerhard@REDACTED> wrote:

> Hi,
>
> We are trying to understand what prevents the Erlang distribution link
> from saturating the network. Even though there is plenty of CPU, memory &
> network bandwidth, the Erlang distribution doesn't fully utilise available
> resources. Can you help us figure out why?
>
> We have a 3-node Erlang 22.0.2 cluster running on Ubuntu 16.04 x86 64bit.
>
> This is the maximum network throughput between node-a & node-b, as
> measured by iperf:
>
> iperf -t 30 -c node-b
> ------------------------------------------------------------
> Client connecting to 10.0.1.37, TCP port 5001
> TCP window size: 45.0 KByte (default)
> ------------------------------------------------------------
> [  3] local 10.0.1.36 port 43576 connected with 10.0.1.37 port 5001
> [ ID] Interval       Transfer     Bandwidth
> [  3]  0.0-30.0 sec  78.8 GBytes  22.6 Gbits/sec
>
> We ran this multiple times, in different directions and with different
> degrees of parallelism; the maximum network throughput is roughly 22 Gbit/s.
>
> We run the following command on node-a:
>
> B = fun F() -> rpc:cast('foo@REDACTED', erlang, is_binary, [<<0:10000000/unit:8>>]), F() end.
> [spawn(fun() -> B() end) || _ <- lists:seq(1, 100)].
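>
> As an aside, the bytes sent on the distribution socket itself can be
> sampled from inside the VM (this assumes the default inet_tcp_dist
> carrier, where the dist channel is an inet port):
>
> {_, DistPort} = lists:keyfind('foo@REDACTED', 1, erlang:system_info(dist_ctrl)),
> inet:getstat(DistPort, [send_oct, send_pend]).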
>
> This is what the network reports on node-a:
>
> dstat -n 1 10
> -net/total-
>  recv  send
>    0     0
>  676k  756M
>  643k  767M
>  584k  679M
>  693k  777M
>  648k  745M
>  660k  745M
>  667k  772M
>  651k  709M
>  675k  782M
>  688k  819M
>
> That roughly translates to 6 Gbit/s: the Erlang distribution link between
> node-a & node-b is maxing out at around 6 Gbit/s, or 27% of what we measure
> consistently and repeatedly outside of Erlang. Put differently, iperf is
> roughly 3.6x faster than a single distribution link. It gets better.
>
> If we start another 100 processes pumping 10Mbyte messages from node-a to
> node-c, we see the network throughput double:
>
> dstat -n 1 10
> -net/total-
>  recv  send
>    0     0
> 1303k 1463M
> 1248k 1360M
> 1332k 1458M
> 1480k 1569M
> 1339k 1455M
> 1413k 1494M
> 1395k 1431M
> 1359k 1514M
> 1438k 1564M
> 1379k 1489M
>
> So 2 distribution links - each to a separate node - utilise 12 Gbit/s out
> of the 22 Gbit/s available on node-a.
>
> What is preventing the distribution links pushing more data through? There
> is plenty of CPU & memory available (all nodes have 16 CPUs & 104GB MEM -
> n1-highmem-16):
>
> dstat -cm 1 10
> ----total-cpu-usage---- ------memory-usage-----
> usr sys idl wai hiq siq| used  buff  cach  free
>  10   6  84   0   0   1|16.3G  118M  284M 85.6G
>  20   6  73   0   0   1|16.3G  118M  284M 85.6G
>  20   6  74   0   0   0|16.3G  118M  284M 85.6G
>  18   6  76   0   0   0|16.4G  118M  284M 85.5G
>  19   6  74   0   0   1|16.4G  118M  284M 85.4G
>  17   4  78   0   0   0|16.5G  118M  284M 85.4G
>  20   6  74   0   0   0|16.5G  118M  284M 85.4G
>  19   6  74   0   0   0|16.5G  118M  284M 85.4G
>  19   5  76   0   0   1|16.5G  118M  284M 85.4G
>  18   6  75   0   0   0|16.5G  118M  284M 85.4G
>  18   6  75   0   0   0|16.6G  118M  284M 85.3G
>
> The only smoking gun is the distribution output queue buffer:
> https://grafana.gcp.rabbitmq.com/dashboard/snapshot/H329EfN3SFhsveA20ei7jC7JMFHAm8Ru?orgId=1&fullscreen&panelId=62
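>
> As a point-in-time sanity check, that buffer can also be inspected from a
> shell on node-a (this assumes the default, port-based inet_tcp_dist
> carrier):
>
> erlang:system_info(dist_buf_busy_limit),  %% dist busy limit in bytes (set via +zdbbl)
> [erlang:port_info(P, queue_size) || {_Node, P} <- erlang:system_info(dist_ctrl)].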
>
> Speaking of which, we look forward to erlang/otp#2270 being merged:
> https://github.com/erlang/otp/pull/2270
>
> All distribution metrics are available here:
> https://grafana.gcp.rabbitmq.com/dashboard/snapshot/H329EfN3SFhsveA20ei7jC7JMFHAm8Ru?orgId=1
>
> If you want to see the state of the distribution links & dist processes
> (they are all green, btw), check the point-in-time metrics (they will
> expire 15 days from today):
> https://grafana.gcp.rabbitmq.com/d/d-SFCCmZz/erlang-distribution?from=1560775955127&to=1560779424482
>
> How can we tell what is preventing the distribution link from using all
> available bandwidth?
>
> Are we missing a configuration flag? These are all the relevant beam.smp
> flags that we are using:
> https://github.com/erlang/otp/pull/2270#issuecomment-500953352
> _______________________________________________
> erlang-questions mailing list
> erlang-questions@REDACTED
> http://erlang.org/mailman/listinfo/erlang-questions
>