[erlang-questions] Erlang distribution links don't fully utilise available resources - OTP 22.0.2 - Why?

Gerhard Lazu gerhard@REDACTED
Tue Jun 18 14:04:54 CEST 2019


*TL;DR A single distribution link was measured to peak at 14.4 Gbit/s,
which is ~50% of the network maximum of 28.5 Gbit/s.*

I'm confirming Lukas' observations: using smaller payload sizes can result
in up to 14.4 Gbit/s throughput per distribution link.

Payloads of both 60KB & 120KB yield the same max throughput of 14.4 Gbit/s.

When message payloads go from 1MB to 4MB, throughput drops from 12.8 Gbit/s
to 9.6 Gbit/s.

With 10MB payloads, distribution link maxes out at 7.6 Gbit/s. This is
interesting because yesterday I couldn't get it above 6 Gbit/s.
Since this is GCP, on paper each CPU gets 2 Gbit/s, so at 16 CPUs we should
be maxing out at 32 Gbit/s. I have just benchmarked with iperf, and today
these fresh VMs are peaking at 28.5 Gbit/s (yesterday they peaked at 22.5
Gbit/s).

At 100MB payloads, the distribution link maxes out at 4.5 Gbit/s.
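
For reference, payloads of a given size are easy to construct with bit
syntax; the ~10MB one used in the code further down, and a ~60KB one, look
like this:

P10MB = <<0:10000000/unit:8>>. %% 10,000,000 zero bytes (~10MB)
P60KB = <<0:60000/unit:8>>.    %% 60,000 zero bytes (~60KB)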

All distribution-related metrics, including annotations for the various
message payloads, can be found here:
https://grafana.gcp.rabbitmq.com/dashboard/snapshot/SDr2EWgaD2KOX5i0H154UvzoHRS68FV0?orgId=1

On Tue, Jun 18, 2019 at 11:07 AM Lukas Larsson <lukas@REDACTED> wrote:

>
>
> On Mon, Jun 17, 2019 at 4:49 PM Gerhard Lazu <gerhard@REDACTED> wrote:
>
>>
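>> %% 100 processes, each endlessly rpc:cast-ing a 10MB binary to the remote node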
>> B = fun F() -> rpc:cast('foo@REDACTED', erlang, is_binary, [<<0:10000000/unit:8>>]), F() end.
>> [spawn(fun() -> B() end) || _ <- lists:seq(1, 100)].
>>
>> I wrote this code that is better able to saturate the network.
>>
> Using rpc will create bottlenecks on the receiving side, as it is the
> receiving process that does the decode. I get about 20% more traffic
> through when I avoid rpc (using the module below), which is not as much
> as I was expecting.
>
> -module(dist_perf).
>
> -export([go/0,go/1]).
>
> go() ->
>     go(100).
> go(N) ->
>     %% reporter that prints send/recv distribution throughput every 5s
>     spawn_link(fun stats/0),
>     RemoteNode = 'bar@REDACTED',
>     %% N echo processes on the remote node, each replying to its sender
>     Pids = [spawn(RemoteNode,
>                   fun F() ->
>                       receive {From, Msg} -> From ! is_binary(Msg), F() end
>                   end)
>             || _ <- lists:seq(1, N)],
>     Payload = <<0:10000000/unit:8>>, %% ~10MB of zeros
>     %% N local senders, each ping-ponging the payload with its remote peer
>     B = fun F(Pid) ->
>             Pid ! {self(), Payload},
>             receive true -> F(Pid) end
>         end,
>     [spawn(fun() -> B(Pid) end) || Pid <- Pids].
>
>
> stats() ->
>     {{_input,T0Inp},{_output,T0Out}} = erlang:statistics(io),
>     timer:sleep(5000),
>     {{_input,T1Inp},{_output,T1Out}} = erlang:statistics(io),
>     io:format("Sent ~pMB/s, Recv ~pMB/s~n",
>               [(T1Out - T0Out) / 1024 / 1024 / 5,
>                (T1Inp - T0Inp) / 1024 / 1024 / 5]),
>     stats().
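>
> To try this out, one would load the module on both nodes and call go/1
> from the sending side. Node names here are placeholders; RemoteNode in
> the module has to be pointed at the receiving node:
>
> $ erl -name foo@127.0.0.1 -setcookie perf    # sending node
> $ erl -name bar@127.0.0.1 -setcookie perf    # receiving node
>
> (foo@127.0.0.1)1> net_kernel:connect_node('bar@127.0.0.1').
> (foo@127.0.0.1)2> c(dist_perf), nl(dist_perf). %% load on all connected nodes
> (foo@127.0.0.1)3> dist_perf:go(100).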
>
> I fired up Linux perf to have a look at what was taking time and it is (as
> Dmytro said) the copying of data that takes time. There is one specific
> place that your test hits very badly:
> https://github.com/erlang/otp/blob/master/erts/emulator/beam/dist.c#L1565
>
> This is the re-assembly of fragmented distribution messages. This happens
> because the messages sent are > 64KB. I changed that limit to 64MB on my
> machine and that doubled the throughput (
> https://github.com/erlang/otp/blob/master/erts/emulator/beam/dist.h#L150).
>
> The reason why the re-assembly has to be done is that the emulator does
> not have a variant of erlang:binary_to_term/1 that takes an iovector as
> input. This could definitely be fixed and is something that we are planning
> on fixing, though it is not something we are working on right now. However,
> it will only change things in production if you send > 64KB messages.
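>
> In other words, the fragments currently have to be flattened into one
> contiguous binary before they can be decoded. In Erlang terms the effect
> is roughly the following (just a sketch; the real re-assembly happens in
> C inside the emulator):
>
> %% Fragments :: [binary()], the fragments as read off the wire
> Decode = fun(Fragments) ->
>              %% iolist_to_binary is the extra copy (step 4 in the list
>              %% below); a binary_to_term that accepted an iovector
>              %% would avoid it
>              binary_to_term(iolist_to_binary(Fragments))
>          end.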
>
> Also, just to be clear, the copying that is done by the distribution layer
> is:
>
> 1) use term_to_binary to create a buffer
> 2) use writev in inet_driver to copy buffer to kernel
> 3) use recv to copy data from kernel into a refc binary
> 4) if needed, use memcpy to re-assemble fragments
> 5) use binary_to_term to decode term
>
> (there is no copy to the destination process, as Dmytro said)
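>
> Steps 1 and 5 are the same functions you can call yourself; ignoring the
> transport in between, a distribution send is in essence just this round
> trip:
>
> %% Msg is the term being sent
> Bin = term_to_binary(Msg),   %% step 1: encode the term on the sender
> Msg = binary_to_term(Bin).   %% step 5: decode out of the refc binary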
>
> Before OTP-22 there was an additional copy at the receiving side, but that
> has been removed. So imo there is only one copy that could potentially be
> removed.
>
> If you want to decrease the overhead of saturating the network with many
> small messages, I would implement something like read_packets for tcp in
> the inet_driver. However, the real bottleneck quickly becomes memory
> allocations as each message needs two or three allocations, and when you
> receive a lot of messages those allocations add up.
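>
> For comparison, read_packets exists today for UDP sockets only; the idea
> would be an equivalent of this option for the TCP path of the driver:
>
> %% existing UDP option: drain up to 16 datagrams per socket poll
> {ok, Socket} = gen_udp:open(0, [binary, {read_packets, 16}]).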
>
> Lukas
>