[erlang-questions] Erlang distribution links don't fully utilise available resources - OTP 22.0.2 - Why?

Tue Jun 18 15:27:57 CEST 2019

You made me curious so I did some additional optimizations:
https://github.com/erlang/otp/pull/2291

They added an additional 20-25% for me.

On Tue, Jun 18, 2019 at 2:05 PM Gerhard Lazu <gerhard@REDACTED> wrote:

> *TL;DR Single distribution link was measured to peak at 14.4 Gbit/s, which
> is ~50% from network maximum, 28.5 Gbit/s.*
>
> I'm confirming Lukas' observations: using smaller payload sizes can result
> in up to 14.4 Gbit/s throughput per distribution link.
>
> Payloads of both 60KB & 120KB yield the same max throughput of 14.4 Gbit/s.
>
> When message payloads go from 1MB to 4MB, throughput drops from 12.8
> Gbit/s to 9.6 Gbit/s.
>
> With 10MB payloads, distribution link maxes out at 7.6 Gbit/s. This is
> interesting because yesterday I couldn't get it above 6 Gbit/s.
> Since this is GCP, on paper each CPU gets 2 Gbit/s. At 16 CPUs we should
> be maxing out at 32 Gbit/s.
> I have just benchmarked with iperf, and today these fresh VMs are peaking
> at 28.5 Gbit/s (yesterday they were 22.5 Gbit/s).
>
> At 100MB payloads, distribution link makes out at 4.5 Gbit/s.
>
> All distribution-related metrics, including annotations for various
> message payloads can be found here:
> https://grafana.gcp.rabbitmq.com/dashboard/snapshot/SDr2EWgaD2KOX5i0H154UvzoHRS68FV0?orgId=1
>
> On Tue, Jun 18, 2019 at 11:07 AM Lukas Larsson <lukas@REDACTED> wrote:
>
>>
>>
>> On Mon, Jun 17, 2019 at 4:49 PM Gerhard Lazu <gerhard@REDACTED> wrote:
>>
>>>
>>> B = fun F() -> rpc:cast('foo@REDACTED', erlang, is_binary, [<<0:10000000/unit:8>>]), F() end.
>>> [spawn(fun() -> B() end) || _ <- lists:seq(1, 100)].
>>>
>>> I wrote this code that is better able to saturate the network. Using rpc
>> will create bottlenecks on the receiving side as it is the receiving
>> process that does the decode.  I get about 20% more traffic through when I
>> do that, which is not as much as I was expecting.
>>
>> -module(dist_perf).
>>
>> -export([go/0,go/1]).
>>
>> go() ->
>>     go(100).
>> go(N) ->
>>     spawn_link(fun stats/0),
>>     RemoteNode = 'bar@REDACTED',
>>     Pids = [spawn(RemoteNode, fun F() -> receive {From, Msg} -> From !
>> is_binary(Msg), F() end end)
>>             || _<- lists:seq(1,N)],
>>     Payload = <<0:10000000/unit:8>>,
>>     B = fun F(Pid) -> Pid ! {self(),Payload}, receive true -> F(Pid) end
>> end,
>>     [spawn(fun() -> B(Pid) end) || Pid <- Pids ].
>>
>>
>> stats() ->
>>     {{_input,T0Inp},{_output,T0Out}} = erlang:statistics(io),
>>     timer:sleep(5000),
>>     {{_input,T1Inp},{_output,T1Out}} = erlang:statistics(io),
>>     io:format("Sent ~pMB/s, Recv ~pMB/s~n",
>>               [(T1Out - T0Out) / 1024 / 1024 / 5,
>>                (T1Inp - T0Inp) / 1024 / 1024 / 5]),
>>     stats().
>>
>> I fired up linux perf to have a look at what was taking time and it is
>> (as Dmytro said) the copying of data that takes time. There is one specific
>> places that you hit with your test very badly:
>> https://github.com/erlang/otp/blob/master/erts/emulator/beam/dist.c#L1565
>>
>> This is the re-assembly of fragmented distribution messages. This happens
>> because the messages sent are > 64kb. I change that limit to be 64MB on my
>> machine that that doubled the throughput (
>> https://github.com/erlang/otp/blob/master/erts/emulator/beam/dist.h#L150
>> ).
>>
>> The reason why the re-assembly has to be done is that the emulator does
>> not have a variant of erlang:binary_to_term/1 that takes an iovector as
>> input. This could definitely be fixed and is something that we are planning
>> on fixing, though it is not something we are working on right now. However,
>> it will only change things in production if you send > 64KB messages.
>>
>> Also, just to be clear, the copying that is done by the distribution
>> layer is:
>>
>> 1) use term_to_binary to create a buffer
>> 2) use writev in inet_driver to copy buffer to kernel
>> 3) use recv to copy data from kernel into a refc binary
>> 4) if needed, use memcpy re-assemble fragments
>> 5) use binary_to_term to decode term
>>
>> (there is no copy to the destination process as Dmyto said)
>>
>> Before OTP-22 there was an additional copy at the receiving side, but
>> that has been removed. So imo there is only one copy that could potentially
>> be removed.
>>
>> If you want to decrease the overhead of saturating the network with many
>> small messages I would implement something like read_packets for tcp in the
>> inet_driver. However, the real bottleneck quickly becomes memory
>> allocations as each message needs two or three allocations, and when you
>> receive a lot of messages those allocations add up.
>>
>> Lukas
>>
> _______________________________________________
> erlang-questions mailing list
> erlang-questions@REDACTED
> http://erlang.org/mailman/listinfo/erlang-questions
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://erlang.org/pipermail/erlang-questions/attachments/20190618/20e970b4/attachment.htm>