<div dir="ltr"><div><b>TL;DR Single distribution link was measured to peak at 14.4 Gbit/s, which is ~50% from network maximum, 28.5 Gbit/s.</b></div><div><br></div>I'm confirming Lukas' observations: using smaller payload sizes can result in up to 14.4 Gbit/s throughput per distribution link.<div><br></div><div>Payloads of both 60KB & 120KB yield the same max throughput of 14.4 Gbit/s.</div><div><br></div><div>When message payloads go from 1MB to 4MB, throughput drops from 12.8 Gbit/s to 9.6 Gbit/s.</div><div><br></div><div>With 10MB payloads, distribution link maxes out at 7.6 Gbit/s. This is interesting because yesterday I couldn't get it above 6 Gbit/s.</div><div>Since this is GCP, on paper each CPU gets 2 Gbit/s. At 16 CPUs we should be maxing out at 32 Gbit/s.</div><div>I have just benchmarked with iperf, and today these fresh VMs are peaking at 28.5 Gbit/s (yesterday they were 22.5 Gbit/s).</div><div><br></div><div>At 100MB payloads, distribution link makes out at 4.5 Gbit/s.<br><div><br></div><div>All distribution-related metrics, including annotations for various message payloads can be found here: <a href="https://grafana.gcp.rabbitmq.com/dashboard/snapshot/SDr2EWgaD2KOX5i0H154UvzoHRS68FV0?orgId=1">https://grafana.gcp.rabbitmq.com/dashboard/snapshot/SDr2EWgaD2KOX5i0H154UvzoHRS68FV0?orgId=1</a></div></div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Tue, Jun 18, 2019 at 11:07 AM Lukas Larsson <<a href="mailto:lukas@erlang.org">lukas@erlang.org</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div dir="ltr"><br></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Mon, Jun 17, 2019 at 4:49 PM Gerhard Lazu <<a href="mailto:gerhard@lazu.co.uk" target="_blank">gerhard@lazu.co.uk</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><p style="box-sizing:border-box;margin-bottom:16px;color:rgb(36,41,46);font-family:-apple-system,BlinkMacSystemFont,"Segoe UI",Helvetica,Arial,sans-serif,"Apple Color Emoji","Segoe UI Emoji","Segoe UI Symbol";font-size:16px;margin-top:0px"><br></p><pre style="box-sizing:border-box;font-family:SFMono-Regular,Consolas,"Liberation Mono",Menlo,Courier,monospace;font-size:13.6px;margin-top:0px;margin-bottom:16px;padding:16px;overflow:auto;line-height:1.45;background-color:rgb(246,248,250);border-radius:3px;color:rgb(36,41,46)"><code style="box-sizing:border-box;font-family:SFMono-Regular,Consolas,"Liberation Mono",Menlo,Courier,monospace;padding:0px;margin:0px;background-color:transparent;border-radius:3px;word-break:normal;border:0px;display:inline;overflow:visible;line-height:inherit">B = fun F() -> rpc:cast('foo@node-b', erlang, is_binary, [<<0:10000000/unit:8>>]), F() end.
[spawn(fun() -> B() end) || _ <- lists:seq(1, 100)].

I wrote this code that is better able to saturate the network. Using rpc will create bottlenecks on the receiving side as it is the receiving process that does the decode. I get about 20% more traffic through when I do that, which is not as much as I was expecting.

-module(dist_perf).

-export([go/0, go/1]).

go() ->
    go(100).
go(N) ->
    spawn_link(fun stats/0),
    RemoteNode = 'bar@elxd3291v0k',
    Pids = [spawn(RemoteNode,
                  fun F() -> receive {From, Msg} -> From ! is_binary(Msg), F() end end)
            || _ <- lists:seq(1, N)],
    Payload = <<0:10000000/unit:8>>,
    B = fun F(Pid) -> Pid ! {self(), Payload}, receive true -> F(Pid) end end,
    [spawn(fun() -> B(Pid) end) || Pid <- Pids].

stats() ->
    {{_input, T0Inp}, {_output, T0Out}} = erlang:statistics(io),
    timer:sleep(5000),
    {{_input, T1Inp}, {_output, T1Out}} = erlang:statistics(io),
    io:format("Sent ~pMB/s, Recv ~pMB/s~n",
              [(T1Out - T0Out) / 1024 / 1024 / 5,
               (T1Inp - T0Inp) / 1024 / 1024 / 5]),
    stats().
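To run it, something like this should work (a sketch; host and node names are only examples, and the remote node name is hardcoded in go/1):

%% On the remote host: erl -sname bar   (so that the node is 'bar@elxd3291v0k')
%% On the local host:  erl -sname foo, then in its shell:
c(dist_perf).                     %% compile the module locally
net_adm:ping('bar@elxd3291v0k'). %% connect the two nodes
nl(dist_perf).                    %% load dist_perf on all connected nodes; the remotely spawned funs need its code there
dist_perf:go(100).                %% 100 concurrent senders; Sent/Recv MB/s is printed every 5 seconds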
I fired up linux perf to have a look at what was taking time and it is (as Dmytro said) the copying of data that takes time. There is one specific place that your test hits very badly: https://github.com/erlang/otp/blob/master/erts/emulator/beam/dist.c#L1565

This is the re-assembly of fragmented distribution messages. It happens because the messages sent are > 64KB. I changed that limit to 64MB on my machine and that doubled the throughput (https://github.com/erlang/otp/blob/master/erts/emulator/beam/dist.h#L150).

The reason why the re-assembly has to be done is that the emulator does not have a variant of erlang:binary_to_term/1 that takes an iovector as input. This could definitely be fixed and is something that we are planning on fixing, though it is not something we are working on right now. However, it will only change things in production if you send > 64KB messages.

Also, just to be clear, the copying that is done by the distribution layer is:

1) use term_to_binary to create a buffer
2) use writev in the inet_driver to copy the buffer to the kernel
3) use recv to copy data from the kernel into a refc binary
4) if needed, use memcpy to re-assemble fragments
5) use binary_to_term to decode the term

(there is no copy to the destination process, as Dmytro said)

Before OTP-22 there was an additional copy at the receiving side, but that has been removed. So imo there is only one copy that could potentially be removed.

If you want to decrease the overhead of saturating the network with many small messages, I would implement something like read_packets for tcp in the inet_driver (the existing UDP-only option is sketched below). However, the real bottleneck quickly becomes memory allocations, as each message needs two or three allocations, and when you receive a lot of messages those allocations add up.
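For reference, read_packets exists today only for UDP sockets; a minimal sketch of what it looks like (port number and packet count are arbitrary):

%% {read_packets, N} lets the inet driver read up to N UDP packets in one go
%% when data is available; there is no TCP equivalent today.
{ok, Socket} = gen_udp:open(9000, [binary, {active, true}, {read_packets, 64}]).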
Lukas