[erlang-questions] Erlang distribution links don't fully utilise available resources - OTP 22.0.2 - Why?

Tue Jun 18 12:07:02 CEST 2019

On Mon, Jun 17, 2019 at 4:49 PM Gerhard Lazu <gerhard@REDACTED> wrote:

>
> B = fun F() -> rpc:cast('foo@REDACTED', erlang, is_binary, [<<0:10000000/unit:8>>]), F() end.
> [spawn(fun() -> B() end) || _ <- lists:seq(1, 100)].
>
> I wrote this code that is better able to saturate the network. Using rpc
will create bottlenecks on the receiving side as it is the receiving
process that does the decode.  I get about 20% more traffic through when I
do that, which is not as much as I was expecting.

-module(dist_perf).

-export([go/0,go/1]).

go() ->
    go(100).
go(N) ->
    spawn_link(fun stats/0),
    RemoteNode = 'bar@REDACTED',
    Pids = [spawn(RemoteNode, fun F() -> receive {From, Msg} -> From !
is_binary(Msg), F() end end)
            || _<- lists:seq(1,N)],
    Payload = <<0:10000000/unit:8>>,
    B = fun F(Pid) -> Pid ! {self(),Payload}, receive true -> F(Pid) end
end,
    [spawn(fun() -> B(Pid) end) || Pid <- Pids ].

stats() ->
    {{_input,T0Inp},{_output,T0Out}} = erlang:statistics(io),
    timer:sleep(5000),
    {{_input,T1Inp},{_output,T1Out}} = erlang:statistics(io),
    io:format("Sent ~pMB/s, Recv ~pMB/s~n",
              [(T1Out - T0Out) / 1024 / 1024 / 5,
               (T1Inp - T0Inp) / 1024 / 1024 / 5]),
    stats().

I fired up linux perf to have a look at what was taking time and it is (as
Dmytro said) the copying of data that takes time. There is one specific
places that you hit with your test very badly:
https://github.com/erlang/otp/blob/master/erts/emulator/beam/dist.c#L1565

This is the re-assembly of fragmented distribution messages. This happens
because the messages sent are > 64kb. I change that limit to be 64MB on my
machine that that doubled the throughput (
https://github.com/erlang/otp/blob/master/erts/emulator/beam/dist.h#L150).

The reason why the re-assembly has to be done is that the emulator does not
have a variant of erlang:binary_to_term/1 that takes an iovector as input.
This could definitely be fixed and is something that we are planning on
fixing, though it is not something we are working on right now. However, it
will only change things in production if you send > 64KB messages.

Also, just to be clear, the copying that is done by the distribution layer
is:

1) use term_to_binary to create a buffer
2) use writev in inet_driver to copy buffer to kernel
3) use recv to copy data from kernel into a refc binary
4) if needed, use memcpy re-assemble fragments
5) use binary_to_term to decode term

(there is no copy to the destination process as Dmyto said)

Before OTP-22 there was an additional copy at the receiving side, but that
has been removed. So imo there is only one copy that could potentially be
removed.

If you want to decrease the overhead of saturating the network with many
small messages I would implement something like read_packets for tcp in the
inet_driver. However, the real bottleneck quickly becomes memory
allocations as each message needs two or three allocations, and when you
receive a lot of messages those allocations add up.

Lukas
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://erlang.org/pipermail/erlang-questions/attachments/20190618/b4cb87ba/attachment.htm>