[erlang-questions] Inter-node communication bottleneck

Jihyun Yu yjh0502@REDACTED
Mon Sep 8 23:46:16 CEST 2014


Hi,

Summarizing the benchmark result in terms of bandwidth was misleading and
my mistake. I wanted to talk about the overhead of small messages, and how
many inter-node messages Erlang can handle.

My initial question was about the bottleneck in inter-node messaging [1],
and with some tuning, throughput could be increased to about 0.5M messages
per second. I thought the problem was in using a single TCP connection to
handle all messages from multiple processes, and that throughput (# of
msgs/sec) might be increased with multiple TCP connections [2].

It seems that Erlang messaging between multiple nodes does not scale with
the number of CPUs, even with multiple TCP connections. For example, in my
test environment, P=2 shows 1.5M msgs/sec with ~18% usr+sys CPU, and P=12
shows 1.8M msgs/sec with ~90% usr+sys CPU. It seems that the Erlang VM
cannot handle more than 2M messages per second without explicit tuning
such as message batching.
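
To be concrete, by "message batching" I mean something simple like the
following sketch, where a local proxy process accumulates messages and
forwards them to the remote side as a single list. The module name and
parameters are made up for illustration and are not from my test code:

-module(batcher).
-export([start/2, send/2]).

%% Start a proxy that forwards batches of at most FlushEvery messages
%% to RemotePid, flushing after a short idle timeout otherwise.
start(RemotePid, FlushEvery) ->
    spawn(fun() -> loop(RemotePid, FlushEvery, []) end).

send(Batcher, Msg) ->
    Batcher ! {msg, Msg},
    ok.

loop(RemotePid, FlushEvery, Acc) when length(Acc) >= FlushEvery ->
    RemotePid ! {batch, lists:reverse(Acc)},
    loop(RemotePid, FlushEvery, []);
loop(RemotePid, FlushEvery, Acc) ->
    receive
        {msg, Msg} ->
            loop(RemotePid, FlushEvery, [Msg | Acc])
    after 10 ->
        case Acc of
            [] -> loop(RemotePid, FlushEvery, []);
            _  ->
                RemotePid ! {batch, lists:reverse(Acc)},
                loop(RemotePid, FlushEvery, [])
        end
    end.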

2M msgs/sec, or even 0.5M msgs/sec, might be enough for most use cases,
so it might not be a problem.

[1] http://erlang.org/pipermail/erlang-questions/2014-August/080598.html
[2] http://erlang.org/pipermail/erlang-questions/2014-August/080643.html


On Mon, Sep 08, 2014 at 05:51:24PM +0200, Jesper Louis Andersen wrote:
> On Mon, Sep 8, 2014 at 2:12 PM, Jihyun Yu <yjh0502@REDACTED> wrote:
> 
> > I attached test source code so you can reproduce the result. Please tell
> > me if there is an error on configurations/test codes/...
> >
> 
> So I toyed around with this example for a while. My changes are here:
> 
> https://gist.github.com/jlouis/0cbdd8581fc0651827d0
> 
> Test machine is a fairly old laptop:
> 
> [jlouis@REDACTED ~/test_tcp]$ uname -a
> FreeBSD dragon.lan 10.0-RELEASE-p7 FreeBSD 10.0-RELEASE-p7 #0: Tue Jul  8
> 06:37:44 UTC 2014
> root@REDACTED:/usr/obj/usr/src/sys/GENERIC
>  amd64
> [jlouis@REDACTED ~]$ sysctl hw.model
> hw.model: Intel(R) Core(TM)2 Duo CPU     P8600  @ 2.40GHz
> 
> All measurements are happening in both directions. We are sending bits to
> the kernel and receiving bits from the kernel as well.
> 
> The base rate of this system was around 23 megabit per second running 4
> sender processes and 4 receiver processes. Adding [{delay_send, true}]
> immediately sent this to 31 megabit per second, which kind of hints at
> what is going on. This is not a bandwidth problem, it is a problem of
> latency and synchronous communication. Utilizing the {active, N} feature
> in 17+ by Steve Vinoski removes the synchronicity bottleneck in the receiver
> direction. Eprof shows that CPU utilization falls from 10% per process to
> 1.5% on this machine. And then we run at 58 megabit.
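> 
> For illustration, the receive side with {active, N} looks roughly like
> this; delay_send goes into the sender's socket options. This is a
> sketch with made-up names, not the exact code from the gist:
> 
> start_receiver(Host, Port) ->
>     {ok, Socket} = gen_tcp:connect(Host, Port, [binary, {active, 100}]),
>     recv_loop(Socket).
> 
> recv_loop(Socket) ->
>     receive
>         {tcp, Socket, _Data} ->
>             recv_loop(Socket);
>         {tcp_passive, Socket} ->
>             %% The counter ran out; re-arm it instead of doing a
>             %% synchronous gen_tcp:recv/2 per packet.
>             ok = inet:setopts(Socket, [{active, 100}]),
>             recv_loop(Socket);
>         {tcp_closed, Socket} ->
>             ok
>     end.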
> 
> The reason we don't run any faster is due to the send path. A
> gen_tcp:send/2 only continues when the socket responds back that the
> message was sent with success. Since we only have one process per core, we
> end up dying of messaging overhead due to the messages being small and the
> concurrency of the system being bad. You can hack yourself out of this one
> with a bit of trickery and port_command/3 but I am not sure it is worth it.
> I also suspect this is why it doesn't help with a higher watermark. Your
> 4/12 processes are waiting for the underlying layer to send out data before
> they will send off the next piece of data to the underlying socket. Then the
> kernel gets to work, and gives the data to the receivers, which then get to
> consume it. At no point are the TCP send buffers filled up, really.
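> 
> The port_command/3 trickery, roughly, is to hand the data to the port
> directly and deal with the driver's reply yourself rather than waiting
> for it inside gen_tcp:send/2. A sketch, relying on the undocumented
> {inet_reply, Socket, Status} message from the inet driver, so treat it
> as an illustration only:
> 
> %% Queue Data on the socket without blocking; returns false if the
> %% port is busy instead of suspending the caller.
> async_send(Socket, Data) ->
>     erlang:port_command(Socket, Data, [nosuspend]).
> 
> %% The driver still replies once per send; drain the replies later,
> %% e.g. every K messages, instead of waiting on each one.
> drain_replies(Socket) ->
>     receive
>         {inet_reply, Socket, ok} ->
>             drain_replies(Socket);
>         {inet_reply, Socket, {error, Reason}} ->
>             {error, Reason}
>     after 0 ->
>         ok
>     end.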
> 
> To play the bandwidth game, you need to saturate your outgoing TCP buffers,
> so when the kernel goes to work, it has a lot of stuff to work with.
> 
> What you are seeing is a common symptom: you are trading off latency,
> bandwidth utilization and correctness for each other. For messages of this
> small size and no other processing, you are essentially measuring a code
> path which includes a lot of context switches: between erlang processes and
> back'n'forth to the kernel. Since the concurrency of the system is fairly
> low (P is small), and we have a tight sending loop, you are going to lose
> with Erlang, every time. In a large system, the overhead you are seeing is
> fairly constant and thus it becomes irrelevant to the measurement.
> 
> If we change send_n to contain bigger terms[0]:
> 
> send_n(_, 0) -> ok;
> send_n(Socket, N) ->
>     N1 = {N, N},
>     N2 = {N1, N1},
>     N3 = {N2, N2},
>     N4 = {N3, N3},
>     N5 = {N4, N4},
>     N6 = {N5, N5},
>     N7 = {N6, N6},
>     gen_tcp:send(Socket, term_to_binary(N7)),
>     send_n(Socket, N-1).
> 
> Then we hit 394 megabit on this machine. Furthermore, we can't even
> maximize the two CPU cores anymore as they are only running at 50%
> utilization. So now we are hitting the OS bottlenecks instead, which you
> have to tune for otherwise. In this example, we avoid the synchronicity of
> the gen_tcp:send/2 path since we are sending more work to the underlying
> system. You can probably run faster, but then you need to tune the TCP
> socket options as they are not made for gigabit speed operation by default.
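> 
> The knobs I have in mind are the usual inet socket options; something
> along these lines, where the numbers are placeholders to be measured,
> not recommendations:
> 
> connect_tuned(Host, Port) ->
>     Opts = [binary,
>             {sndbuf, 4 * 1024 * 1024},     %% kernel send buffer
>             {recbuf, 4 * 1024 * 1024},     %% kernel receive buffer
>             {buffer, 1024 * 1024},         %% driver-level user buffer
>             {high_watermark, 1024 * 1024}, %% port signals busy above this
>             {low_watermark, 256 * 1024},   %% ...and becomes unbusy below this
>             {delay_send, true},
>             {active, 100}],
>     gen_tcp:connect(Host, Port, Opts).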
> 
> In order to figure all this out, I just ran
> 
> eprof:profile(fun() -> tcp_test:run_tcp(2000, 4, 1000*10) end).
> eprof:log("foo.txt").
> eprof:analyze().
> 
> and went to work by analyzing the profile output.
> 
> Docker or not, I believe there are other factors at play here...
> 
> [0] Small nice persistence trick used here: building a tree of exponential
> size in linear time.
> 
> -- 
> J.


