[erlang-questions] Inter-node communication bottleneck

Jesper Louis Andersen jesper.louis.andersen@REDACTED
Mon Sep 8 17:51:24 CEST 2014


On Mon, Sep 8, 2014 at 2:12 PM, Jihyun Yu <yjh0502@REDACTED> wrote:

> I attached test source code so you can reproduce the result. Please tell
> me if there is an error on configurations/test codes/...
>

So I toyed around with this example for a while. My changes are here:

https://gist.github.com/jlouis/0cbdd8581fc0651827d0

Test machine is a fairly old laptop:

[jlouis@REDACTED ~/test_tcp]$ uname -a
FreeBSD dragon.lan 10.0-RELEASE-p7 FreeBSD 10.0-RELEASE-p7 #0: Tue Jul  8
06:37:44 UTC 2014
root@REDACTED:/usr/obj/usr/src/sys/GENERIC
 amd64
[jlouis@REDACTED ~]$ sysctl hw.model
hw.model: Intel(R) Core(TM)2 Duo CPU     P8600  @ 2.40GHz

All measurements are happening in both directions. We are sending bits to
the kernel and receiving bits from the kernel as well.

The base rate of this system was around 23 megabit per second running 4
sender processes and 4 receiver processes. Adding [{delay_send, true}]
immediately pushed this to 31 megabit per second, which kind of hints at
what is going on. This is not a bandwidth problem; it is a problem of
latency and synchronous communication. Utilizing the {active, N} feature in
17+ by Steve Vinoski removes the synchronicity bottleneck in the receive
direction. Eprof shows that CPU utilization falls from 10% per process to
1.5% on this machine. And then we run at 58 megabit.
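For reference, the {active, N} pattern looks roughly like the following. This is a minimal sketch, not the test code from the gist; the function names and the choice of N = 512 are my own for illustration. The point is that the socket delivers N packets asynchronously, and we only re-arm it when it tells us the quota is used up, instead of making one synchronous recv round-trip per packet:

```erlang
%% Sketch of an {active, N} receive loop. The socket pushes up to N
%% messages to us; when the count reaches zero it sends {tcp_passive,
%% Socket} and we re-arm it with another setopts call.
receiver(Socket) ->
    ok = inet:setopts(Socket, [{active, 512}]),
    receive_loop(Socket, 0).

receive_loop(Socket, Count) ->
    receive
        {tcp, Socket, _Data} ->
            receive_loop(Socket, Count + 1);
        {tcp_passive, Socket} ->
            %% Quota exhausted; re-arm and continue.
            ok = inet:setopts(Socket, [{active, 512}]),
            receive_loop(Socket, Count);
        {tcp_closed, Socket} ->
            {ok, Count}
    end.
```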

The reason we don't run any faster is the send path. A gen_tcp:send/2 call
only continues when the socket reports back that the message was sent
successfully. Since we only have one process per core, we end up dying of
messaging overhead, because the messages are small and the concurrency of
the system is low. You can hack yourself out of this one with a bit of
trickery and port_command/3, but I am not sure it is worth it. I also
suspect this is why a higher watermark doesn't help. Your 4/12 processes
are waiting for the underlying layer to send out data before they will hand
the next piece of data to the underlying socket. Then the kernel gets to
work and gives the data to the receivers, which then get to consume it. At
no point are the TCP send buffers really filled up.
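The port_command/3 trickery mentioned above looks roughly like this sketch (my own illustration, not tested code): a gen_tcp socket is a port, so you can enqueue data on it directly and collect the {inet_reply, Socket, Status} messages lazily, keeping several sends in flight per process instead of blocking on each one:

```erlang
%% Sketch only. erlang:port_command/3 queues Data on the socket's port
%% without the synchronous wait in gen_tcp:send/2; the port later sends
%% back {inet_reply, Socket, Status}, which we drain when convenient.
async_send(Socket, Data) ->
    true = erlang:port_command(Socket, Data, []),
    ok.

%% Drain any replies that have arrived; return the first error, if any.
flush_replies(Socket) ->
    receive
        {inet_reply, Socket, ok} ->
            flush_replies(Socket);
        {inet_reply, Socket, Error} ->
            Error
    after 0 ->
        ok
    end.
```

The cost is that you now own the error handling and flow control that gen_tcp:send/2 did for you, which is why I am not sure it is worth it.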

To play the bandwidth game, you need to saturate your outgoing TCP buffers,
so when the kernel goes to work, it has a lot of stuff to work with.

What you are seeing is a common symptom: you are trading off latency,
bandwidth utilization and correctness for each other. For messages of this
small size and no other processing, you are essentially measuring a code
path which includes a lot of context switches: between erlang processes and
back'n'forth to the kernel. Since the concurrency of the system is fairly
low (P is small), and we have a tight sending loop, you are going to lose
with Erlang, every time. In a large system, the overhead you are seeing is
fairly constant and thus it becomes irrelevant to the measurement.

If we change send_n to contain bigger terms[0]:

send_n(_, 0) -> ok;
send_n(Socket, N) ->
    %% Double the term seven times: a tree with 2^7 leaves, built in
    %% linear time thanks to sharing (see [0]).
    N1 = {N, N},
    N2 = {N1, N1},
    N3 = {N2, N2},
    N4 = {N3, N3},
    N5 = {N4, N4},
    N6 = {N5, N5},
    N7 = {N6, N6},
    ok = gen_tcp:send(Socket, term_to_binary(N7)),
    send_n(Socket, N-1).

Then we hit 394 megabit on this machine. Furthermore, we can't even
saturate the two CPU cores anymore, as they are only running at 50%
utilization. So now we are hitting the OS bottlenecks instead, which you
have to tune for otherwise. In this example, we avoid the synchronicity of
the gen_tcp:send/2 path since we are sending more work to the underlying
system per call. You can probably run faster, but then you need to tune the
TCP socket options, as they are not made for gigabit speed operation by
default.
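For the socket tuning, the usual knobs are the buffer-related inet options. A sketch, with purely illustrative values (not recommendations; the right numbers depend on your bandwidth-delay product):

```erlang
%% Illustrative buffer tuning for a high-bandwidth link. Host and Port
%% are placeholders. sndbuf/recbuf are the kernel-side buffers; buffer
%% is the driver's user-level buffer and should generally be at least
%% as large as the larger of the two.
Opts = [binary,
        {sndbuf, 4 * 1024 * 1024},
        {recbuf, 4 * 1024 * 1024},
        {buffer, 4 * 1024 * 1024},
        {delay_send, true}],
{ok, Socket} = gen_tcp:connect(Host, Port, Opts).
```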

In order to figure all this out, I just ran

eprof:profile(fun() -> tcp_test:run_tcp(2000, 4, 1000*10) end).
eprof:log("foo.txt").
eprof:analyze().

and went to work by analyzing the profile output.

Docker or not, I believe there are other factors at play here...

[0] A nice persistence trick is used here: building a tree of exponential
size in linear time, since the subterms are shared.

-- 
J.