[erlang-questions] Inter-node communication bottleneck

Dror Mein drormein@REDACTED
Mon Sep 8 18:11:17 CEST 2014


Do you have any advice on the receive side for those who haven't migrated to 17 yet?
Has anyone tried to hack around the synchronous behaviour of gen_tcp:send?
I can always aggregate messages to achieve higher throughput, but why not add an asynchronous gen_tcp:send option?
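
(What I mean by aggregating is roughly the following - a rough sketch of mine, where the {send, Bin} message shape and the batch cap of 100 are just placeholders: block for the first message, drain whatever else is queued, and hand the whole batch to gen_tcp:send/2 as one iolist.)

collect_and_send(Socket) ->
    %% Block for the first message, then drain up to 99 more.
    receive
        {send, Bin} -> drain(Socket, [Bin], 99)
    end.

drain(Socket, Acc, 0) ->
    ok = gen_tcp:send(Socket, lists:reverse(Acc)),
    collect_and_send(Socket);
drain(Socket, Acc, Left) ->
    receive
        {send, Bin} -> drain(Socket, [Bin | Acc], Left - 1)
    after 0 ->
        %% Mailbox empty: one call to the driver for the whole batch.
        ok = gen_tcp:send(Socket, lists:reverse(Acc)),
        collect_and_send(Socket)
    end.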


On Monday, September 8, 2014 6:51 PM, Jesper Louis Andersen <jesper.louis.andersen@REDACTED> wrote:
 




On Mon, Sep 8, 2014 at 2:12 PM, Jihyun Yu <yjh0502@REDACTED> wrote:

> I attached test source code so you can reproduce the result. Please tell
> me if there is an error in the configurations/test codes/...
So I toyed around with this example for a while. My changes are here:

https://gist.github.com/jlouis/0cbdd8581fc0651827d0

Test machine is a fairly old laptop:

[jlouis@REDACTED ~/test_tcp]$ uname -a
FreeBSD dragon.lan 10.0-RELEASE-p7 FreeBSD 10.0-RELEASE-p7 #0: Tue Jul  8 06:37:44 UTC 2014     root@REDACTED:/usr/obj/usr/src/sys/GENERIC  amd64
[jlouis@REDACTED ~]$ sysctl hw.model
hw.model: Intel(R) Core(TM)2 Duo CPU     P8600  @ 2.40GHz

All measurements are happening in both directions. We are sending bits to the kernel and receiving bits from the kernel as well.

The base rate of this system was around 23 megabit per second with 4 sender processes and 4 receiver processes. Adding [{delay_send, true}] immediately pushed this to 31 megabit per second, which hints at what is going on: this is not a bandwidth problem, it is a problem of latency and synchronous communication. Utilizing the {active, N} feature in 17+ by Steve Vinoski removes the synchronicity bottleneck in the receive direction. Eprof shows that CPU utilization falls from 10% per process to 1.5% on this machine, and then we run at 58 megabit.
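
For reference, a minimal receive loop using {active, N} looks roughly like this (my own illustration, not the code from the gist; the quota of 100 is arbitrary):

recv_loop(Socket) ->
    %% Ask the driver for up to 100 packets without a setopts round trip
    %% per packet.
    ok = inet:setopts(Socket, [{active, 100}]),
    recv_loop(Socket, 0).

recv_loop(Socket, Count) ->
    receive
        {tcp, Socket, _Data} ->
            recv_loop(Socket, Count + 1);
        {tcp_passive, Socket} ->
            %% Quota exhausted; re-arm and keep going.
            ok = inet:setopts(Socket, [{active, 100}]),
            recv_loop(Socket, Count);
        {tcp_closed, Socket} ->
            {ok, Count}
    end.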

The reason we don't run any faster is the send path. gen_tcp:send/2 only returns once the socket has reported that the data was handed off successfully. Since we only have one process per core, we end up drowning in messaging overhead because the messages are small and the concurrency of the system is low. You can hack yourself out of this one with a bit of trickery around port_command/3, but I am not sure it is worth it. I also suspect this is why a higher watermark doesn't help: your 4/12 processes wait for the underlying layer to send out data before handing the next piece of data to the socket. Then the kernel gets to work and gives the data to the receivers, which consume it. At no point do the TCP send buffers really fill up.
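
For the curious, the port_command/3 trickery looks roughly like this. This is my sketch, not code from the gist, and it leans on the inet driver sending {inet_reply, Socket, Status} back to the caller, which is an internal detail rather than a documented API:

async_send(Socket, Data) ->
    %% Hand the data to the driver without waiting for the result. With
    %% nosuspend we get false instead of being suspended when the port is
    %% busy (i.e. past the high watermark).
    case erlang:port_command(Socket, Data, [nosuspend]) of
        true  -> ok;
        false -> {error, busy}
    end.

flush_replies(Socket) ->
    %% Drain the driver's acknowledgements elsewhere in the loop so the
    %% mailbox does not grow without bound.
    receive
        {inet_reply, Socket, ok}     -> flush_replies(Socket);
        {inet_reply, Socket, Status} -> {error, Status}
    after 0 ->
        ok
    end.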

To play the bandwidth game, you need to saturate your outgoing TCP buffers, so when the kernel goes to work, it has a lot of stuff to work with.

What you are seeing is a common symptom: you are trading off latency, bandwidth utilization and correctness against each other. For messages this small, with no other processing, you are essentially measuring a code path which includes a lot of context switches: between Erlang processes and back and forth to the kernel. Since the concurrency of the system is fairly low (P is small) and we have a tight sending loop, you are going to lose with Erlang every time. In a large system the overhead you are seeing is fairly constant, and thus it becomes irrelevant to the measurement.

If we change send_n to contain bigger terms[0]:

send_n(_, 0) -> ok;
send_n(Socket, N) ->
    %% Each pairing doubles the term. The heap copy stays small because the
    %% subterms are shared, but term_to_binary/1 expands the sharing, so the
    %% encoded payload is far bigger than the tiny messages above (see [0]).
    N1 = {N, N},
    N2 = {N1, N1},
    N3 = {N2, N2},
    N4 = {N3, N3},
    N5 = {N4, N4},
    N6 = {N5, N5},
    N7 = {N6, N6},
    gen_tcp:send(Socket, term_to_binary(N7)),
    send_n(Socket, N-1).

Then we hit 394 megabit on this machine. Furthermore, we can't even saturate the two CPU cores anymore; they only run at about 50% utilization. So now we are hitting OS bottlenecks instead, which you would then have to tune around. In this example we avoid the synchronicity of the gen_tcp:send/2 path because each call hands more work to the underlying system. You can probably run faster, but then you need to tune the TCP socket options, as they are not made for gigabit-speed operation by default.
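
If you go down that road, the knobs are ordinary connect/setopts options; something like the following (my sketch, and the buffer sizes and watermarks are only illustrative, not recommendations):

connect_tuned(Host, Port) ->
    gen_tcp:connect(Host, Port,
                    [binary,
                     {delay_send, true},            %% let the driver coalesce small writes
                     {sndbuf, 1 bsl 20},            %% 1 MB kernel send buffer
                     {recbuf, 1 bsl 20},            %% 1 MB kernel receive buffer
                     {high_watermark, 512 * 1024},  %% let more unsent data queue in the port
                     {low_watermark, 128 * 1024}]). %% resume senders once the queue drains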

In order to figure all this out, I just ran

eprof:profile(fun() -> tcp_test:run_tcp(2000, 4, 1000*10) end).
eprof:log("foo.txt").
eprof:analyze().

and went to work by analyzing the profile output.

Docker or not, I believe there are other factors at play here...
[0] A nice little persistence trick is used here: building a tree of exponential size in linear time.

-- 
J. 

