On Mon, Sep 8, 2014 at 2:12 PM, Jihyun Yu <yjh0502@gmail.com> wrote:
> I attached test source code so you can reproduce the result. Please tell
> me if there is an error on configurations/test codes/...

So I toyed around with this example for a while. My changes are here:

https://gist.github.com/jlouis/0cbdd8581fc0651827d0

The test machine is a fairly old laptop:

[jlouis@dragon ~/test_tcp]$ uname -a
FreeBSD dragon.lan 10.0-RELEASE-p7 FreeBSD 10.0-RELEASE-p7 #0: Tue Jul 8 06:37:44 UTC 2014 root@amd64-builder.daemonology.net:/usr/obj/usr/src/sys/GENERIC amd64
[jlouis@dragon ~]$ sysctl hw.model
hw.model: Intel(R) Core(TM)2 Duo CPU P8600 @ 2.40GHz

All measurements happen in both directions: we are sending bits to the kernel and receiving bits from the kernel as well.

The base rate of this system was around 23 megabit per second with 4 sender processes and 4 receiver processes. Adding [{delay_send, true}] immediately pushed this to 31 megabit per second, which hints at what is going on: this is not a bandwidth problem, it is a problem of latency and synchronous communication. Using the {active, N} feature in 17+, by Steve Vinoski, removes the synchronicity bottleneck in the receive direction (sketched below). Eprof shows that CPU utilization falls from 10% per process to 1.5% on this machine, and then we run at 58 megabit.

The reason we don't run any faster is the send path. A gen_tcp:send/2 only continues once the socket reports back that the message was sent successfully. Since we only have one process per core, we end up dying of messaging overhead, because the messages are small and the concurrency of the system is low. You can hack yourself out of this with a bit of trickery and port_command/3 (see the sketch below), but I am not sure it is worth it. I also suspect this is why a higher watermark does not help: your 4/12 processes are waiting for the underlying layer to send out data before they hand the next piece of data to the underlying socket. Then the kernel gets to work and gives the data to the receivers, which then get to consume it. At no point are the TCP send buffers really filled up.

To play the bandwidth game, you need to saturate your outgoing TCP buffers, so that when the kernel goes to work, it has a lot of stuff to work with.

What you are seeing is a common symptom: you are trading off latency, bandwidth utilization and correctness against each other. For messages this small and with no other processing, you are essentially measuring a code path that includes a lot of context switches: between Erlang processes and back and forth to the kernel. Since the concurrency of the system is fairly low (P is small) and we have a tight sending loop, you are going to lose with Erlang, every time. In a large system, the overhead you are seeing is fairly constant, so it becomes irrelevant to the measurement.
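
To make the receive side concrete, here is a minimal sketch of the kind of {active, N} loop I mean. It is not lifted from the gist, and handle_data/1 is just a placeholder name; it also assumes the looping process is the controlling process of the socket:

start_receiver(Socket) ->
    %% Grant the socket 100 data messages before it falls back to passive mode.
    ok = inet:setopts(Socket, [{active, 100}]),
    recv_loop(Socket).

recv_loop(Socket) ->
    receive
        {tcp, Socket, Data} ->
            handle_data(Data),
            recv_loop(Socket);
        {tcp_passive, Socket} ->
            %% The grant of 100 is used up; hand out another batch.
            ok = inet:setopts(Socket, [{active, 100}]),
            recv_loop(Socket);
        {tcp_closed, Socket} ->
            ok
    end.

handle_data(_Data) ->
    ok.  %% placeholder: count bytes, forward to a consumer, etc.

The point is that flow control now happens once per batch of packets rather than once per packet.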
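
And for completeness, the port_command/3 trickery amounts to doing by hand what prim_inet:send/3 does internally, except without waiting for the reply inline, so a single process can have several sends in flight. It leans on the undocumented {inet_reply, Socket, Status} protocol of the inet driver, so take it as a sketch only, not a recommendation:

%% Sketch only: relies on undocumented inet driver behaviour.
async_send(Socket, Data) ->
    %% Queue Data on the port and return immediately. The driver later
    %% sends us an {inet_reply, Socket, Status} message for each call.
    true = erlang:port_command(Socket, Data, []),
    ok.

collect_replies(_Socket, 0) ->
    ok;
collect_replies(Socket, Outstanding) when Outstanding > 0 ->
    receive
        {inet_reply, Socket, ok} ->
            collect_replies(Socket, Outstanding - 1);
        {inet_reply, Socket, Error} ->
            Error
    end.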
If we change send_n to contain bigger terms[0]:

send_n(_, 0) -> ok;
send_n(Socket, N) ->
    N1 = {N, N},
    N2 = {N1, N1},
    N3 = {N2, N2},
    N4 = {N3, N3},
    N5 = {N4, N4},
    N6 = {N5, N5},
    N7 = {N6, N6},
    gen_tcp:send(Socket, term_to_binary(N7)),
    send_n(Socket, N-1).

then we hit 394 megabit on this machine. Furthermore, we can't even max out the two CPU cores anymore; they run at only 50% utilization. So now we are hitting OS bottlenecks instead, which you would have to tune for separately. In this example we avoid the synchronicity of the gen_tcp:send/2 path because each call hands more work to the underlying system. You can probably go faster still, but then you need to tune the TCP socket options, which are not set up for gigabit-speed operation by default.

To figure all this out, I just ran

eprof:profile(fun() -> tcp_test:run_tcp(2000, 4, 1000*10) end).
eprof:log("foo.txt").
eprof:analyze().

and went to work by analyzing the profile output.
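
For reference, the socket options I am talking about above are the inet buffer and watermark options, plus delay_send. A sketch of a connect call tuned for throughput rather than latency; the function name and the sizes are made up for illustration, not recommendations:

connect_for_throughput(Host, Port) ->
    gen_tcp:connect(Host, Port,
                    [binary,
                     {delay_send, true},            %% let the driver coalesce small writes
                     {sndbuf, 256 * 1024},          %% kernel-level send buffer
                     {recbuf, 256 * 1024},          %% kernel-level receive buffer
                     {high_watermark, 128 * 1024},  %% port queue size at which senders get suspended
                     {low_watermark, 64 * 1024}]).  %% queue size at which they resume

As always, measure before and after; bigger buffers only move the bottleneck around.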
Docker or not, I believe there are other factors at play here...

[0] A nice little persistence trick is used here: building a tree of exponential size in linear time.

-- 
J.