[erlang-questions] TCP buffering

Thu Dec 4 22:31:03 CET 2014

Hi List,

I need some TCP help and advice on how to manage buffer sizes from the
gen_tcp API.

We have a system made up of 4 basic node types, let's call them A, B, C & DB
(all running R15B), each of which can have multiple instances. We also have
a communications protocol that runs over TCP links between the different
node types and works fine on the connections between A & B and B & C, but
on the connections between B & DB we've been getting some strange behaviour.

DB is a node that basically just runs mnesia and is the data store for the
system, if that's relevant, and connections to it also work fine for a few
days after it restarts. But after a few days we seem to get "chokes" in the
TCP communications at very regular 7 minute intervals. The rest of the VM
stays working but messages in the TCP link take up to 8 seconds to reach
their destination, causing timeouts on the higher level protocol.
These "chokes" are regular across peak & quiet times and cause a similar
proportion of timeouts regardless of the traffic level. (Traffic consists
of simple non-blocking requests and responses.)

I've been investigating and have become focussed on the TCP buffer sizing,
though I've no concrete evidence that this is actually the problem and my
TCP knowledge before this investigation was more or less restricted to
what's exposed through gen_tcp. So please advise if you think there may be
another source.

What I've found is that on initial connection both sndbuf & recbuf are set
to 10MB, and after a few days when we see these problems TCP has resized
them down to 49KB. On the other links where there are no problems the
buffers still have their original sizes. But for some reason inet:setopts
won't resize these 49KB buffers in the live site the way it will in my test
environment.
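
To make the resize attempt concrete, here is a minimal sketch of requesting new sizes and reading back what the socket actually reports. It assumes Socket is an open gen_tcp socket; the module and function names (tcp_buf_resize, try_resize/2) are hypothetical and only inet:getopts/2 and inet:setopts/2 are real OTP calls. The OS is free to clamp or adjust the requested value, so the reported sizes need not equal Size.

```erlang
-module(tcp_buf_resize).
-export([try_resize/2]).

%% Ask for new sndbuf/recbuf sizes and report what the socket says
%% before and after, so a refused or clamped resize is visible.
try_resize(Socket, Size) ->
    {ok, Previous} = inet:getopts(Socket, [sndbuf, recbuf]),
    %% setopts returns ok | {error, posix()}; keep it rather than crash.
    Result = inet:setopts(Socket, [{sndbuf, Size}, {recbuf, Size}]),
    {ok, Current} = inet:getopts(Socket, [sndbuf, recbuf]),
    [{requested, Size},
     {setopts, Result},
     {previous, Previous},
     {current, Current}].
```

Calling tcp_buf_resize:try_resize(Socket, 10485760) on the affected link and comparing previous with current would show whether the request was ignored or reduced.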

And just now I've discovered the separate buffer parameter, which I didn't
know about before. According to the OTP docs this one should be larger than
the larger of sndbuf & recbuf, but on my problematic link I have these values:
[{buffer,1460},{sndbuf,49152},{recbuf,49640}].
In my "good" links this is set to 10MB, just like sndbuf & recbuf, even
though we didn't explicitly set it.
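
As an illustration of the relationship described in the docs, the following sketch reads the three values and raises the user-level buffer to max(sndbuf, recbuf) when it is smaller. It assumes Socket is an open gen_tcp socket; the module and function names (tcp_buf_check, report/1, raise_user_buffer/1) are hypothetical. Whether this has any effect on the "chokes" is not established here.

```erlang
-module(tcp_buf_check).
-export([report/1, raise_user_buffer/1]).

%% Return the current buffer, sndbuf and recbuf values for a socket.
report(Socket) ->
    {ok, Opts} = inet:getopts(Socket, [buffer, sndbuf, recbuf]),
    Opts.

%% Raise the user-level buffer to at least max(sndbuf, recbuf).
%% Leaves the socket untouched when the buffer is already big enough.
raise_user_buffer(Socket) ->
    Opts = report(Socket),
    Buffer = proplists:get_value(buffer, Opts),
    Wanted = max(proplists:get_value(sndbuf, Opts),
                 proplists:get_value(recbuf, Opts)),
    case Buffer >= Wanted of
        true ->
            {unchanged, Buffer};
        false ->
            case inet:setopts(Socket, [{buffer, Wanted}]) of
                ok -> {raised, Buffer, Wanted};
                {error, Reason} -> {error, Reason}
            end
    end.
```

With the values quoted above (buffer 1460, sndbuf 49152, recbuf 49640), Wanted would be 49640 and the function would attempt to raise buffer to that size.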

So my questions are:
- What governs this TCP resizing? I know it's in the protocol, but what
traffic patterns might cause this?
- How can I resize my buffers once I'm in this state?
- Are the buffer sizes the likely cause of the "chokes" I'm observing?

Thanks in advance!
//Sean.