error: Removing (timedout) connection

Wed Jan 12 08:15:03 CET 2011

Update:

I know that a big part of my problem was running that map-reduce query on
every node at the same time.  That's an expensive operation, especially
since I'm using the filesystem to back my luwak cluster.  With that in mind,
I went ahead and migrated to riak-search and wrote a custom extractor/schema
to index my luwak files.  The system, at least for the last 4+ hours, has
been much more stable than it was previously.  In fact, I've only
seen 4 of the timed-out connection errors thus far.  Furthermore, I believe
I've pinpointed the cause of these errors.  All 4 of them occurred in a
particularly nasty piece of code where I call ets:tab2file, then read the
file into a binary, then zlib:gzip it and then finally send the compressed
binary to luwak via the native Erlang client.
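
For reference, the dump/read/compress part of that pipeline can be sketched
roughly as below.  This is a minimal sketch, not the actual application code:
the module name ets_snapshot, the function snapshot/3 and the SendFun
argument are hypothetical, and the luwak call itself is left to the caller
because its exact API is not shown in this thread.

```erlang
-module(ets_snapshot).
-export([snapshot/2, snapshot/3]).

%% Dump an ETS table to Path, read the file back into a binary,
%% and return the gzip-compressed binary.
snapshot(Tab, Path) ->
    ok = ets:tab2file(Tab, Path),
    {ok, Bin} = file:read_file(Path),
    zlib:gzip(Bin).

%% Same as snapshot/2, then hand the compressed binary to SendFun.
%% SendFun is a hypothetical fun of arity 1 supplied by the caller
%% (e.g. a wrapper around whatever call stores the data in luwak).
snapshot(Tab, Path, SendFun) when is_function(SendFun, 1) ->
    Compressed = snapshot(Tab, Path),
    SendFun(Compressed).
```

Note that the whole table is held in memory at least twice here (the raw
file contents and the compressed binary) before anything is sent.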

I'm wondering about that last part, sending it via Erlang external format as
a binary.  Even compressed, these binaries can be as large as 120M!  Would
this be a potential problem, possibly delaying the net kernel ticks between
nodes and causing my timeouts?  I imagine using the riak protocol buffers
interface might be a better choice?
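
One way to test the large-message theory, independent of which client
interface is used, would be to send the data in smaller pieces and see
whether the timeouts go away.  The following is only a sketch of the
splitting step: the module name bin_chunks and the function chunks/2 are
hypothetical, and whether chunking actually helps here is unverified.

```erlang
-module(bin_chunks).
-export([chunks/2]).

%% Split Bin into a list of binaries of at most Size bytes each,
%% preserving order.  An empty binary yields an empty list.
chunks(Bin, Size) when is_binary(Bin), is_integer(Size), Size > 0 ->
    chunks(Bin, Size, []).

chunks(<<>>, _Size, Acc) ->
    lists:reverse(Acc);
chunks(Bin, Size, Acc) when byte_size(Bin) =< Size ->
    %% Last (possibly short) chunk.
    lists:reverse([Bin | Acc]);
chunks(Bin, Size, Acc) ->
    <<Chunk:Size/binary, Rest/binary>> = Bin,
    chunks(Rest, Size, [Chunk | Acc]).
```

Each chunk could then be sent as its own message or write, so that no
single transfer occupies the connection for as long as a 120M binary would.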

Thanks,

-Ryan

On Mon, Jan 10, 2011 at 2:20 PM, Ryan Zezeski <rzezeski@REDACTED> wrote:

> Hi guys/gals,
>
> Recently I've been converting my non-distributed Erlang app into a
> distributed one and I ran into some troubles.  If you want to skip straight
> to the question it's at the end, but I try to give some insight into what
> I'm doing below.
>
> First off, I attached a PDF (sorry, PDF was not my choice) which contains a
> diagram I drew of the current setup.  I apologize for my utter failure as an
> artist.  In this diagram you'll see 3 vertical partitions representing 3
> different machines and a horizontal one representing the fact that each
> machine has 2 Erlang nodes on it.  3 of the Erlang nodes form a riak
> cluster.  The other 3 are the application (or should I say release) I wrote,
> and to distribute my app I utilized riak's underlying technology, riak_core
> (I use it as an easy way to persist cluster membership and use the ring
> metadata to store some data).  These six nodes are fully connected, i.e.
> each node has a connection to every other node.
>
> Occasionally, I've noticed the following message on any one of the six
> nodes:
>
> =ERROR REPORT==== ...
> ** Node <node> not responding **
> ** Removing (timedout) connection **
>
> Furthermore, using net_kernel:monitor_nodes(true, [nodedown_reason]) I've
> noticed messages like the following:
>
> {nodedown, <node>, [{nodedown_reason, connection_closed}]}
>
>
> You'll notice there is a system process running on machine A, and it makes
> a gen_server:cast to three processes to do some work, and these processes
> each call link (L).  Each of these three (gen_server) processes makes a call
> (at roughly the same time) to the riak cluster performing the _same exact_
> map/reduce job.  Sometimes I'll see errors where this map/reduce job times
> out on one of the nodes.  So at lunch, I wondered, is it because there is
> just too much communication going on between the nodes that the kernel ticks
> are getting lost or delayed?  I wondered if each node was using the same TCP
> connection to talk to every other node.  That could explain my symptoms,
> right?  A few netcats later and I realized that it's a dedicated connection
> for each node, so that theory was blown.  However, I still think that many messages
> being passed back and forth could be the cause of the problem, and I
> wondered if it blocks the VM in some way so that the kernel tick can't get
> through?
>
>
> Q: Can a chatty cluster cause the kernel ticks to be lost/delayed thus
> causing nodes to disconnect from each other?
>
> Thanks,
>
> -Ryan
>