error: Removing (timedout) connection
Ryan Zezeski
rzezeski@REDACTED
Wed Jan 12 08:15:03 CET 2011
Update:
I know that a big part of my problem was running that map-reduce query on
every node at the same time. That's an expensive operation, especially
since I'm using the filesystem to back my luwak cluster. With that in mind,
I went ahead and migrated to riak-search and wrote a custom extractor/schema
to index my luwak files. The system, at least for the last 4+ hours, has
been much more stable than it was previously. In fact, I've only seen 4 of
the timed-out connection errors so far. Furthermore, I believe
I've pinpointed the cause of these errors. All 4 of them occurred in a
particularly nasty piece of code where I call ets:tab2file, read the
resulting file into a binary, compress it with zlib:gzip, and finally send
the compressed binary to luwak via the native Erlang client.
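For reference, a minimal sketch of that pipeline. This assumes a local riak client from riak:local_client/0 and the luwak_file/luwak_io API; the table name, scratch path, and file name are hypothetical.

```erlang
%% Dump an ETS table, gzip it, and write it into luwak as one file.
%% Hedged sketch: luwak_file:create/3 and luwak_io:put_range/4 are
%% assumed from the luwak API; names below are placeholders.
dump_table(Tab) ->
    {ok, C} = riak:local_client(),
    Path = "/tmp/table.ets",
    ok = ets:tab2file(Tab, Path),
    {ok, Bin} = file:read_file(Path),
    Gz = zlib:gzip(Bin),                     %% compress before shipping
    {ok, File} = luwak_file:create(C, <<"table.ets.gz">>, dict:new()),
    {ok, _Written, _File2} = luwak_io:put_range(C, File, 0, Gz),
    ok.
```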
I'm wondering about that last step: sending the data as a single binary in
the Erlang external term format. Even compressed, these binaries can be as
large as 120MB! Could that be a problem, possibly delaying the net kernel
ticks between nodes and causing my timeouts? I imagine the riak protocol
buffers interface might be a better choice?
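One point in favor of that idea: the protocol buffers client talks to riak over its own TCP connection, outside the distribution link, so a large transfer there can't sit in front of the tick on the distribution channel. A hedged sketch, assuming the riak-erlang-client (riakc) is in the code path and storing the compressed blob as a plain riak object; host, port, bucket, and key are hypothetical:

```erlang
%% Store an already-gzipped binary through the PB interface instead of
%% the native (distribution) client. Assumes riakc_pb_socket/riakc_obj
%% from the riak-erlang-client are available.
store_compressed(Gz) when is_binary(Gz) ->
    {ok, Pid} = riakc_pb_socket:start_link("127.0.0.1", 8087),
    Obj = riakc_obj:new(<<"dumps">>, <<"table.ets.gz">>, Gz,
                        "application/gzip"),
    ok = riakc_pb_socket:put(Pid, Obj),
    riakc_pb_socket:stop(Pid).
```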
Thanks,
-Ryan
On Mon, Jan 10, 2011 at 2:20 PM, Ryan Zezeski <rzezeski@REDACTED> wrote:
> Hi guys/gals,
>
> Recently I've been converting my non-distributed Erlang app into a
> distributed one and I ran into some troubles. If you want to skip straight
> to the question it's at the end, but I try to give some insight into what
> I'm doing below.
>
> First off, I attached a PDF (sorry, PDF was not my choice) which contains a
> diagram I drew of the current setup. I apologize for my utter failure as an
> artist. In this diagram you'll see 3 vertical partitions representing 3
> different machines and a horizontal one representing the fact that each
> machine has 2 Erlang nodes on it. 3 of the Erlang nodes form a riak
> cluster. The other 3 are the application (or should I say release) I wrote,
> and to distribute my app I utilized riak's underlying technology, riak_core
> (I use it as an easy way to persist cluster membership and use the ring
> metadata to store some data). These six nodes are fully connected, i.e.
> each node has a connection to every other node.
>
> Occasionally, I've noticed the following message on any one of the six
> nodes:
>
> =ERROR REPORT==== ...
> ** Node <node> not responding **
> ** Removing (timedout) connection **
>
> Furthermore, using net_kernel:monitor_nodes(true, [nodedown_reason]) I've
> noticed messages like the following:
>
> {nodedown, <node>, [{nodedown_reason, connection_closed}]}
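> A minimal way to reproduce that subscription, assuming it runs inside a
> long-lived process that owns the monitor:
>
> ```erlang
> %% Subscribe to node up/down events, with the reason attached to
> %% nodedown messages, then log whatever arrives.
> ok = net_kernel:monitor_nodes(true, [nodedown_reason]),
> receive
>     {nodedown, Node, Info} ->
>         error_logger:info_msg("~p down: ~p~n", [Node, Info]);
>     {nodeup, Node, Info} ->
>         error_logger:info_msg("~p up: ~p~n", [Node, Info])
> end.
> ```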
>
>
> You'll notice there is a system process running on machine A, and it makes
> a gen_server:cast to three processes to do some work, and these processes
> each call link (L). Each of these three (gen_server) processes makes a call
> (at roughly the same time) to the riak cluster performing the _same exact_
> map/reduce job. Sometimes I'll see errors where this map/reduce job times
> out on one of the nodes. So at lunch, I wondered, is it because there is
> just too much communication going on between the nodes that the kernel ticks
> are getting lost or delayed? I wondered if each node was using the same TCP
> connection to talk to every other node. That could explain my symptoms,
> right? A few netcats later, I realized that there is a dedicated connection
> between each pair of nodes, so that theory was blown. However, I still think
> that the sheer volume of messages being passed back and forth could be the
> cause of the problem, and I wonder whether it blocks the VM in some way so
> that the kernel tick can't get through?
>
>
> Q: Can a chatty cluster cause the kernel ticks to be lost/delayed thus
> causing nodes to disconnect from each other?
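> For what it's worth, the tick tolerance itself is tunable: net_ticktime
> defaults to 60 seconds and should match across the cluster. A hedged
> sketch of raising it on every connected node (the new value is also
> settable at boot via the kernel application environment):
>
> ```erlang
> %% Ask each node, including this one, to migrate to a 120s tick time.
> %% net_kernel:set_net_ticktime/1 transitions the value gradually.
> [rpc:call(N, net_kernel, set_net_ticktime, [120])
>  || N <- [node() | nodes()]].
> ```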
>
> Thanks,
>
> -Ryan
>
More information about the erlang-questions mailing list