[erlang-questions] Re: error: Removing (timedout) connection

Wed Jan 12 08:56:07 CET 2011

Ryan, see the thread "node to node message passing" from September last 
year.

Morten.

On 1/12/11 8:15 AM, Ryan Zezeski wrote:
> Update:
>
> I know that a big part of my problem was running that map-reduce query on
> every node at the same time.  That's an expensive operation, especially
> since I'm using the filesystem to back my luwak cluster.  With that in mind,
> I went ahead and migrated to riak-search and wrote a custom extractor/schema
> to index my luwak files.  The system, at least for the last 4+ hours, has
> shown much more stability than what it had previously.  In fact, I've only
> seen 4 of the timed-out connection errors thus far.  Furthermore, I believe
> I've pinpointed the cause of these errors.  All 4 of them occurred in a
> particularly nasty piece of code where I call ets:tab2file, then read the
> file into a binary, then zlib:gzip it and then finally send the compressed
> binary to luwak via the native Erlang client.
>
> I'm wondering about that last part, sending it via Erlang external format as
> a binary.  Even compressed, these binaries can be as large as 120M!  Would
> this be a potential problem, possibly delaying the net kernel ticks between
> nodes and causing my timeouts?  I imagine using the riak protocol buffers
> interface might be a better choice?
>
> Thanks,
>
> -Ryan
>
> On Mon, Jan 10, 2011 at 2:20 PM, Ryan Zezeski <rzezeski@REDACTED> wrote:
>
>> Hi guys/gals,
>>
>> Recently I've been converting my non-distributed Erlang app into a
>> distributed one and I ran into some trouble.  If you want to skip straight
>> to the question it's at the end, but I try to give some insight into what
>> I'm doing below.
>>
>> First off, I attached a PDF (sorry, PDF was not my choice) which contains a
>> diagram I drew of the current setup.  I apologize for my utter failure as an
>> artist.  In this diagram you'll see 3 vertical partitions representing 3
>> different machines and a horizontal one representing the fact that each
>> machine has 2 Erlang nodes on it.  3 of the Erlang nodes form a riak
>> cluster.  The other 3 are the application (or should I say release) I wrote,
>> and to distribute my app I utilized riak's underlying technology, riak_core
>> (I use it as an easy way to persist cluster membership and use the ring
>> metadata to store some data).  These six nodes are fully connected, i.e.
>> each node has a connection to every other node.
>>
>> Occasionally, I've noticed the following message on any one of the six
>> nodes:
>>
>> =ERROR REPORT==== ...
>> ** Node <node> not responding **
>> ** Removing (timedout) connection **
>>
>> Furthermore, using net_kernel:monitor_nodes(true, [nodedown_reason]) I've
>> noticed messages like the following:
>>
>> {nodedown, <node>, [{nodedown_reason, connection_closed}]}
>>
>>
>> You'll notice there is a system process running on machine A, and it makes
>> a gen_server:cast to three processes to do some work, and these processes
>> each call link (L).  Each of these three (gen_server) processes makes a call
>> (at roughly the same time) to the riak cluster performing the _same exact_
>> map/reduce job.  Sometimes I'll see errors where this map/reduce job times
>> out on one of the nodes.  So at lunch, I wondered, is it because there is
>> just too much communication going on between the nodes that the kernel ticks
>> are getting lost or delayed?  I wondered if each node was using the same TCP
>> connection to talk to every other node.  That could explain my symptoms,
>> right?  A few netcats later and I realized that it's a dedicated conn for
>> each node, so that theory was blown.  However, I still think that many msgs
>> being passed back and forth could be the cause of the problem, and I
>> wondered if it blocks the VM in some way so that the kernel tick can't get
>> through?
>>
>>
>> Q: Can a chatty cluster cause the kernel ticks to be lost/delayed thus
>> causing nodes to disconnect from each other?
>>
>> Thanks,
>>
>> -Ryan
>>
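
For anyone chasing the same symptom, the net_kernel:monitor_nodes/2 call mentioned in the original message can be wrapped in a small logging process. What follows is only a minimal sketch, not code from the thread: the module name nodewatch and the functions start/0 and reason/1 are made up for illustration. It subscribes with the nodedown_reason option, logs each nodedown event with its reason, and logs the node's current net_ticktime next to it so disconnects can be lined up against the tick interval.

```erlang
-module(nodewatch).                 % hypothetical module name
-export([start/0, reason/1]).

%% Spawn a process that subscribes to node status changes, asking for
%% the reason to be included in nodedown messages.
start() ->
    spawn(fun() ->
        ok = net_kernel:monitor_nodes(true, [nodedown_reason]),
        loop()
    end).

%% With an option list, status messages arrive as 3-tuples:
%% {nodeup, Node, InfoList} and {nodedown, Node, InfoList}.
loop() ->
    receive
        {nodedown, _Node, _Info} = Msg ->
            {Node, Reason} = reason(Msg),
            error_logger:warning_msg(
              "~p down: ~p (net_ticktime: ~p)~n",
              [Node, Reason, net_kernel:get_net_ticktime()]),
            loop();
        {nodeup, Node, _Info} ->
            error_logger:info_msg("~p up~n", [Node]),
            loop()
    end.

%% Extract {Node, Reason} from a nodedown message; the reason falls back
%% to 'unknown' when the info list carries no nodedown_reason entry.
reason({nodedown, Node, Info}) ->
    {Node, proplists:get_value(nodedown_reason, Info, unknown)}.
```

Calling nodewatch:start() once on each node is enough to collect the events. If the reasons do point at missed ticks, the interval (60 seconds by default) can be raised with the kernel parameter net_ticktime, for example "erl -kernel net_ticktime 120", using the same value on every node. That only gives the link more slack; it does not make very large messages any cheaper to send.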