[erlang-questions] Re: error: Removing (timedout) connection

Ryan Zezeski <>
Thu Jan 13 16:51:46 CET 2011


http://www.erlang.org/cgi-bin/ezmlm-cgi?4:mss:53287:201009:fheblnkablgeikajmkoa

I feel dumb now, because I made a comment on that thread :)

Thanks Morten!

-Ryan

On Wed, Jan 12, 2011 at 2:56 AM, Morten Krogh <> wrote:

> Ryan, see the thread "node to node message passing" from September last
> year.
>
> Morten.
>
>
> On 1/12/11 8:15 AM, Ryan Zezeski wrote:
>
>> Update:
>>
>> I know that a big part of my problem was running that map-reduce query on
>> every node at the same time.  That's an expensive operation, especially
>> since I'm using the filesystem to back my luwak cluster.  With that in
>> mind, I went ahead and migrated to riak-search and wrote a custom
>> extractor/schema to index my luwak files.  The system, at least for the
>> last 4+ hours, has shown much more stability than it had previously.  In
>> fact, I've only seen 4 of the timed-out connection errors thus far.
>> Furthermore, I believe I've pinpointed the cause of these errors.  All 4
>> of them occurred in a particularly nasty piece of code where I call
>> ets:tab2file, then read the file into a binary, then zlib:gzip it, and
>> finally send the compressed binary to luwak via the native Erlang client.
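That pipeline can be sketched roughly as follows; ets:tab2file/2, file:read_file/1 and zlib:gzip/1 are the stdlib calls named above, while store_in_luwak/2 is a placeholder for whatever client call actually ships the binary:

```erlang
%% Sketch of the dump -> compress -> store pipeline described above.
%% store_in_luwak/2 is a placeholder, not a real luwak API call.
dump_and_store(Table, Path) ->
    ok = ets:tab2file(Table, Path),        % dump the ETS table to disk
    {ok, Raw} = file:read_file(Path),      % read it back as one binary
    Compressed = zlib:gzip(Raw),           % may still be ~120M compressed
    store_in_luwak(filename:basename(Path), Compressed).
```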
>>
>> I'm wondering about that last part, sending it via Erlang external format
>> as a binary.  Even compressed, these binaries can be as large as 120M!
>> Would this be a potential problem, possibly delaying the net kernel ticks
>> between nodes and causing my timeouts?  I imagine using the riak protocol
>> buffers interface might be a better choice?
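If a single huge distribution message is indeed what delays the ticks, one workaround is to stream the binary in pieces so that no single send monopolizes the connection for long.  A sketch only; send_chunk/2 and the 1 MB chunk size are made up for illustration:

```erlang
%% Send a large binary in fixed-size chunks rather than one message.
%% send_chunk/2 is a placeholder for however the receiver accepts data.
send_in_chunks(Dest, Bin) ->
    ChunkSize = 1024 * 1024,  %% 1 MB per message (arbitrary choice)
    case Bin of
        <<Chunk:ChunkSize/binary, Rest/binary>> ->
            send_chunk(Dest, Chunk),
            send_in_chunks(Dest, Rest);
        _ ->
            send_chunk(Dest, Bin)   %% final, short chunk
    end.
```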
>>
>> Thanks,
>>
>> -Ryan
>>
>> On Mon, Jan 10, 2011 at 2:20 PM, Ryan Zezeski<>  wrote:
>>
>>  Hi guys/gals,
>>>
>>> Recently I've been converting my non-distributed Erlang app into a
>>> distributed one and I ran into some trouble.  If you want to skip
>>> straight to the question, it's at the end, but I try to give some
>>> insight into what I'm doing below.
>>>
>>> First off, I attached a PDF (sorry, PDF was not my choice) which contains
>>> a diagram I drew of the current setup.  I apologize for my utter failure
>>> as an artist.  In this diagram you'll see 3 vertical partitions
>>> representing 3 different machines and a horizontal one representing the
>>> fact that each machine has 2 Erlang nodes on it.  3 of the Erlang nodes
>>> form a riak cluster.  The other 3 are the application (or should I say
>>> release) I wrote, and to distribute my app I utilized riak's underlying
>>> technology, riak_core (I use it as an easy way to persist cluster
>>> membership and to store some data in the ring metadata).  These six
>>> nodes are fully connected, i.e. each node has a connection to every
>>> other node.
>>>
>>> Occasionally, I've noticed the following message on any one of the six
>>> nodes:
>>>
>>> =ERROR REPORT==== ...
>>> ** Node<node>  not responding **
>>> ** Removing (timedout) connection **
>>>
>>> Furthermore, using net_kernel:monitor_nodes(true, [nodedown_reason]) I've
>>> noticed messages like the following:
>>>
>>> {nodedown, <node>, [{nodedown_reason, connection_closed}]}
>>>
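For reference, a minimal process that subscribes to those events and logs them might look like this (a sketch; with an option list, net_kernel:monitor_nodes/2 delivers three-element nodeup/nodedown tuples):

```erlang
%% Minimal monitor that prints node up/down events with reasons.
start_node_monitor() ->
    spawn(fun() ->
        ok = net_kernel:monitor_nodes(true, [nodedown_reason]),
        monitor_loop()
    end).

monitor_loop() ->
    receive
        {nodedown, Node, Info} ->
            Reason = proplists:get_value(nodedown_reason, Info),
            io:format("nodedown ~p: ~p~n", [Node, Reason]),
            monitor_loop();
        {nodeup, Node, _Info} ->
            io:format("nodeup ~p~n", [Node]),
            monitor_loop()
    end.
```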
>>>
>>> You'll notice there is a system process running on machine A, and it
>>> makes a gen_server:cast to three processes to do some work, and these
>>> processes each call link (L).  Each of these three (gen_server)
>>> processes makes a call (at roughly the same time) to the riak cluster
>>> performing the _exact same_ map/reduce job.  Sometimes I'll see errors
>>> where this map/reduce job times out on one of the nodes.  So at lunch,
>>> I wondered: is it because there is just too much communication going on
>>> between the nodes that the kernel ticks are getting lost or delayed?  I
>>> wondered if each node was using the same TCP connection to talk to every
>>> other node.  That could explain my symptoms, right?  A few netcats later
>>> and I realized that it's a dedicated connection for each node, so that
>>> theory was blown.  However, I still think that many messages being
>>> passed back and forth could be the cause of the problem, and I wondered
>>> if it blocks the VM in some way so that the kernel tick can't get
>>> through?
>>>
>>>
>>> Q: Can a chatty cluster cause the kernel ticks to be lost/delayed thus
>>> causing nodes to disconnect from each other?
>>>
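One mitigating knob worth knowing about here: the tick interval is tunable.  The kernel parameter net_ticktime defaults to 60 seconds (a node is considered down if nothing is heard from it within that window, with ticks sent every quarter of it), and raising it on all connected nodes makes the cluster more tolerant of delayed ticks.  A sys.config fragment, assuming the default proves too tight:

```erlang
%% sys.config -- all connected nodes should agree on this value.
%% Ticks are sent every net_ticktime/4 seconds; a silent peer is
%% declared down after net_ticktime seconds.
[{kernel, [{net_ticktime, 120}]}].
```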
>>> Thanks,
>>>
>>> -Ryan
>>>
>>>
>
> ________________________________________________________________
> erlang-questions (at) erlang.org mailing list.
> See http://www.erlang.org/faq.html
> To unsubscribe; mailto:
>
>

