<div>In an application that I manage, we are having trouble discovering all of the other Erlang nodes in a cluster. When the application starts, it immediately calls net_adm:ping/1 against a known node in order to discover the rest of the cluster. Although that initial ping succeeds, the other nodes are not "discovered" by the new node until upwards of 10 minutes have passed. By "discovered" I mean that, until then, nodes() returns only the node that was pinged. For example, if nodes A and B are already running and we start node C, which pings node B, it takes a very long time for node C to discover node A.</div>
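<div><br></div><div>As a rough sketch of the sequence described above, this is what it looks like from the Erlang shell on the newly started node (node C in the example). The node name 'b@host2' is a placeholder for the known node B, not one of our real node names:</div>
<pre><code>%% Run on the newly started node (node C).
%% 'b@host2' is a hypothetical name standing in for the known node B.
net_adm:ping('b@host2').

%% List the visible nodes this node is currently connected to.
%% In the situation described, only the pinged node shows up here
%% for a long time after startup.
nodes().
</code></pre>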
<div><br></div><div>Some debugging notes:</div><div>When the application is first started on all of our nodes, this is not a problem and the nodes discover each other quickly. It only happens after the application has been running for a while.</div>
<div>All other communication between nodes seems reasonably fast.</div><div>We monitor our applications with <a href="https://github.com/lethain/nagios_erlang">https://github.com/lethain/nagios_erlang</a>, an Erlang plugin for Nagios. It simply starts an Erlang node, pings all of our nodes to ensure they are up and running, and then shuts down. These test nodes end up in the known nodes list but are almost never in the connected nodes list.</div>
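<div><br></div><div>A minimal sketch of how the two lists can be compared from the Erlang shell, assuming the "known" and "connected" lists mentioned above correspond to nodes(known) and nodes(connected):</div>
<pre><code>%% Nodes this node currently holds a connection to (visible and hidden).
Connected = nodes(connected).

%% Node names this node knows about, whether or not it is connected to them.
Known = nodes(known).

%% Known but not currently connected. The local node is removed as well,
%% in case it is included in the known list.
Known -- [node() | Connected].
</code></pre>
<div>If the short-lived monitoring nodes are what is left over in the last expression, that would match the observation above, though it does not by itself show that they are related to the slow discovery.</div>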
<div><br></div><div>Some information about the environment:</div><div>Erlang release: R14B03</div><div>Number of nodes: ~40</div><div>OS: CentOS</div>