[erlang-questions] Slow node replication

John VanderPol <>
Thu Sep 20 17:01:37 CEST 2012


In an application that I manage, we are having trouble discovering all of
the other Erlang nodes in a cluster.  When our application starts, it
immediately calls net_adm:ping/1 on a known node in order to discover the
rest of the cluster.  Although the initial ping succeeds, the new node does
not "discover" the other nodes until upwards of 10 minutes have gone by.
By "discovered" I mean appearing in the result of nodes(): until then,
nodes() returns only the node that was pinged.  For example, if nodes A and
B are running and we start node C, which pings node B, it takes a long time
before node C discovers node A.
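
A minimal sketch of the startup step described above, with a helper that
measures how long a given node takes to show up in nodes().  The module
name, function names, and polling interval are hypothetical and only for
illustration; this is not the application's actual code.

```erlang
-module(discovery_sketch).
-export([join/1, wait_for/3]).

%% Ping a known seed node (node B in the example above) and return
%% whatever nodes() reports immediately afterwards.
join(Seed) ->
    case net_adm:ping(Seed) of
        pong -> {ok, nodes()};
        pang -> {error, {unreachable, Seed}}
    end.

%% Poll nodes() every IntervalMs milliseconds until Node appears or
%% roughly TimeoutMs milliseconds have been spent sleeping.
%% Returns {ok, ElapsedMs} or timeout.
wait_for(Node, IntervalMs, TimeoutMs) ->
    wait_for(Node, IntervalMs, TimeoutMs, 0).

wait_for(Node, IntervalMs, TimeoutMs, Elapsed) ->
    case lists:member(Node, nodes()) of
        true ->
            {ok, Elapsed};
        false when Elapsed >= TimeoutMs ->
            timeout;
        false ->
            timer:sleep(IntervalMs),
            wait_for(Node, IntervalMs, TimeoutMs, Elapsed + IntervalMs)
    end.
```

On node C one would call discovery_sketch:join(B) and then
discovery_sketch:wait_for(A, 1000, 900000) to record how long it takes for
node A to appear (A and B standing for the real node names).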

Some debugging notes:
When the application is first started on all of our nodes, this is not a
problem and the nodes discover each other quickly.  It only happens after
the application has been running for a while.
All other node communication seems to be reasonably fast.
We monitor our applications with
https://github.com/lethain/nagios_erlang which is an Erlang plugin for
Nagios.  It simply starts up an Erlang node, pings all of our nodes to
ensure they are up and running, and then shuts down.  These test nodes end
up in the known nodes list but are almost never in the connected nodes list.
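
For reference, the known and connected lists mentioned above can be
compared with nodes(known) and nodes(connected).  A small sketch, assuming
those are the lists in question; the module and function names are
hypothetical.

```erlang
-module(node_lists_sketch).
-export([known_only/0]).

%% Nodes that this node knows by name but currently has no connection to,
%% e.g. short-lived monitoring nodes that have already shut down.
known_only() ->
    Connected = nodes(connected),
    [N || N <- nodes(known),
          N =/= node(),
          not lists:member(N, Connected)].
```

Calling node_lists_sketch:known_only() on a long-running node shows which
names remain known without a live connection.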

Some information about the environment:
Erlang release: R14B03
Number of nodes: ~40
OS: CentOS