<div dir="ltr"><div class="gmail_quote"><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr">We have a set of designated erlang nodes that every other node routinely pings and the hope is that they'll all discover the rest of each other through them.</div></blockquote><div><br></div><div>Neighbour discovery only happens when a connection to a node is being established. If I understand correctly, you also ping repeatedly to restore broken connections, but that won't work. E.g. if nodes A and B are connected, and a freshly started node C pings B, it will learn about A and connect to it, forming a full mesh. However, if the connection between A and C later breaks down while B remains connected to both A and C, pinging B from either A or C won't help restore the broken connection: both are already connected to B, so the ping establishes no new connection and triggers no discovery.</div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr">Restarting this subset didn't really help.</div></blockquote><div>This is strange: restarting a node implies that it makes a new connection to a central node, and from there it should be able to discover and connect to the entire network. Are you sure there is nothing like a firewall on the network that would prevent island nodes from making outgoing connections to the non-central nodes, but would let all nodes make outgoing connections to the island nodes?</div><div><br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr">We also ran into problems with global registration calls that were stuck on those island nodes but we think that's more of a symptom than a cause as we ran partitioned.</div></blockquote><div>If a global call is stuck, it suggests that the global_name_server processes know about all the nodes but can't communicate with them. 
Or maybe the global_name_server processes ran into some kind of deadlock, waiting for each other and thus not handling requests? That could even explain why restarting the nodes one by one didn't help (neighbour discovery is the responsibility of the global_name_servers).</div><div><br></div><div>I'd suggest finding out what the global_name_servers are up to if this problem occurs again. You can check with process_info whether they are idling (the current function being gen_server:loop/7) or not, then query the state of idle processes with sys:get_state/1 and the entire call stack of busy processes with process_info(whereis(global_name_server), backtrace). Inspecting the global_* ETS tables may also help in understanding the situation.</div><div><br></div><div>Cheers,</div><div>Daniel</div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
</blockquote></div></div>
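<div dir="ltr">P.S. The checks suggested above could look roughly like this in the Erlang shell of an affected node. This is only a sketch: the 5000 ms timeout and the set of process_info items are my own choices, the arity of the gen_server loop function may differ between OTP releases, and the filter on ETS table names simply assumes the relevant tables are named tables starting with "global_".</div>

```erlang
%% Run these one by one in the shell of a node that appears stuck.
Pid = whereis(global_name_server).

%% An idle gen_server normally sits in gen_server:loop (arity depends on the OTP release).
%% A growing message queue hints that requests are not being handled.
process_info(Pid, [current_function, status, message_queue_len]).

%% If the process is idle: fetch its gen_server state.
%% The explicit timeout (an assumption, 5 s) avoids hanging on a blocked process.
sys:get_state(global_name_server, 5000).

%% If the process is busy: print its full call stack.
{backtrace, Bin} = process_info(Pid, backtrace).
io:format("~s~n", [Bin]).

%% List the named ETS tables whose name starts with "global_" ...
Tabs = [T || T <- ets:all(), is_atom(T),
             lists:prefix("global_", atom_to_list(T))].

%% ... and dump their contents for comparison across nodes.
[{T, ets:tab2list(T)} || T <- Tabs].

%% For context: which nodes this node currently believes it is connected to.
nodes().
```

<div dir="ltr">Note that sys:get_state/2 exits with a timeout if the process is busy and cannot answer, so it is only worth calling once process_info shows the process idling. Running the same commands on a healthy node and on an island node and comparing the results is probably the quickest way to spot what differs.</div>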