<div dir="ltr">You can create a process that calls erlang:monitor_node(Node, true) for every node expected to be in the cluster. Whenever that process receives a {nodedown, Node} message, it can call erlang:monitor_node(Node, true) again for the failed node after a cooldown period. Because monitor_node attempts to connect to the node if it is not already connected (and delivers another nodedown message if the attempt fails), this should be enough to restore failed connections.<div><br></div><div>Cheers,</div><div>Daniel</div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Tue, 30 Nov 2021 at 06:57, saket chaudhary &lt;<a href="mailto:saketcmf@gmail.com">saketcmf@gmail.com</a>&gt; wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr">There are no firewalls to speak of. Things work as is all the time, except when we hear of network activity such as router or switch upgrades in parts of the network that we have no control over. But our app needs to be resilient to that. Things also work when the entire cluster gets restarted.<div><br></div><div>What must be done to make sure we have a fully formed mesh that can withstand temporary disruptions and eventually heal itself? Should I write something that ensures every node pings every other node in the statically configured cluster?</div></div>

</blockquote></div>
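<div dir="ltr"><div>For reference, the monitor_node approach described at the top of this message could be sketched roughly as follows. This is only a minimal sketch, not a tested implementation: the module name mesh_keeper, the {retry, Node} message, and the 5 second default cooldown are all made up for illustration, and it assumes the local node is already running in distributed mode (otherwise monitor_node fails with notalive).</div><div><br></div><pre>
-module(mesh_keeper).
-export([start/1, start/2, init/2]).

%% Hypothetical default cooldown; tune it for your network.
-define(DEFAULT_COOLDOWN_MS, 5000).

%% Nodes is the statically configured list of cluster nodes.
start(Nodes) ->
    start(Nodes, ?DEFAULT_COOLDOWN_MS).

start(Nodes, CooldownMs) ->
    spawn(?MODULE, init, [Nodes, CooldownMs]).

init(Nodes, CooldownMs) ->
    %% monitor_node/2 tries to connect to a node that is not yet
    %% connected, and delivers {nodedown, Node} if that fails.
    Others = Nodes -- [node()],
    lists:foreach(fun(N) -> erlang:monitor_node(N, true) end, Others),
    loop(CooldownMs).

loop(CooldownMs) ->
    receive
        {nodedown, Node} ->
            %% Do not retry immediately; schedule a retry after the cooldown.
            erlang:send_after(CooldownMs, self(), {retry, Node}),
            loop(CooldownMs);
        {retry, Node} ->
            %% Re-monitoring triggers a new connection attempt. If it
            %% fails, another nodedown arrives and the cycle repeats.
            erlang:monitor_node(Node, true),
            loop(CooldownMs)
    end.
</pre><div><br></div><div>It would be started once on each node with the full node list, for example mesh_keeper:start(['a@host1', 'b@host2']) (placeholder node names). In a real application you would probably want this under a supervisor rather than a bare spawn.</div></div>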