[erlang-questions] Automatically reconnecting nodes when they come back online

Fri Apr 26 19:00:35 CEST 2013

To all who know more about this than I do:

First, I'm just beginning to learn about Erlang/OTP so I figured I'd
use it to implement something useful.

Part of what I'd like to build will involve a "conductor" controller
node that directs some other "player" nodes to all do something at
approximately the same time. The ultimate goal is to test the
operation of another piece of distributed software.  As part of those
operations, I expect the player nodes may sometimes crash (actually
cause a Windows BSOD in some cases) and then eventually come back to
life.

What I'm wondering about is what some folks have found to be good ways
of getting nodes to rejoin the cluster when they come back to life.
The way I'm thinking about it now, the player nodes will be
passive in the sense that they won't actively connect to any other
nodes - they'll only get connected when the conductor node invites
them in.  I'm also not looking for fault tolerance on the conductor
node at this point; if that one fails badly I'll just get some coffee
and rerun the scenario.

My first two thoughts were:
1.  When the conductor node connects up the player nodes it would also
spawn a process whose sole job is to periodically ping the other nodes
to ensure they're connected.  Then when one goes down, those pings
will just fail during that time but when the node comes back a ping
will reconnect it to the other nodes.  All this time, I'd be
monitoring the node up/down messages.
2.  I'd start by monitoring all the nodes as the conductor connects
them and, on receiving a node down message, spawn a process whose job
it is to periodically ping only that node until it comes back.
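
To make the second idea concrete, here is a rough sketch of what I have
in mind. It is only an illustration: the module name, start/1, the
message shapes and the 5 second retry interval are all made up, and it
uses one process with one timer per missing node instead of spawning a
separate pinger per node. The only library calls it relies on are
net_kernel:monitor_nodes/1, net_adm:ping/1 and erlang:send_after/3.

```erlang
-module(conductor_reconnect).
-export([start/1]).

%% Hypothetical retry interval, chosen only for illustration.
-define(RETRY_MS, 5000).

%% Players is a list of node names, e.g. ['player1@host1', 'player2@host2'].
start(Players) ->
    spawn(fun() -> init(Players) end).

init(Players) ->
    %% Ask for {nodeup, Node} and {nodedown, Node} messages.
    net_kernel:monitor_nodes(true),
    %% Try each player once; unreachable ones get a retry scheduled.
    lists:foreach(fun try_connect/1, Players),
    loop(Players).

loop(Players) ->
    receive
        {nodedown, Node} ->
            %% Only chase nodes the conductor invited in.
            case lists:member(Node, Players) of
                true  -> schedule_retry(Node);
                false -> ok
            end,
            loop(Players);
        {retry, Node} ->
            try_connect(Node),
            loop(Players);
        {nodeup, _Node} ->
            %% Nothing to do here; a real conductor might resume the scenario.
            loop(Players);
        stop ->
            ok
    end.

%% net_adm:ping/1 returns pong if the node is (now) connected, pang otherwise.
try_connect(Node) ->
    case net_adm:ping(Node) of
        pong -> ok;
        pang -> schedule_retry(Node)
    end.

%% Send ourselves a {retry, Node} message after ?RETRY_MS milliseconds.
schedule_retry(Node) ->
    erlang:send_after(?RETRY_MS, self(), {retry, Node}),
    ok.
```

One thing I am unsure about with this shape: a ping to a dead host can
block for a while, so with many players a single process may fall
behind, which might be an argument for the one-pinger-per-node version.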

Are there some good practices out there for systems that want to
behave like this?

Thanks in advance,

/stt