[erlang-questions] Controlled interaction of two erlang distributed networks
Jayson Vantuyl
kagato@REDACTED
Wed Aug 26 03:28:45 CEST 2009
You could use global_group to split each site into its own group and
then make a node list of just the managers (which you already said you
don't want to do).
I think what you want is to roll you're own gossiped mesh like this.
Once you "introduce" a node to any other node in the mesh, it all just
syncs up eventually. This is a bit of work, but it really pays off.
Version 1:
Create an Mnesia table that holds tuples of the form
{manager_node,Name,Clock,IpAddr}. Name is an atom that identifies a
node (used to detect when they switch IP), and it needs to be unique
per node. IpAddr is exactly what it looks like. Clock is a value
that only increases and is used to detect old entries. This table
should start out empty, although it should persist between restarts.
Create a process that periodically gets the current IpAddress of the
node and updates the local table with it. It should bump up the clock
each time it changes the record. One way to do this is to start it
out and zero and increment. Another is to just use the current
machine timestamp. I'd recommend incrementing, as time-sync problems
are a pain to track down.
Create another process that periodically contacts all of the nodes and
distributes the table over some well known port, probably using UDP.
You don't have to reliably deliver it. If you're pedantic about
security, you might use TCP + SSL. Formatting this request is not
hard. Just use term_to_binary and binary_to_term.
Create another process that listens on some well known port for the
updates, and updates node entries. It should overwrite an entry with
the same Name, but only if the Clock is higher.
Now just create a utility that will "introduce" two nodes by putting
an entry for the other Node in its table. For this version, just
start with 0. After a little gossiping, they should know about each
other. The persistence of the table should allow changes of IP
address to be easily tolerated.
Version 2:
The previous version has a few problems. Most importantly, if you're
using term_to_binary, it's possible for people to leak atoms. Also,
you probably didn't encrypt it or put any sort of secret-key
encryption. Add this to make it secure.
Also, if you are starting with 0 on an "introduction", now is the time
to enhance that. If you lose the node and want to deploy a new one,
you can't just introduce it with 0, as the clock will already be
higher. Extend the "introduction" to ask the remote end for the clock
value it last saw for Name. This allows you to avoid that particular
pitfall.
Similarly, if you need to remove a node, use a special atom 'dead' in
the place of IpAddr. You might add this to the utility as well. This
should only be needed when a Name is permanently gone, and it mostly
exists to prevent gossips to an address that will never need them and
to prevent the node from being gossiped back into existence after you
delete its entry.
Version 3:
Now add functionality to gossip the processes you want to monitor.
NOTE: Just because a process isn't communicating, doesn't mean its
not there. Be very careful with what you are trying to do (assuming
you're doing automated failover, not just alerting). It's been well
established for decades that a system can only sport two out of the
three big qualities: "highly available", "tolerant of network
partitions", and "data is consistent". If you think you've made
something that does it, some very nasty mathematics will eventually
bite you in the rear.
You might ask why I wouldn't just monkey with Distributed Erlang to
separate the globals into groups (see the global_group module). The
problem is that Distributed Erlang still just isn't built for this
sort of use-case. In Distributed Erlang getting the SSL stuff working
right is hard, firewalls suck, NAT sucks even more, and the proper DNS
setup is more trouble than maintaining a node-list ever was (and
impossible to quickly fix, if you use high enough TTLs).
On Aug 25, 2009, at 5:17 PM, Richard Andrews wrote:
> I have a distributed system that needs to run on two geographically
> isolated sites. Each site has a central manager node.
>
> I want a way to have the manager nodes from each site use erlang
> monitoring of each other and some specific processes only on the
> manager nodes; but I want to avoid node lists and global registrations
> expanding such that the nodes on different sites try and become
> connected together.
>
> How can I achieve this?
>
> --
> Rich
>
> ________________________________________________________________
> erlang-questions mailing list. See http://www.erlang.org/faq.html
> erlang-questions (at) erlang.org
>
--
Jayson Vantuyl
kagato@REDACTED
More information about the erlang-questions
mailing list