handling partitioned networks
Ulf Wiger
etxuwig@REDACTED
Wed Sep 27 09:29:40 CEST 2000
I read Sean Hinde's EUC2000 report -- good stuff!
There was a passage in there about the problems with partitioned
networks. I agree, this is a very tough nut to crack.
I will describe some of the things we've done at AXD 301 to
address this problem:
1. We have a fully redundant system with N mated pairs at the
Erlang level.
2. The first mated pair, termed the base pair, runs the O&M
functionality; if both of these nodes crash, the system is
considered to be down; these nodes also have the mnesia
schema on disk.
3. All other nodes have a ram copy of the schema (using mnesia's
'extra_db_nodes' variable); if they lose contact with both
base nodes, they will restart.
4. We have implemented a patch to net_kernel, which is supported
by OTP (in R5B and R7B as I understand it):
"-kernel dist_auto_connect once" will allow nodes to automatically
connect only one time (happens when the "second" node starts up),
but as soon as communication fails, one of the nodes will have
to restart for communication to be re-established (there is a
possibility to explicitly connect as well, but we don't use that)
5. (4) is combined with a "backdoor" system, where a process on
each node periodically sends a UDP "alive" message to all
other (statically known) nodes; upon receipt of an "alive"
message from a node which is not in the nodes() list, one can
conclude that the network has been partitioned. Through the
same UDP connection, the nodes can negotiate who should restart.
6. Mnesia has a "master nodes" concept, where one can specify a
set of nodes from which the tables should unconditionally be
loaded. When a node restarts to resolve inconsistency, it will
set master nodes to the other nodes known to be good at the time.
7. There is a possibility of table load deadlock, where two nodes
cannot decide who has the most recent copies. To detect this,
we have a process calling mnesia:wait_for_tables/2 early in
the startup phase. The table wait processes on each node
send messages to each other upon each wait_for_tables() timeout,
performing a WFG analysis to determine whether nodes are
waiting for each other. This is not air tight (I think), because
nodes can go down or come up late during the table load phase
and mess things up, but I think we cover most possible events.
Actually, I think most of this could be implemented in a fairly
generic way. The part that needs to be customized for a particular
system is mainly the logic deciding who should restart to resolve a
partitioned network situation.
/Uffe
--
Ulf Wiger tfn: +46 8 719 81 95
Network Architecture & Product Strategies mob: +46 70 519 81 95
Ericsson Telecom AB, Datacom Networks and IP Services
Varuvägen 9, Älvsjö, S-126 25 Stockholm, Sweden
More information about the erlang-questions
mailing list