handling partitioned networks

Wed Sep 27 09:29:40 CEST 2000

I read Sean Hinde's EUC2000 report -- good stuff!

There was a passage in there about the problems with partitioned
networks. I agree, this is a very tough nut to crack.

I will describe some of the things we've done at AXD 301 to 
address this problem:

1. We have a fully redundant system with N mated pairs at the 
   Erlang level.

2. The first mated pair, termed the base pair, runs the O&M
   functionality; if both of these nodes crash, the system is 
   considered to be down; these nodes also have the mnesia 
   schema on disk.

3. All other nodes have a ram copy of the schema (using mnesia's
   'extra_db_nodes' variable); if they lose contact with both
   base nodes, they will restart.

4. We have implemented a patch to net_kernel, which is supported
   by OTP (in R5B and R7B as I understand it): 
   "-kernel dist_auto_connect once" will allow nodes to automatically
   connect only one time (happens when the "second" node starts up),
   but as soon as communication fails, one of the nodes will have 
   to restart for communication to be re-established (there is a
   possibility to explicitly connect as well, but we don't use that)

5. (4) is combined with a "backdoor" system, where a process on 
   each node periodically sends a UDP "alive" message to all
   other (statically known) nodes; upon receipt of an "alive"
   message from a node which is not in the nodes() list, one can
   conclude that the network has been partitioned. Through the
   same UDP connection, the nodes can negotiate who should restart.

6. Mnesia has a "master nodes" concept, where one can specify a 
   set of nodes from which the tables should unconditionally be
   loaded. When a node restarts to resolve inconsistency, it will
   set master nodes to the other nodes known to be good at the time.

7. There is a possibility of table load deadlock, where two nodes
   cannot decide who has the most recent copies. To detect this,
   we have a process calling mnesia:wait_for_tables/2 early in 
   the startup phase. The table wait processes on each node
   send messages to each other upon each wait_for_tables() timeout,
   performing a WFG analysis to determine whether nodes are 
   waiting for each other. This is not air tight (I think), because
   nodes can go down or come up late during the table load phase
   and mess things up, but I think we cover most possible events.

Actually, I think most of this could be implemented in a fairly
generic way. The part that needs to be customized for a particular
system is mainly the logic deciding who should restart to resolve a
partitioned network situation.

/Uffe
-- 
Ulf Wiger                                    tfn: +46  8 719 81 95
Network Architecture & Product Strategies    mob: +46 70 519 81 95
Ericsson Telecom AB,              Datacom Networks and IP Services
Varuvägen 9, Älvsjö,                    S-126 25 Stockholm, Sweden