handling partitioned networks

Wed Sep 27 12:47:34 CEST 2000

Uffe,

Thanks for the info - it is a tough nut, no doubt about it.

My system uses all disc-based tables, so I'm not sure a scheme like the one
you use in AXD301 will quite fit.

In Eddie there is a mechanism which tries to work out which islands of nodes
have been partitioned, and then calls set_master_nodes and restarts some of
the smaller groups. Again, this doesn't seem appropriate for my system, which
has pairs of redundant nodes holding persistent data.

In my case I shouldn't have lost too much data during a partitioned network
outage, so better tools to manually bring the whole thing back up in parallel
without getting into deadlocks would be very useful.

For example, set_master_nodes only seems to kick in if the nodes still detect
that they are partitioned. It would be useful to be able to force a
table/schema load from another node regardless of the perceived status.
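
A minimal sketch of the kind of manual reload meant here, assuming the
local node keeps its schema on disc and that the caller already knows which
nodes to trust. The module name master_reload, the function reload_from/2
and the 60 second timeout are hypothetical; only the mnesia calls themselves
are real. Whether this overrides the perceived status in every case is
exactly the open question, so treat it as a starting point, not a fix.

```erlang
-module(master_reload).
-export([reload_from/2]).

%% Hypothetical helper: restart the local mnesia so that tables are
%% loaded from TrustedNodes instead of by the normal load algorithm.
%% Tabs is the caller's own list of tables to wait for.
reload_from(TrustedNodes, Tabs) ->
    %% Returns 'stopped' when mnesia was running; ignored here.
    _ = mnesia:stop(),
    case mnesia:set_master_nodes(TrustedNodes) of
        ok ->
            ok = mnesia:start(),
            %% Returns ok, {timeout, BadTabs} or {error, Reason}.
            mnesia:wait_for_tables(Tabs, 60000);
        {error, Reason} ->
            {error, Reason}
    end.
```

The master node setting is local to the node and stays in force across
restarts. As I read the documentation, calling mnesia:set_master_nodes([])
afterwards returns the node to the normal load mechanism.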

I have also seen research on schemes where logs are kept during partitioned
network outages, and on startup the nodes negotiate conflicts and work
themselves back into a consistent state, but this is pretty mind-bending
stuff.

I'll think some more about each of your mechanisms.

Thanks and Rgds,
Sean
> -----Original Message-----
> From: Ulf Wiger [mailto:etxuwig@REDACTED]
> Sent: 27 September 2000 08:30
> To: erlang-questions@REDACTED
> Subject: handling partitioned networks
> 
> 
> 
> I read Sean Hinde's EUC2000 report -- good stuff!
> 
> There was a passage in there about the problems with partitioned
> networks. I agree, this is a very tough nut to crack.
> 
> I will describe some of the things we've done at AXD 301 to 
> address this problem:
> 
> 1. We have a fully redundant system with N mated pairs at the 
>    Erlang level.
> 
> 2. The first mated pair, termed the base pair, runs the O&M
>    functionality; if both of these nodes crash, the system is 
>    considered to be down; these nodes also have the mnesia 
>    schema on disk.
> 
> 3. All other nodes have a ram copy of the schema (using mnesia's
>    'extra_db_nodes' variable); if they lose contact with both
>    base nodes, they will restart.
> 
> 4. We have implemented a patch to net_kernel, which is supported
>    by OTP (in R5B and R7B as I understand it): 
>    "-kernel dist_auto_connect once" will allow nodes to automatically
>    connect only one time (happens when the "second" node starts up),
>    but as soon as communication fails, one of the nodes will have 
>    to restart for communication to be re-established. (It is also
>    possible to connect explicitly, but we don't use that.)
> 
> 5. (4) is combined with a "backdoor" system, where a process on 
>    each node periodically sends a UDP "alive" message to all
>    other (statically known) nodes; upon receipt of an "alive"
>    message from a node which is not in the nodes() list, one can
>    conclude that the network has been partitioned. Through the
>    same UDP connection, the nodes can negotiate who should restart.
> 
> 6. Mnesia has a "master nodes" concept, where one can specify a 
>    set of nodes from which the tables should unconditionally be
>    loaded. When a node restarts to resolve inconsistency, it will
>    set master nodes to the other nodes known to be good at the time.
> 
> 7. There is a possibility of table load deadlock, where two nodes
>    cannot decide who has the most recent copies. To detect this,
>    we have a process calling mnesia:wait_for_tables/2 early in 
>    the startup phase. The table wait processes on each node
>    send messages to each other upon each wait_for_tables/2 timeout,
>    performing a wait-for-graph (WFG) analysis to determine whether
>    nodes are waiting for each other. This is not airtight (I think),
>    because nodes can go down or come up late during the table load
>    phase and mess things up, but I think we cover most possible events.
> 
> 
> Actually, I think most of this could be implemented in a fairly
> generic way. The part that needs to be customized for a particular
> system is mainly the logic deciding who should restart to resolve a
> partitioned network situation.
> 
> /Uffe
> -- 
> Ulf Wiger                                    tfn: +46  8 719 81 95
> Network Architecture & Product Strategies    mob: +46 70 519 81 95
> Ericsson Telecom AB,              Datacom Networks and IP Services
> Varuvägen 9, Älvsjö,                    S-126 25 Stockholm, Sweden
> 
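
The checks described in items 5 and 7 of the quoted message can be sketched
as two pure functions. This is only an illustration under stated
assumptions, not the AXD 301 code: the module name partition_sketch, both
function names and the shape of the input lists are hypothetical. The UDP
transport, the message exchange between the table wait processes and the
negotiation about who should restart are all left out.

```erlang
-module(partition_sketch).
-export([unexpected_senders/2, waits_form_cycle/1]).

%% Item 5: AliveFrom lists the nodes whose "alive" message arrived
%% recently; Connected is the result of nodes(). A sender that is not
%% connected suggests that the network has been partitioned.
unexpected_senders(AliveFrom, Connected) ->
    [N || N <- lists:usort(AliveFrom), not lists:member(N, Connected)].

%% Item 7: Waits is a list of {Node, NodesItWaitsFor} pairs gathered
%% after a wait_for_tables timeout. Returns true if some node is
%% waiting, directly or through other nodes, for itself.
waits_form_cycle(Waits) ->
    lists:any(fun({Node, _}) -> reaches(Node, Node, Waits, []) end, Waits).

%% Depth-first search along wait edges: is Target reachable from From?
%% Seen holds the nodes on the current path so the search terminates.
reaches(From, Target, Waits, Seen0) ->
    Seen = [From | Seen0],
    Next = proplists:get_value(From, Waits, []),
    lists:member(Target, Next) orelse
        lists:any(fun(N) ->
                      (not lists:member(N, Seen)) andalso
                          reaches(N, Target, Waits, Seen)
                  end, Next).
```

For example, partition_sketch:unexpected_senders([b, c, b], [b]) returns
[c], and partition_sketch:waits_form_cycle([{a, [b]}, {b, [a]}]) returns
true, while a chain such as [{a, [b]}, {b, [c]}] returns false. The path
based search is simple but not efficient, which should be acceptable for
the small number of nodes involved here.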
