Soft upgrade vs. failover to a secondary system and back

Sun Jan 23 21:00:18 CET 2005

Reto Kramer writes:

 > Context: Imagine a system that requires two nodes for fault tolerance. 
 > Each node must be able to take over the other node's traffic (and state 
 > if protocols are stateful) at any one point to handle the fault of one 
 > of them.

 > For such architectures, system upgrade can be performed by artificially 
 > evacuating a node, restarting it (VM process) with the new version of 
 > the code and rebalancing the traffic. This works really well if the 
 > protocols used to talk to these nodes support some form of redirection 
 > (either in the sender process, or in an intermediary such as a load 
 > balancer for http traffic).

 > Q: When does it not work well?

A1) When you only have one node.

A2) When the state is long-lived and difficult, or impossible, to 
    transfer from one node to another.

HTTP is pretty much the opposite of A2. In many telco applications, A2
describes the situation perfectly. On one telco voice application I
worked on, a typical upgrade/patch meant:

  1. Block new calls to the node.

  2. Wait until all calls end (i.e. people finish talking). 

  3. Do the upgrade.

  4. Unblock the node.

Waiting for everyone to finish talking can take a long time. There are
two ways to reduce the wait: first, upgrade in the middle of the
night. Second, once you've waited (say) an hour, there'll just be a
handful of callers left, so you could just disconnect them and let the
helpdesk handle the complaints.
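
The procedure above, including the one-hour cutoff, can be sketched
roughly as follows. This is a toy simulation in Python, not code from
that system: Node, drain_and_upgrade and the one-tick-per-minute clock
are all invented for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical stand-in for one call-handling node."""
    version: str
    accepting: bool = True
    # call id -> remaining call duration, in minutes
    calls: dict = field(default_factory=dict)

    def tick(self, minutes=1):
        """Advance time; calls whose duration has run out end normally."""
        self.calls = {cid: left - minutes
                      for cid, left in self.calls.items()
                      if left - minutes > 0}


def drain_and_upgrade(node, new_version, max_wait_minutes=60):
    """Block, wait (up to a deadline), upgrade, unblock.

    Returns the ids of the calls that had to be cut off.
    """
    node.accepting = False                  # 1. block new calls
    waited = 0
    while node.calls and waited < max_wait_minutes:
        node.tick()                         # 2. wait for calls to end
        waited += 1
    dropped = sorted(node.calls)            # stragglers past the deadline
    node.calls.clear()
    node.version = new_version              # 3. do the upgrade
    node.accepting = True                   # 4. unblock
    return dropped


node = Node(version="r1", calls={"a": 3, "b": 45, "c": 200})
dropped = drain_and_upgrade(node, "r2")
print(dropped, node.version, node.accepting)
```

With the made-up call lengths above, calls "a" and "b" end on their own
within the hour and only "c" is cut off. Setting max_wait_minutes high
enough gives the pure wait-for-everyone variant.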

On that system, we could also insert "small" patches by loading new
code into Erlang. That eliminated all the waiting and sprinting down
the corridor to escape enraged helpdesk people. 
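
For readers who haven't seen hot code loading, here is a rough analogy
in Python. It is not Erlang and not how the Erlang VM does it (there,
the VM replaces a whole module); CallHandler, load_patch and the two
rate functions are made up. The only point is that the code is replaced
while the state of the calls in progress stays where it is.

```python
class CallHandler:
    """Hypothetical long-running handler that keeps per-call state."""

    def __init__(self, rate_fn):
        self.rate_fn = rate_fn      # the "code" that can be swapped
        self.minutes = {}           # call id -> minutes so far (the state)

    def on_minute(self, call_id):
        self.minutes[call_id] = self.minutes.get(call_id, 0) + 1

    def charge(self, call_id):
        # The current code is looked up on every invocation, so a patch
        # takes effect on the very next call.
        return self.rate_fn(self.minutes[call_id])

    def load_patch(self, new_rate_fn):
        """Swap in new code; calls in progress are untouched."""
        self.rate_fn = new_rate_fn


def rate_v1(minutes):
    return minutes * 10             # buggy: forgets the setup fee


def rate_v2(minutes):
    return 5 + minutes * 10         # patched version


handler = CallHandler(rate_v1)
for _ in range(3):
    handler.on_minute("call-1")

before = handler.charge("call-1")   # computed with the old code
handler.load_patch(rate_v2)         # nobody gets disconnected
after = handler.charge("call-1")    # same call, same state, new code
print(before, after)
```

The call that was three minutes old before the patch is still three
minutes old afterwards; only the function applied to it has changed.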

---

Hot code loading isn't as general as 'evacuate-upgrade-restart' with
isolated, duplicated hardware. You can't upgrade the OS or VM by
reloading code. But in many systems it is simpler. In such systems you
handle most upgrades without any downtime and then accept a couple of
minutes per year of _planned_ downtime to upgrade the OS. In return,
you get a simpler (== less unplanned downtime) and cheaper system. 
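
To put "a couple of minutes per year" in perspective, here is the
arithmetic as a small Python sketch. The figure of 2 minutes is an
assumption standing in for "a couple", and the 99.999% target is only a
commonly quoted reference point, not a number from the system described
above.

```python
MINUTES_PER_YEAR = 365 * 24 * 60    # 525600, ignoring leap years


def availability(downtime_minutes_per_year):
    """Fraction of the year the system is up."""
    return 1 - downtime_minutes_per_year / MINUTES_PER_YEAR


def downtime_budget(target_availability):
    """Minutes of downtime per year allowed by an availability target."""
    return (1 - target_availability) * MINUTES_PER_YEAR


planned = 2                         # assumed value for "a couple of minutes"
print(f"availability with {planned} min/year down: {availability(planned):.4%}")
print(f"budget at 99.999%: {downtime_budget(0.99999):.2f} min/year")
```

Two minutes of planned downtime is roughly 99.9996% availability, which
still fits inside the 5.26 minutes per year that a 99.999% target allows,
provided the unplanned downtime stays small.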

Matt