soft-upgrade vs failover and back to/from 2nd-ary system

Sat Jan 22 18:29:18 CET 2005

On 2005-01-22 07:08:43, Reto Kramer <kramer@REDACTED> wrote:

> Context: Imagine a system that requires two nodes for fault tolerance.  
> Each node must be able to take over the other node's traffic (and state  
> if protocols are stateful) at any one point to handle the fault of one  
> of them.
>
> For such architectures, system upgrade can be performed by artificially  
> evacuating a node, restarting it (VM process) with the new version of  
> the code and rebalancing the traffic. This works really well if the  
> protocols used to talk to these nodes support some form of redirection  
> (either in the sender process, or in an intermediary such as a load  
> balancer for http traffic).
>
> Q: When does it not work well?

There are indeed good reasons to always upgrade a redundant system
using the redundancy mechanisms - esp. since that mechanism sometimes
is the only reasonable option.

For systems that have no redundancy, soft upgrade is a better option
than designing for redundancy anyway and then, e.g., starting a second
node just to do a redundancy upgrade. One could of course argue that
if the system has no redundancy, then downtime during upgrade must
be acceptable.

> Q: Are there guidelines as to when I should rather invest in writing  
> soft-upgradable code when I can get away with the above brute force  
> approach to system upgrade?

For debugging and patching, soft upgrade is superb. You can fairly
easily write code that is soft-upgradeable in Erlang/OTP, and using
it, you can swiftly load instrumented code or correct minor software
bugs without the users even noticing.
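
To make that concrete, here is a minimal sketch of a soft-upgradeable
gen_server. The module name (counter), its API and its state layout are
made up for illustration; the only assumption about the "old" version
is that it kept a bare integer as its state. The point is the
code_change/3 callback, which converts the old state to the new
representation while the process keeps running.

```erlang
-module(counter).
-behaviour(gen_server).
-vsn(2).

-export([start_link/0, bump/0]).
-export([init/1, handle_call/3, handle_cast/2, code_change/3]).

%% New state layout (hypothetical). Version 1 is assumed to have used
%% a bare integer as its state.
-record(state, {count = 0, bumps = 0}).

start_link() ->
    gen_server:start_link({local, ?MODULE}, ?MODULE, [], []).

bump() ->
    gen_server:call(?MODULE, bump).

init([]) ->
    {ok, #state{}}.

handle_call(bump, _From, S = #state{count = C, bumps = B}) ->
    {reply, C + 1, S#state{count = C + 1, bumps = B + 1}}.

handle_cast(_Msg, State) ->
    {noreply, State}.

%% Invoked during a release upgrade, or by hand via sys:change_code/4.
%% Converts the assumed version 1 state (an integer) to the record.
code_change(_OldVsn, Count, _Extra) when is_integer(Count) ->
    {ok, #state{count = Count, bumps = 0}};
code_change(_OldVsn, State, _Extra) ->
    {ok, State}.
```

With the new beam file in the code path, something like
sys:suspend(counter), code:purge(counter), code:load_file(counter),
sys:change_code(counter, counter, 1, []) and sys:resume(counter) from
the shell would drive the switch by hand. In a real release you would
normally let the release handler do this from an appup/relup instead.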

I've had occasions where I've developed server applications, and
had the server up and running all the time, always correcting errors
and adding new features through soft upgrade, and not restarting the
server for weeks. Very convenient, even if not perhaps strictly
necessary.

> Q: Many systems that run Erlang do indeed contain redundant CPU boards  
> (or multiple machines). Is there an easy way to characterize why the  
> brute force upgrade approach did not work in those systems (AXD 301  
> comes to mind of course) and the soft-upgrade approach had to be  
> invented?

AXD 301 supports a wide range of upgrade techniques, from soft upgrade
to system reboot with an upgraded configuration database. One reason for
this is that the AXD 301 project started at roughly the same time as
the first version of OTP was being developed. Our understanding of
software upgrade using OTP in a very large system was understandably
poor in the beginning (it had never been done before!), so we kept
inventing ways to do it, until we eventually had support for almost
all techniques you can think of. (:

Redundancy upgrade is in there somewhere between the extremes, and is
one of the more useful techniques, but soft upgrade is used quite
often, esp. for error correction packages.

> I could not find guidelines for when to use brute-force upgrade in a  
> dual node system vs soft-upgrade in the documentation or papers  
> (comparing the two in general terms, or specific examples of pro/cons) -  
> can anyone point me at material?  I fear the answer must be obvious or  
> trivial, or left to the reader ;-)   In reality I found that live system  
> upgrade is a massive headache (for successful system only ;-) and it's  
> odd that not more is written about how to architect for it from the
> beginning, what the limitations and pitfalls are with either approach  
> etc.

I don't think such documentation exists, unfortunately.
And I agree - live system upgrade _is_ a massive headache, esp. for
large systems.

Regards,
Uffe
-- 
Using Opera's revolutionary e-mail client: http://www.opera.com/m2/