soft-upgrade vs failover and back to/from 2nd-ary system

Sat Jan 22 07:08:43 CET 2005

I very much admire the soft-upgrade approach in Erlang and OTP's 
support for coordinating it, but I am puzzled why it had to be 
invented. I'm sure this is due to my ignorance of the context of the 
problem (and the protocols involved) - help me out please!

Context: Imagine a system that requires two nodes for fault tolerance. 
Each node must be able to take over the other node's traffic (and 
state, if the protocols are stateful) at any time, so that the system 
survives the failure of either one.

For such architectures, a system upgrade can be performed by 
artificially evacuating a node, restarting it (i.e. its VM process) 
with the new version of the code and rebalancing the traffic, one node 
at a time. This works really well if the protocols used to talk to 
these nodes support some form of redirection (either in the sender 
process, or in an intermediary such as a load balancer for HTTP 
traffic).
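
To make the brute-force procedure concrete, here is a minimal sketch 
in Python (not Erlang, just for illustration). All names (Node, 
evacuate, restart, rebalance, brute_force_upgrade) are made up for this 
example, and it assumes that session state can be handed over between 
nodes running different code versions:

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    version: str
    sessions: set = field(default_factory=set)
    accepting: bool = True


def evacuate(node, peer):
    """Stop routing to `node` and hand its sessions (and state) to `peer`."""
    node.accepting = False
    peer.sessions |= node.sessions
    node.sessions = set()


def restart(node, new_version):
    """Stand-in for restarting the VM process with the new code."""
    assert not node.sessions, "node must be evacuated before restart"
    node.version = new_version
    node.accepting = True


def rebalance(a, b):
    """Spread all sessions evenly over both nodes again."""
    all_sessions = sorted(a.sessions | b.sessions)
    half = len(all_sessions) // 2
    a.sessions = set(all_sessions[:half])
    b.sessions = set(all_sessions[half:])


def brute_force_upgrade(a, b, new_version):
    """Upgrade one node at a time; the peer carries all traffic meanwhile."""
    for node, peer in ((a, b), (b, a)):
        evacuate(node, peer)
        restart(node, new_version)
        rebalance(a, b)
```

Note that between the two iterations the pair runs mixed versions, so 
anything handed over has to be understood by both the old and the new 
code.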

Q: When does it not work well?

Q: Are there guidelines as to when I should invest in writing 
soft-upgradable code rather than get away with the above brute-force 
approach to system upgrade?

Q: Many systems that run Erlang do indeed contain redundant CPU boards 
(or multiple machines). Is there an easy way to characterize why the 
brute force upgrade approach did not work in those systems (AXD 301 
comes to mind of course) and the soft-upgrade approach had to be 
invented?

I could not find guidelines for when to use brute-force upgrade in a 
dual-node system vs soft-upgrade in the documentation or papers 
(comparing the two in general terms, or specific examples of pros/cons) 
- can anyone point me at material?  I fear the answer must be obvious 
or trivial, or left to the reader ;-)   In reality I have found that 
live system upgrade is a massive headache (for successful systems only 
;-) and it's odd that not more is written about how to architect for it 
from the beginning, what the limitations and pitfalls are with either 
approach, etc.

Thanks,
- Reto