soft-upgrade vs failover and back to/from 2nd-ary system
Sat Jan 22 07:08:43 CET 2005
I admire the soft-upgrade approach in Erlang and OTP's support for
coordination thereof very much, but am puzzled why it had to be
invented. It must be due to my ignorance of the context of the problem
(and the protocols involved) I'm sure - help me out please!
Context: Imagine a system that requires two nodes for fault tolerance.
Each node must be able to take over the other node's traffic (and state
if protocols are stateful) at any point to handle the failure of one
node.

For such architectures, system upgrade can be performed by artificially
evacuating a node, restarting it (VM process) with the new version of
the code and rebalancing the traffic. This works really well if the
protocols used to talk to these nodes support some form of redirection
(either in the sender process, or in an intermediary such as a load
balancer for http traffic).
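
To make the brute-force procedure concrete, here is a minimal toy model of it in Python. It is only a sketch under stated assumptions: the names Node, evacuate, restart, rebalance and rolling_upgrade are hypothetical, sessions stand in for whatever protocol state has to move, and real redirection (sender-side or via a load balancer) is reduced to an "accepting" flag.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical model of one node in a two-node system."""
    name: str
    version: str
    sessions: set = field(default_factory=set)
    accepting: bool = True


def evacuate(node: Node, peer: Node) -> None:
    """Stop sending new traffic to `node` and hand its sessions to `peer`."""
    node.accepting = False
    peer.sessions |= node.sessions
    node.sessions = set()


def restart(node: Node, new_version: str) -> None:
    """Model a VM restart: the node comes back empty, running the new code."""
    assert not node.sessions, "evacuate before restarting"
    node.version = new_version
    node.accepting = True


def rebalance(a: Node, b: Node) -> None:
    """Spread all sessions evenly across both nodes again."""
    all_sessions = sorted(a.sessions | b.sessions)
    half = len(all_sessions) // 2
    a.sessions = set(all_sessions[:half])
    b.sessions = set(all_sessions[half:])


def rolling_upgrade(a: Node, b: Node, new_version: str) -> None:
    """Upgrade one node at a time; the peer carries all traffic meanwhile."""
    for node, peer in ((a, b), (b, a)):
        evacuate(node, peer)
        restart(node, new_version)
        rebalance(a, b)
```

Note that while one node restarts, the peer carries every session alone, so in this model the system has no redundancy during the upgrade window.
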
Q: When does it not work well?
Q: Are there guidelines as to when I should invest in writing
soft-upgradable code rather than get away with the above brute-force
approach to system upgrade?
Q: Many systems that run Erlang do indeed contain redundant CPU boards
(or multiple machines). Is there an easy way to characterize why the
brute force upgrade approach did not work in those systems (AXD 301
comes to mind of course) and the soft-upgrade approach had to be
invented?

I could not find guidelines for when to use brute-force upgrade in a
dual node system vs soft-upgrade in the documentation or papers
(comparing the two in general terms, or specific examples of pro/cons)
- can anyone point me at material? I fear the answer must be obvious
or trivial, or left to the reader ;-) In practice I have found that live
system upgrade is a massive headache (for successful systems only ;-)
and it's odd that more has not been written about how to architect for it
from the beginning, and what the limitations and pitfalls of either
approach are.