soft-upgrade vs failover and back to/from 2nd-ary system

Wed Jan 26 07:19:30 CET 2005

> A1) When you only have one node
>
> A2) When the state is long lived and difficult, or impossible, to
>     transfer from one node to another.
>
> HTTP is pretty much the opposite of A2. In many telco applications, A2
> describes the situation perfectly.

Matthias, can you give me an additional clarification w.r.t. the 
telco domain. (A2) implies that if the system that owns the state 
crashes, the state is gone. I assume the telco applications you're 
referring to use a definition of availability that does not count such 
crashes as dropped calls?  I.e. are there telco applications in which 
one can lose the call signaling state and as long as the voice trunk 
remains up the call continues and is not counted as a drop?  I.e. as 
long as one is able to set up new calls (on a fresh backup system that 
needed none of the lost state transferred at all) life is good (modulo 
the lost opportunity to charge for a call)?

> On one telco voice application I
> worked on, a typical upgrade/patch meant:
>
>   1. Block new calls to the node.
>
>   2. Wait until all calls end (i.e. people finish talking).
>
>   3. Do the upgrade.
>
>   4. Unblock
>
> Waiting for everyone to finish talking can take a long time. There are
> two ways to reduce the wait: first, upgrade in the middle of the
> night. Second, once you've waited (say) an hour, there'll just be a
> handful of callers left, so you could just disconnect them and let the
> helpdesk handle the complaints.
>
> On that system, we could also insert "small" patches by loading new
> code into Erlang. That eliminated all the waiting and sprinting down
> the corridor to escape enraged helpdesk people.
>
> ---
>
> Hot code loading isn't as general as 'evacuate-upgrade-restart' with
> isolated, duplicated hardware. You can't upgrade the OS or VM by
> reloading code. But in many systems it is simpler. In such systems you
> handle most upgrades without any downtime and then accept a couple of
> minutes per year of _planned_ downtime to upgrade the OS. In return,
> you get a simpler (== less unplanned downtime) and cheaper system.
>
> Matt
>
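
To make sure I am reading the quoted four-step procedure correctly
(block, wait for calls to end, upgrade, unblock, with stragglers
disconnected after a deadline), here is a minimal sketch in Python.
It is not taken from any real system: the Node class, the
drain_and_upgrade function, the tick-based clock and the hangups
table are all hypothetical names I made up for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical stand-in for one call-handling node."""
    version: str
    accepting: bool = True
    active_calls: set = field(default_factory=set)

    def offer_call(self, call_id):
        """Admit a new call unless the node is blocked."""
        if not self.accepting:
            return False
        self.active_calls.add(call_id)
        return True

    def end_call(self, call_id):
        """Remove a call that ended normally."""
        self.active_calls.discard(call_id)


def drain_and_upgrade(node, new_version, hangups, max_wait):
    """Block, drain, upgrade, unblock.

    hangups maps a time tick to the call ids that end at that tick
    (an assumption standing in for real callers hanging up).
    Returns the calls force-disconnected when the deadline passed.
    """
    node.accepting = False                        # 1. block new calls
    tick = 0
    while node.active_calls and tick < max_wait:  # 2. wait for calls to end
        for call_id in hangups.get(tick, ()):
            node.end_call(call_id)
        tick += 1
    dropped = sorted(node.active_calls)           # stragglers past the deadline
    node.active_calls.clear()
    node.version = new_version                    # 3. do the upgrade
    node.accepting = True                         # 4. unblock
    return dropped
```

The returned list is the handful of callers the helpdesk would hear
from. Hot code loading, as described, avoids steps 1, 2 and 4
altogether, so that list would always be empty.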