soft-upgrade vs failover and back to/from 2nd-ary system

Wed Jan 26 11:33:59 CET 2005

 Matthias> > A2) When the state is long lived and difficult, or 
 Matthias> >     impossible, to transfer from one node to another.
[...]
 Matthias> > In many telco applications, A2 describes the situation perfectly.

    Reto> Matthias, can you give me an additional clarification
    Reto> w.r.t. the telco domain. (A2) implies that if the system
    Reto> that owns the state crashes, the state is gone. I assume the
    Reto> telco applications you're referring to use a definition of
    Reto> availability that does not count such crashes as dropped
    Reto> calls?  

Such events should be (and are) counted as dropped calls. But one
dropped call isn't the end of the world. It happens. That's why the
requirements specify nonzero limits on the number of dropped calls.
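
A minimal sketch of what checking such a limit might look like. The
function name and the "at most max_dropped per per_calls calls" form
of the requirement are assumptions made for illustration, not taken
from any real specification:

```python
def within_drop_limit(dropped, attempted, max_dropped, per_calls):
    """Return True if the observed drop rate stays within the allowed rate.

    Hypothetical requirement shape: at most `max_dropped` dropped calls
    per `per_calls` attempted calls. The limit is nonzero by design, so
    a single dropped call does not by itself violate it.
    """
    # Cross-multiply so the two ratios are compared with integers only:
    # dropped / attempted <= max_dropped / per_calls
    return dropped * per_calls <= max_dropped * attempted
```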

The granularity of fault recovery is a design choice you make once
you've seen the requirements. Take a voicemail system. Imagine someone
pulls out both power plugs while you're listening to one of your
messages. Some possible ways the system could appear to the
subscriber:

   1. You never notice anything, i.e. the message keeps playing
      without so much as a hiccup.

or 2. There's a slight pop in the middle of the message.

or 3. The whole system hiccups, e.g. the message starts over from the 
      start, or perhaps you go back to the menu.

or 4. The call gets dropped, i.e. you have to call voicemail again.

or 5. The call gets dropped. You try and call voicemail again but
      it's busy. You try again after five minutes and it works.

or 6. The call gets dropped and it takes several hours before voicemail
      works again.

or 7. All your messages get erased.

I think all good voicemail systems settle for #4. Trying to do better
than that introduces a lot of complexity to deal with an unlikely
event. Cheap systems do #5. #6 and #7 are unacceptable.
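
The list above can be written down as a small table in code. This is
only a sketch: the names Outcome and verdict are made up for
illustration, and the verdict strings just restate the opinion given
above, not any standard:

```python
from enum import IntEnum


class Outcome(IntEnum):
    """The seven subscriber-visible outcomes, numbered as in the list."""
    SEAMLESS = 1
    SLIGHT_POP = 2
    SYSTEM_HICCUP = 3
    CALL_DROPPED = 4
    DROPPED_THEN_BUSY = 5
    DOWN_FOR_HOURS = 6
    MESSAGES_ERASED = 7


def verdict(outcome):
    """Classify an outcome the way the paragraph above does."""
    if outcome < Outcome.CALL_DROPPED:
        # #1 to #3: better than #4, at the cost of a lot of complexity
        return "more complexity than an unlikely event warrants"
    if outcome == Outcome.CALL_DROPPED:
        return "what good systems settle for"
    if outcome == Outcome.DROPPED_THEN_BUSY:
        return "what cheap systems do"
    # #6 and #7
    return "unacceptable"
```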

Maybe I exaggerated the difference from HTTP. If I were on CNN's homepage
and the browser was in the middle of downloading the large picture on
the front page when someone pulled the power plug(s) on the CNN
webserver I happened to be using, I'd be pretty surprised if the load
balancer/failover system was smart enough to transfer the HTTP and TCP
state so that the image arrived whole anyway.

There are people who make voicemail (and IVR) systems _and_ HTTP
robustifiers on this list. Maybe they'd care to comment on what their
systems do.

    Reto> I.e. are there telco applications in which one can
    Reto> lose the call signaling state and as long as the voice
    Reto> trunk remains up the call continues and is not counted as a
    Reto> drop?  I.e. as long as one is able to set up new calls (on a
    Reto> fresh backup system that needed none of the lost state
    Reto> transferred at all) life is good (modulo the lost opportunity
    Reto> to charge for a call)?

Keeping the voice connection up when the signalling state has been
lost is bad. It leaks connection resources and leaves subscribers
stuck in broken calls. Better to keep it simple and just drop the call
that triggered the problem.
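
One way to picture "just drop the call" is a cleanup pass after the
signalling side has lost state. This is a sketch under assumptions:
VoiceSwitch and reconcile are hypothetical names, and real switching
hardware is of course not a Python set:

```python
class VoiceSwitch:
    """Hypothetical stand-in for whatever holds the voice connections."""

    def __init__(self):
        self.connections = set()

    def connect(self, call_id):
        self.connections.add(call_id)

    def release(self, call_id):
        self.connections.discard(call_id)


def reconcile(switch, signalling_state):
    """Release every voice connection that has no signalling state left.

    `signalling_state` maps call ids to whatever the signalling layer
    still knows. Connections missing from it are orphans: nothing can
    tear them down later, so they would leak if left up.
    Returns the sorted list of call ids that were dropped.
    """
    orphans = sorted(switch.connections - set(signalling_state))
    for call_id in orphans:
        switch.release(call_id)
    return orphans
```

Calls whose signalling state survived are left alone; only the ones
that lost it are dropped.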

Matthias