[erlang-questions] clueless performance question

Ulf Wiger (TN/EAB)
Thu Jun 12 10:56:01 CEST 2008


Bengt Kleberg wrote:
> Greetings,
> 
> The number I have seen for Erlang (on AXD301) is nine nines. Would not
> a "factor of 10" mean that the Perl phone switch "only" has to reach
> eight nines?

I wish people would say "better than 5 nines" rather than "9 nines".
Considering that planned downtime is included, it's still very, very
good, and we have consistent measurements from the field over several
years indicating that it's also true (I won't say how much better,
since it's not an officially disclosed figure). The system is designed
to reach 5 nines availability or better.

The reason why the "9 nines" figure came to be bandied about was that
Joe was looking for an officially approved number to use in his LL2
talk at MIT. The only one we found (that was explicitly "approved
for external communication") was a quote from British Telecom.

So while "9 nines" is true in a sense, it was for a particular
network, after a certain amount of time (in fact, the initial trial).
As such, it was a formidable success, far beyond the hopes of
anyone involved.

I'm not quite sure how one would design for 9 nines, but then,
one would also have to look closely at how downtime is actually
measured. In telecoms, one usually doesn't count disturbances
shorter than 30 seconds, which means that you can potentially
get away with quite a lot of short outages, as long as you
recover quickly enough. Obviously, with a "9 nines" target
(roughly 31.5 ms/year), one 30 second outage uses up the entire
budget for nearly the next thousand years, so in this case, it's
a useless target. It could only be interesting in cases where
failure is a fatal incident, which must never happen.
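
As a rough sketch of the arithmetic behind a "nines" budget: the
allowed downtime per year is the length of a year multiplied by the
unavailability (10^-N for N nines). The function names below are
made up for illustration, and the 365.25-day year is an assumption.

```python
# Assumption: a year of 365.25 days; a 365-day year changes the
# result only in the third significant digit.
SECONDS_PER_YEAR = 365.25 * 24 * 3600


def downtime_budget_seconds(nines: int) -> float:
    """Allowed downtime per year, in seconds, for N nines of availability."""
    unavailability = 10.0 ** (-nines)  # e.g. 5 nines -> 1e-5
    return SECONDS_PER_YEAR * unavailability


def years_consumed(outage_seconds: float, nines: int) -> float:
    """How many years of downtime budget a single outage uses up."""
    return outage_seconds / downtime_budget_seconds(nines)


if __name__ == "__main__":
    for n in (5, 9):
        print(n, "nines:", downtime_budget_seconds(n), "s/year")
    print("30 s outage at 9 nines:", years_consumed(30.0, 9), "years")
```

With these assumptions, 5 nines allows a little over five minutes
per year, while 9 nines allows only a few hundredths of a second.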

Also, through the life of a telecoms product, high availability
is not the only metric that counts. Sometimes, those pesky
customers want something entirely new, like migrating from
ATM to IP. This overrides the availability requirements,
since even the peskiest customers realize that the cost of
ensuring no downtime when replacing the entire network
would be impossibly high (impossible?).

BTW, the 9 nines case of the AXD 301 referred to one single
case of local overload - due to an extreme peak of callers
to the "Pop Idol" show, causing one control processor to
reboot unnecessarily. The standby took over, and for 15
seconds, no calls were let through on the affected lines -
the rest of the system was unaffected. According to today's
way of measuring downtime, this actually wouldn't have
counted as downtime, and the official availability would
have been 100%...
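
To illustrate how the measurement rule changes the result, here is
a minimal sketch of an availability calculation that ignores
disturbances shorter than a threshold. The function name, the
parameter names and the one-year measurement period are assumptions
for illustration; it does not weight the outage by the share of
lines affected, which a real measurement presumably would.

```python
# Assumption: a measurement period of one 365.25-day year.
SECONDS_PER_YEAR = 365.25 * 24 * 3600


def measured_availability(outages_s, period_s, min_counted_s=30.0):
    """Availability over period_s, counting only outages that last
    at least min_counted_s seconds (shorter ones are ignored)."""
    counted = sum(o for o in outages_s if o >= min_counted_s)
    return 1.0 - counted / period_s


if __name__ == "__main__":
    # A single 15-second outage, as in the overload case above.
    print(measured_availability([15.0], SECONDS_PER_YEAR))
    # The same outage when every second of disturbance is counted.
    print(measured_availability([15.0], SECONDS_PER_YEAR, min_counted_s=0.0))
```

With the 30 second threshold the 15 second outage is not counted at
all, so the measured availability comes out as exactly 1.0.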

Finally, let's remember that we cannot credit Erlang for
all the good availability of the AXD 301. For one thing,
we could replace the switch core without service
interruption (initially, it was only theoretically possible,
but didn't work in practice, since the rewiring required
the small hands of a 9 year-old, and various legal issues
prevented us from deploying such in the field for upgrades).

BR,
Ulf W


