[erlang-questions] clueless performance question

Thu Jun 12 13:59:15 CEST 2008

Christian S wrote:
> 
> If you want a very low mean-time-to-recover, then humans can't be
> involved. The system needs to fail over automatically, it needs to
> monitor itself, it needs to restart itself. If you need a human to
> recover from an error, your system is not likely a system that makes
> the system owner happy.

Indeed. There are also different types of system:

- central-office applications, which are located in a place
   which is (perhaps) manned 24/7.
- remote systems, which are installed in places that are not
   manned (one operator in Iceland used to hang AXD 301 switches
   on the inside of the door to power converter stations;
   unfortunately, they weren't even heated...)
- systems that are impossible to get to (the Mars space probe
   comes to mind)

This obviously affects Mean Time To Repair, and thus, strategies
for recovery in the system - even from serious errors. Since
the AXD 301 is a remote system, we designed it to even recover
from partitioned networks and database inconsistency
automatically (at the price of potential loss of persistent data),
and we could also handle file system inconsistency (even though
that would have to be triggered through the network management
interface).

One really beautiful aspect of Erlang, though, is the way software
can easily be made to "heal itself" from spurious errors. This
happens so quickly and unobtrusively that it can be easy to miss
in an unmanned system. We ended up adding logic to catch such
spurious restarts and report any named process that kept
restarting, but seldom enough not to trigger the restart
escalation in the supervisor. Otherwise, we might miss the pattern
even when studying the wrap logs.
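
To make the idea concrete, here is a rough sketch of that kind of
check. It is written in Python purely for illustration and is not
the AXD 301 code. The class name, the thresholds and the verdict
strings are all assumptions; max_r and max_t only mimic a
supervisor's restart intensity and period.

```python
from collections import defaultdict, deque


class RestartWatch:
    """Flags processes that restart repeatedly but too slowly to escalate."""

    def __init__(self, max_r=3, max_t=10.0,
                 report_count=5, report_window=86400.0):
        # max_r / max_t mimic a supervisor's restart intensity and period.
        # report_count / report_window are made-up thresholds for the
        # slow, recurring pattern we want to surface.
        self.max_r = max_r
        self.max_t = max_t
        self.report_count = report_count
        self.report_window = report_window
        self.restarts = defaultdict(deque)

    def record(self, name, now):
        """Record a restart of `name` at time `now` (seconds)."""
        times = self.restarts[name]
        times.append(now)
        # Drop restarts that have fallen out of the long reporting window.
        while times and now - times[0] > self.report_window:
            times.popleft()
        # Count restarts inside the short supervisor period.
        recent = sum(1 for t in times if now - t <= self.max_t)
        if recent > self.max_r:
            return "escalate"  # the supervisor would give up here anyway
        if len(times) >= self.report_count:
            return "report"    # slow pattern that escalation never sees
        return "ok"
```

With these assumed defaults, a process that restarts once an hour
never comes near the escalation limit, but the fifth restart within
a day is returned as "report" so that it can be raised as an alarm.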

BR,
Ulf W