[erlang-questions] [cyclone-l] memory safety bug!

Fri Jan 12 00:05:20 CET 2007

[cross-posted from the Cyclone mailing list to the Erlang list]

Michael Hicks wrote:
> Looks like memory safety is a concern for NASA too:
> 
> http://www.spaceref.com/news/viewnews.html?id=1185

"We think that the failure was due to a software load we sent up in June of last
 year. This software tried to synch up two flight processors. Two addresses were
 incorrect - two memory addresses were over written. As the geometry evolved, we
 drove the [solar] arrays against a hard stop and the spacecraft went into safe
 mode. The radiator for the battery pointed at the sun, the temperature went up,
 and battery failed. But this should be treated as preliminary."

The discussion below assumes that this brief description is accurate, as far as
it goes.

It sounds like memory safety would have been necessary but not sufficient to
avoid mission failure. Memory safety doesn't prevent run-time errors [*]; it only
turns them into "nicer" fault behaviour, for example an exception or a trap to an
emergency handler. So what would probably have happened in a memory-safe language
is that the spacecraft would have gone into safe mode earlier, as a result of
whatever fault caused the "two memory addresses [to be] overwritten". However, no
memory would have been corrupted as a result of this fault.
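
To make that concrete, here is a minimal Python sketch of a bad write being
turned into a trap instead of silent corruption. Everything in it
(CheckedMemory, MemoryFault, sync_processors, the addresses) is hypothetical
and for illustration only; it says nothing about how the actual flight
software was structured.

```python
class MemoryFault(Exception):
    """Raised when a write targets an address outside the allocated region."""


class CheckedMemory:
    """Hypothetical bounds-checked memory region (illustration only)."""

    def __init__(self, size):
        self.cells = [0] * size

    def write(self, addr, value):
        # Explicit check: Python lists accept negative indexes, which would
        # otherwise silently write to the wrong cell.
        if not 0 <= addr < len(self.cells):
            raise MemoryFault(f"write to invalid address {addr}")
        self.cells[addr] = value


def sync_processors(memory, addresses, value):
    """Stand-in for the faulty software load: writes value to each address."""
    for addr in addresses:
        memory.write(addr, value)


def run_with_safe_mode(memory, addresses, value):
    """Run the load; on a memory fault, fall back to safe mode."""
    try:
        sync_processors(memory, addresses, value)
        return "nominal"
    except MemoryFault:
        return "safe mode"


memory = CheckedMemory(8)
# Assumed for the example: both target addresses are incorrect.
mode = run_with_safe_mode(memory, [9, 12], 0xFF)
```

The defect is still there and the system still ends up in safe mode, but it
gets there at the moment of the bad write, with every cell left as it was.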

Upgrading software in flight is always, and foreseeably, a risky operation.
What is needed to recover from this kind of situation is a 'downgrade' facility
as well as memory safety. By downgrade, I mean a facility that allows the system
to go back to a previous state and software configuration in case an upgrade
fails. An example of a language that provides this is Erlang (see section 3.8
of <http://www.sics.se/~joe/thesis/armstrong_thesis_2003.pdf>, although that
doesn't go into much detail specifically about downgrade).
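
As a rough illustration of what I mean by downgrade, here is a small Python
sketch. It is not Erlang's mechanism, just the general idea: keep the previous
software and a snapshot of its state, and restore both if the new version
faults. The Controller class and the v1/v2 functions are made up for the
example.

```python
class Controller:
    """Hypothetical controller that keeps the last known-good software."""

    def __init__(self, software, state):
        self.software = software   # callable: state dict -> new state dict
        self.state = state
        self.previous = None       # (software, state) saved at upgrade time

    def upgrade(self, new_software):
        # Remember the current software and a copy of its state first.
        self.previous = (self.software, dict(self.state))
        self.software = new_software

    def downgrade(self):
        if self.previous is None:
            raise RuntimeError("no previous version to return to")
        self.software, self.state = self.previous
        self.previous = None

    def step(self):
        try:
            # Run on a copy so a failing version cannot half-update the state.
            self.state = self.software(dict(self.state))
            return "ok"
        except Exception:
            if self.previous is None:
                return "safe mode"   # nothing to go back to
            self.downgrade()
            return "downgraded"


def v1(state):
    """Known-working software: just counts control cycles."""
    return {**state, "ticks": state["ticks"] + 1}


def v2(state):
    """New software load with a defect that shows up at run time."""
    raise ValueError("defect in new software load")


controller = Controller(v1, {"ticks": 0})
controller.step()              # runs v1
controller.upgrade(v2)
result = controller.step()     # v2 faults, controller goes back to v1
controller.step()              # v1 carries on from the saved state
```

Note that the fault handler only works because the failure is detected as a
fault in the first place, which is where memory safety comes in.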

Going into safe mode probably should not have caused the battery radiator to
point continuously at the sun; that sounds like a separate design problem. But
a downgrade facility would have allowed a failed upgrade to very quickly cause
the spacecraft to go back to the previous, known-working software, instead of
into safe mode. (There are other kinds of plausible fault that could be
recovered from if caught soon enough, such as faults that would cause fuel or
power to be expended.)

Mars -> Earth -> Mars communication latency varies between 6.5 and 45 minutes,
which might be too long for successful recovery if the downgrade has to be
triggered from Earth after observing the fault. So downgrade should probably be
automatic if a fault occurs within the current round-trip communication time of
an upgrade.
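
The rule could be sketched like this in Python. The function name, the use of
seconds, and in particular the choice to fall back to safe mode once the
window has passed are my assumptions for the example; only the window itself
comes from the argument above.

```python
def recovery_action(upgrade_time_s, fault_time_s, round_trip_s):
    """Pick a response to a fault, given when the last upgrade happened.

    All times are in seconds. round_trip_s is the current
    Mars -> Earth -> Mars communication latency.
    """
    if fault_time_s < upgrade_time_s:
        raise ValueError("fault predates the upgrade")
    elapsed = fault_time_s - upgrade_time_s
    if elapsed <= round_trip_s:
        # Earth cannot have seen the fault and replied yet, so act locally.
        return "automatic downgrade"
    # Assumption: outside the window, wait in safe mode for ground control.
    return "safe mode"


LONGEST_ROUND_TRIP_S = 45 * 60      # 45 minutes
SHORTEST_ROUND_TRIP_S = 6.5 * 60    # 6.5 minutes

# A fault ten minutes after the upgrade, at each extreme of the latency range.
far = recovery_action(0, 10 * 60, LONGEST_ROUND_TRIP_S)
near = recovery_action(0, 10 * 60, SHORTEST_ROUND_TRIP_S)
```

With these numbers, far is "automatic downgrade" and near is "safe mode": the
same fault falls inside the window when Mars is distant and outside it when
Mars is close.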

[*] Memory safety can make errors more reproducible and visible during testing,
    but let's assume for the sake of argument that whatever defect was in the
    new software version would still have got through ground-based testing.

-- 
David Hopwood <david.nospam.hopwood@REDACTED>