Saturday, 19 November 2005

Environmentally Adaptive Fault-Tolerant Computing

This one is a Space post, and also a Software Engineering one. From NASA:
When your computer behaves erratically, mauls your data, or just "crashes" completely, it can be frustrating. But for an astronaut trusting a computer to run navigation and life-support systems, computer glitches could be fatal.

Unfortunately, the radiation that pervades space can trigger such glitches. When high-speed particles, such as cosmic rays, collide with the microscopic circuitry of computer chips, they can cause chips to make errors. If those errors send the spacecraft flying off in the wrong direction or disrupt the life-support system, it could be bad news.
Yes, an anomaly like that could ruin your whole day. "Disasters", "Catastrophes", even "Accidents" are not in the vocabulary of Rocket Science. They're all just "anomalies", meaning something didn't quite go the way you expected.
To ensure safety, most space missions use radiation hardened computer chips. "Rad-hard" chips are unlike ordinary chips in many ways. For example, they contain extra transistors that take more energy to switch on and off. Cosmic rays can't trigger them so easily. Rad-hard chips continue to do accurate calculations when ordinary chips might "glitch."

NASA relies almost exclusively on these extra-durable chips to make computers space-worthy. But these custom-made chips have some downsides: They're expensive, power hungry, and slow -- as much as 10 times slower than an equivalent CPU in a modern consumer desktop PC.
Closer to 100 times, these days.
Using the same inexpensive, powerful Pentium and PowerPC chips found in consumer PCs would help tremendously, but to do so, the problem of radiation-induced errors must be solved.

This is where a NASA project called Environmentally Adaptive Fault-Tolerant Computing (EAFTC) comes in. Researchers working on the project are experimenting with ways to use consumer CPUs in space missions. They're particularly interested in "single event upsets," the most common kind of glitches caused by single particles of radiation barreling into chips.

Team member Raphael Some of JPL explains: "One way to use faster, consumer CPUs in space is simply to have three times as many CPUs as you need: The three CPUs perform the same calculation and vote on the result. If one of the CPUs makes a radiation-induced error, the other two will still agree, thus winning the vote and giving the correct result."
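
The voting scheme described there can be sketched in a few lines of Python. This is only an illustration of the general idea, not EAFTC code: the function name majority_vote is made up for the example, and each "CPU" is simply an entry in a list of results.

```python
from collections import Counter


def majority_vote(results):
    """Return the result reported by a strict majority of replicas.

    `results` holds one (hashable) answer per replica CPU. Hypothetical
    helper for illustration only; raises ValueError if no answer has
    more than half the votes.
    """
    value, count = Counter(results).most_common(1)[0]
    if count * 2 <= len(results):
        raise ValueError("no majority among replicas")
    return value


# Three replicas, one of which suffered a simulated single event upset.
answer = majority_vote([42, 42, 41])
```

With three replicas, one wrong answer is outvoted by the other two. If all three disagree there is no majority, and the sketch raises an error rather than guessing.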

This works, but often it's overkill, wasting precious electricity and computing power to triple-check calculations that aren't critical.

"To do this smarter and more efficiently, we're developing software that weighs the importance of a calculation," continues Some. "If it's very important, like navigation, all three CPUs must vote. If it's less important, like measuring the chemical makeup of a rock, only one or two CPUs might be involved."
Personally.... I don't think this is the way to go. It adds complexity. In particular, instead of having a single, simple, brute-force "too simple to be incorrect" generic (program pattern or template), you have dozens, possibly hundreds of variations. Proving one correct doesn't help you prove the correctness of any of the others. Instead of having something so simple that there's obviously nothing wrong, it's so complex that anything wrong isn't obvious. This means additional time in testing, more resources for formal proof of correctness, and always more money.

Complex software systems often fail on delivery because of management rather than technical issues. Given enough time and money, everything could be tested and debugged. But often the time is fixed and can't be altered: there is a hard deadline (such as a launch window...) by which the system must be in service. Adding resources doesn't help linearly. In fact, the point of diminishing returns is reached so quickly that "adding manpower to a late software project makes it later", as stated in the classic text, "The Mythical Man-Month".
This is just one of dozens of error-correction techniques that EAFTC pulls together into a single package. The result is much better efficiency: Without the EAFTC software, a computer based on consumer CPUs needs 100-200% redundancy to protect against radiation-caused errors. (100% redundancy means 2 CPUs; 200% means 3 CPUs.) With EAFTC, only 15-20% redundancy is needed for the same degree of protection. All of that saved CPU time can be used productively instead.

"EAFTC is not going to replace rad-hard CPUs," cautions Some. "Some tasks, such as life support, are so important we'll always want radiation hardened chips to run them." But, in due course, EAFTC algorithms might take some of the data-processing load off those chips, making vastly greater computer power available to future missions.
In my experience, saving computer time and power is not the problem. Compared with even a small communications module, a really complex triply-redundant huge chunk of computing power takes about 1% of the power budget, albeit 200-300% of the mass and about the same volume.

The amount of redundancy you need is given by the formula (2n+1), where n is the number of simultaneous faults. FedSat used triple redundancy, and a few techniques of my own devising such as error-correcting demons and heuristics for bitwise correction, to ensure that 2 errors would have to be close together in time, and in widely separated parts of the satellite, yet in exactly the same part of the memory map, to "fool" the error correction. Basically, anything that got past that would likely have wrecked the satellite in so many other ways, most of them likely fatal, that it didn't really matter.
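
To make the arithmetic concrete, here is a minimal Python sketch. It is not the FedSat code, and the names replicas_needed and bitwise_majority are invented for the example. It assumes three copies of each memory word and votes on every bit position independently, which is one common way to do bitwise correction.

```python
def replicas_needed(simultaneous_faults):
    """Copies required to outvote n simultaneous faults: 2n + 1."""
    return 2 * simultaneous_faults + 1


def bitwise_majority(a, b, c):
    """Vote bit by bit across three copies of the same word.

    Each output bit is 1 only if at least two of the three copies
    have a 1 in that position.
    """
    return (a & b) | (a & c) | (b & c)


original = 0b1010
copy1 = original ^ 0b0001   # upset flips bit 0 in the first copy
copy2 = original ^ 0b1000   # a second upset flips bit 3 in another copy
copy3 = original            # third copy is intact

repaired = bitwise_majority(copy1, copy2, copy3)
```

Here two copies are damaged, but in different bit positions, so the vote still recovers the original word. Only two upsets landing on the same bit of the same word in different copies would get past it.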

What can I say - it's worked for 3 years continuously, dipping in and out of the South Atlantic Anomaly (a high radiation zone) 3 times a day, and has endured some of the worst solar weather and class-X flares ever recorded.

The point is, only one simple template was used, then automatically instantiated a few dozen times to deal with everything from stored telecommands to be executed when out of ground contact, to "housekeeping data" regarding error rates, battery voltages over time, and so on. Cost to design, test, and manage was minimal.
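
As a rough illustration of the "one template, many instances" idea, here is a hypothetical Python sketch. The class name Redundant, its methods, and the sample values are all assumptions made for the example. The real flight software is not shown here and was certainly not written in Python.

```python
from collections import Counter


class Redundant:
    """Hypothetical generic wrapper: keeps several copies of a value
    and takes a majority vote on every read."""

    def __init__(self, value, copies=3):
        self._copies = [value] * copies

    def write(self, value):
        self._copies = [value] * len(self._copies)

    def read(self):
        value, count = Counter(self._copies).most_common(1)[0]
        if count * 2 <= len(self._copies):
            raise ValueError("copies disagree with no majority")
        # Rewrite every copy with the winner so an upset does not linger.
        self._copies = [value] * len(self._copies)
        return value

    def corrupt(self, index, value):
        """Test hook that simulates an upset in a single copy."""
        self._copies[index] = value


# The same template instantiated for two unrelated kinds of data.
telecommand = Redundant(("REBOOT_PAYLOAD", 1200))
battery_millivolts = Redundant(7400)

telecommand.corrupt(1, ("REBOOT_PAYLOAD", 9999))
command = telecommand.read()
```

The point of the design is that only Redundant itself has to be proved and tested. Every instance inherits that confidence, whatever it happens to store.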

I'm all in favour of the "build it and see" approach used on EAFTC. But there are scalability problems. If it were me, I'd see if I could have a much simpler universal 5x or even 7x redundancy in use for everything. OK, it's wasteful: but it does mean that when a micrometeorite hits the life support computer, you can use the geological analyser to take over without loss of reliability.

The savings in time and money would be partly offset by having to cope with an additional 2-3 kilos of equipment, and another kilo to power it. This tiny change could easily cost a million dollars in development resources, a larger rocket, etc. Mass and power are always at a premium. But it might save hundreds of millions in software and systems testing. Configuration management too. Who knows, a reliability analysis might show that with a "one size fits all" approach, fewer bespoke spares would be needed on long-duration missions. That means less mass, not more.
