Thursday, 23 September 2004

Crashproof Code

SPECTRUM, the IEEE magazine, has an interesting article on Crashproof Code. It had better be crashproof: it's the avionics software for the new model of the F/A-18 fighter.

[Image: F/A-18 Test Bed]

To give you an idea of the types of things you have to do when writing 100% no-fail code, have a look at the picture on the right. See the small grey box in the bottom-right of the picture? The one about the size of a briefcase? That's the segment of the system that's being tested. All the rest is part of a Test Bed, purely and simply for verifying that the thing works as advertised. This graphically illustrates the effort required in testing such systems.

What you can't see is all of the code that's been written as part of the Test Bed. It's not unusual for 1 line of code to require 2-5 lines of unit testing code, then another 2 lines of configuration item testing code, another 2 lines of system testing code on an emulator (when the hardware isn't available), another few lines of system testing code for the live system (real hardware) module test, which is what's shown in the picture, and another line or so for the live system integration test, with everything put together and running on the real aircraft. Fortunately, a lot of this code can be re-used, and in the line count above, I'm only counting new lines, not re-used ones. Test beds can easily be 10 or more times the size of the system being tested.
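To see how those per-level figures add up to the "10 or more times" claim, here's a back-of-the-envelope sketch in Python. The unit, configuration item and emulator numbers are the ones quoted above; reading "a few lines" as 3 and "a line or so" as 1 is purely my assumption for the sake of the arithmetic.

```python
# New test lines written per new line of flight code, as (low, high) per level.
# Unit, CI and emulator figures are from the text; (3, 3) and (1, 1) are
# assumed readings of "a few lines" and "a line or so".
TEST_LINES_PER_CODE_LINE = {
    "unit": (2, 5),
    "configuration item": (2, 2),
    "system (emulator)": (2, 2),
    "live module test": (3, 3),
    "live integration test": (1, 1),
}


def overhead_range(ratios):
    """Sum the low and high estimates across all test levels."""
    low = sum(lo for lo, _ in ratios.values())
    high = sum(hi for _, hi in ratios.values())
    return low, high


low, high = overhead_range(TEST_LINES_PER_CODE_LINE)
print(f"{low} to {high} lines of test code per line of flight code")
```

With those assumed readings, the total comes to 10 to 13 lines of test code for every line of the system itself.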

In spacecraft, which have very expensive and 'clean-room-only' flight hardware, there may even be an intermediate step - with a 'flight model' which will actually go into space, and a 'ground model' which is identical, but doesn't require clean-room conditions and can be worked with more conveniently.

Of course once everything's working on the ground model, you then load it onto the flight model and put it through a thermal-vacuum torture chamber, which mimics conditions in LEO (except for the radiation). Sorry, I'll provide subtitles - LEO is Low Earth Orbit.

From the SPECTRUM article:
The flight software has to keep track of the plane's speed, altitude, and attitude while monitoring the pilot's controls for commands. Based on a set of rules known as control laws, the software must then translate any commands from the pilot into movements of the aircraft's various control surfaces, such as the rudders or, most significantly, the flaps that flex the AAWs. And this all has to happen fast enough that the plane responds instantly to the pilot and reliably enough that he can bet his life it will work all the time, every time.

Despite its complex and critical job, the flight software is compact, consisting of only about 13 000 lines of source code written in the Ada language. When compiled, the code fits into approximately 160 kilobytes. Compare this with the millions of lines of code that compile into tens of megabytes for a modern Web browser or word processor.
It's not that the software's task is easier than a word processor's: far from it. The task is much harder and more complex. It's only the solution that's simpler (though most emphatically not easier!).

That's why people like myself get a bit steamed up about the quality, or lack thereof, in most commercial software. It's bloated, it's buggy, it's inefficient, and it actually takes longer to develop that way due to its bugginess. Make a change and something will break - which may be due to the change you've made having a problem, or it may be a long-standing problem never revealed before. Fix that, and you've made another change, which can lead to more problems... All programmers are familiar with this phenomenon.

Getting back to the article:
Flight software is fundamentally different from the type of software most of us encounter on the desktop, or even the software that runs such enterprise-class applications as banking databases, and not just in size. For one thing, flight software must operate in real time. We're all used to the spinning hourglasses and watches that appear regularly on our computer screens; they're telling us that the print preview or new spreadsheet we just asked for is on its way but that the computer doesn't know quite when (if ever) it will appear.

The problem goes beyond the vagaries of office software. It is fundamental to many of the operating systems used in general-purpose computers—they have no way to guarantee how long a given task will take. Of course, most of the time this unreliability isn't a problem. If your media player has to drop a few frames of a movie because the video couldn't be processed fast enough, or if it takes 150 milliseconds to select an e-mail when it normally takes 10, you're not going to notice. The worst-case scenario, when the computer completely hangs, is usually just a blip in the workday and cured by a quick reboot.

We don't have that luxury. We have to guarantee that when the flight computer starts calculating how far an elevator should move in response to a command by the pilot, the job will be finished quickly enough so that the computer has enough time to calculate where all the other control surfaces should be and still appear to be responding instantly to the pilot's wishes. This is mission-critical, real-time operation.
And in Space, no-one can press CTRL-ALT-DEL. That's one of the reasons why I am less than enthused with the push towards using COTS (Commercial Off The Shelf) operating systems for real-time work.
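The kind of guarantee the article is talking about can be sketched as a simple budget check: add up the worst-case execution time of everything that must run in one control-loop frame, and confirm it fits. This is only an illustration in Python, not how the F/A-18 software does it; the task names, the timings and the 80 Hz frame rate are all made-up assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    name: str
    wcet_ms: float  # worst-case execution time (assumed measured or analysed)


def frame_slack_ms(tasks, frame_ms):
    """Spare time left in one frame if every task takes its worst case."""
    return frame_ms - sum(t.wcet_ms for t in tasks)


def is_schedulable(tasks, frame_ms):
    """True if all tasks are guaranteed to finish within the frame."""
    return frame_slack_ms(tasks, frame_ms) >= 0


# Hypothetical figures, for illustration only.
FRAME_MS = 12.5  # an assumed 80 Hz control loop
TASKS = [
    Task("read_stick", 1.0),
    Task("control_laws", 6.0),
    Task("command_surfaces", 3.0),
]

print(is_schedulable(TASKS, FRAME_MS), frame_slack_ms(TASKS, FRAME_MS))
```

The point is that the check uses the worst case, not the average. A desktop operating system can't give you a trustworthy worst-case number, so there's nothing to put in the table.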

We didn't attempt to test the flight software all in one piece. In that situation, when problems arise, trying to pinpoint the error in the software is almost impossible. Instead, with testing already in mind, we created the flight software as a collection of about 450 independent modules. Each module was responsible for performing one or more simple functions, such as checking the position of the pilot's control stick or computing what position an elevator should be in.

"Design for Testability". When doing the basic architecture, two things have to be kept in mind: it must be testable, and it must be buildable. The latter means that you have to tailor your technical solution towards the resources you've got.

If you have 7 teams, try to have exactly 7 "configuration items" (CIs), corresponding to the output of each team. 7 is a minimum: if you have fewer, you're covering up buildability issues and inter-team communication problems that should be exposed. Putting a bandage on a gangrenous wound. Of course, why did you have 7 teams in the first place? That in itself is worthy of its own article. It's a complex issue determined at least as much by what human resources you have as what the technical problems are. If all you have is a hammer, try to make the problem look like a nail.

So from the top down, the high-level CIs are based on management and buildability issues (and also on any areas of special technical risk, but I'm trying to keep this as simple as possible).

[Image: F/A-18 in flight]

From the bottom up, your "Units" should be based on testability: break up the system into components where each component or unit is the smallest possible segment that it makes sense to test independently. Usually a "chunk" of the system will do more than one thing. Even with the system above, something as simple as "checking the position of the pilot's control stick" will have functions for determining whether the result makes sense (or is the consequence of a failure of some sort), possible smoothing of minute deviations, reporting position of the control stick, rate-of-change and rate-of-change-of-rate-of-change (i.e. 1st and 2nd derivatives wrt time), a "heartbeat" that just indicates to a master controller that the module/unit is still working and doesn't need re-setting, and probably a diagnostics test. Now all of these things could be in their own separate units: but if so, the number of units would explode - and so would the costs, and the amount of inter-unit communication, and the difficulty of making it work. So you compromise. That's why I used the phrase "that it makes sense" to test independently.
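To make that compromise concrete, here's a rough sketch in Python of what one such unit might look like, with the sanity check, smoothing, rate-of-change and heartbeat all living together, followed by the sort of unit test you'd write for it. Everything here is hypothetical: the class name, the travel range, the smoothing factor and the choice of a simple exponential filter are my own assumptions for illustration, not anything from the real flight software (which is in Ada).

```python
class StickSensorUnit:
    """Hypothetical control-stick unit: validity, smoothing, rate, heartbeat."""

    def __init__(self, lo=-1.0, hi=1.0, alpha=0.5):
        self.lo, self.hi = lo, hi  # assumed valid travel range
        self.alpha = alpha         # assumed smoothing factor (0..1)
        self.position = 0.0        # last accepted, smoothed position
        self.rate = 0.0            # 1st derivative of position wrt time
        self.heartbeat = 0         # bumped every cycle: "I'm still alive"

    def update(self, raw, dt):
        """Process one raw reading; return False if it is rejected."""
        self.heartbeat += 1
        if not (self.lo <= raw <= self.hi):
            return False           # out of range (or NaN): likely a failure
        smoothed = self.alpha * raw + (1 - self.alpha) * self.position
        self.rate = (smoothed - self.position) / dt
        self.position = smoothed
        return True


# A tiny unit test: one good reading, then one impossible reading.
unit = StickSensorUnit()
assert unit.update(1.0, dt=0.5)        # accepted
assert unit.position == 0.5 and unit.rate == 1.0
assert not unit.update(5.0, dt=0.5)    # rejected as out of range
assert unit.position == 0.5            # state unchanged by the bad reading
assert unit.heartbeat == 2             # but the unit still reports it's alive
```

Five related jobs, one unit, one test file. Split them into five units and you'd need five test harnesses plus tests for all the plumbing between them.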

This stuff is hard. But the reward you get when the bloody thing works... that's priceless.
