
Software Bug Caused Qantas Airbus A330 To Nose-Dive 603

Posted by Unknown Lamer
from the bugs-on-a-plane dept.
pdcull writes "According to Stuff.co.nz, the Australian Transport Safety Board found that a software bug was responsible for a Qantas Airbus A330 nose-diving twice while at cruising altitude, injuring 12 people seriously and causing 39 to be taken to the hospital. The event, which happened three years ago, was found to be caused by an airspeed sensor malfunction, linked to a bug in an algorithm which 'translated the sensors' data into actions, where the flight control computer could put the plane into a nosedive using bad data from just one sensor.' A software update was installed in November 2009, and the ATSB concluded that 'as a result of this redesign, passengers, crew and operators can be confident that the same type of accident will not reoccur.' I can't help wondering just how a piece of code, which presumably didn't test its input data for validity before acting on it, could become part of a modern jet's onboard software suite?"
  • by Hadlock (143607) on Tuesday December 20, 2011 @12:19AM (#38430732) Homepage Journal

    I can't help wondering just how could a piece of code, which presumable didn't test its' input data for validity before acting on it, become part of a modern jet's onboard software suit?"

    This from the same company that, while building the A380 megajet, upgraded half of its facilities to plant software version 5 while the other half stuck with version 3/4, and did not make the file formats compatible between the two versions, resulting in multi-month production delays.
     
    Point being, in huge projects, simple things get overlooked (with catastrophic results). My favorite is when we slammed a $20 million NASA/ESA probe into the surface of Mars at high speed because some engineer forgot to convert mph into kph (or vice-versa).

    • Re:Bad software (Score:5, Informative)

      by RealGene (1025017) on Tuesday December 20, 2011 @01:03AM (#38431044)

      My favorite is when we slammed a $20 million NASA/ESA probe into the surface of Mars at high speed because some engineer forgot to convert mph into kph (or vice-versa).

      No, it was because two different pieces of software were used to calculate impulse. The spacecraft software calculated impulse correctly in newton-seconds.
      The ground software calculated impulse in pound-force seconds. This was contrary to the software interface specification, which called out newton-seconds.
      The result was that the ground-calculated trajectory was more than 20 kilometers too close to the surface.
      The engineers didn't "forget to convert", they failed to read and understand the specifications.
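The failure mode described here, one side computing in pound-force seconds while the spec calls for newton-seconds, is easy to make concrete. A minimal sketch with hypothetical thruster numbers (not actual Mars Climate Orbiter data):

```python
# Sketch of a pound-force vs. newton unit mismatch (hypothetical values).
# The interface spec called for newton-seconds; the ground software
# emitted pound-force seconds, so every figure was off by one constant.

LBF_TO_N = 4.44822  # 1 pound-force = 4.44822 newtons

def ground_impulse_lbf_s(thrust_lbf, burn_s):
    """Ground software: computes impulse, but in lbf-s (violates the spec)."""
    return thrust_lbf * burn_s

def spacecraft_expects_n_s(impulse):
    """Spacecraft software: treats whatever it receives as N-s."""
    return impulse  # no unit tag, no sanity check

# A hypothetical 10 lbf thruster fired for 3 seconds:
reported = ground_impulse_lbf_s(10.0, 3.0)   # 30.0, meant as lbf-s
assumed = spacecraft_expects_n_s(reported)   # read as 30.0 N-s
actual = reported * LBF_TO_N                 # ~133.45 N-s really delivered

# Internally self-consistent math, off by a constant factor of ~4.45.
error_factor = actual / assumed
```

The math on each side is correct in isolation; the error only exists at the interface, which is exactly why it survived testing.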

      • Re:Bad software (Score:5, Insightful)

        by Sir_Sri (199544) on Tuesday December 20, 2011 @02:01AM (#38431396)

        From out here it's hard to distinguish between 'forgot what the specification said they should do' and 'didn't bother to read it in the first place'. Even if your 10 testing guys knew it was in the specification, that doesn't mean they necessarily understood how to test it properly; maybe they did some sort of relative test (input of x should come out to be 10x, in a simple example). The problem with using the wrong unit of measure is that the math is, in isolation, all correct and self-consistent; it's just off by a constant, which happens to be enough to cause catastrophic failures.

        In the case of the aircraft using only one sensor in the article: did it read in data from all the sensors and just ignore some of the input? Did it average the inputs (which, naively, isn't a bad answer, but fails badly when you have really wonky data)? Was there some race condition in the resolution between multiple sensors? That's a fun one: maybe it works on data on polling intervals, and in very rare cases it can read data from only one sensor and not the others, and so on. Even if you know the specification it can be tricky to implement (and to realize all of the things that can go wrong; it's not like all of the people doing the calculations are necessarily experts in distributed systems, they might be experts in physics and engineering).

        Doing something simple like taking the average of an array can fail in really bad ways. What if the array isn't populated on time? How do you even know if the array is fully populated? How does my average handle out-of-bounds numbers? How about off-by-10^6 numbers? Does old data just hang around in those memory addresses, and if so, what happens to it? A lot of those underlying problems, especially how the array (or in this case probably a handful of floats) is populated, and whether the code knows it is properly populated, are handled by the implementation of the language, which is well beyond the people who actually do most of the programming. And not everyone thinks 'hey, for every line of code I need to go and check that the assembler version doesn't have a bizarre race condition in it', assuming you could even find the race conditions in the first place.
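The point about averaging failing badly on wonky data is easy to demonstrate. A toy sketch with illustrative values (not the A330's actual algorithm): with three redundant sensors, a plain mean is dragged arbitrarily far by one spiking sensor, while a median tolerates one bad channel out of three:

```python
# Mean vs. median voting across three redundant sensors (toy example).

def mean_vote(readings):
    """Naive average: every reading, good or bad, pulls the result."""
    return sum(readings) / len(readings)

def median_vote(readings):
    """Median: a single wild outlier cannot move the result."""
    s = sorted(readings)
    return s[len(s) // 2]

good = [2.1, 2.0, 2.2]     # three angle-of-attack sensors, degrees
spiked = [2.1, 50.7, 2.2]  # one sensor emits a data spike

# mean_vote(spiked) is dragged far from truth by the single spike;
# median_vote(spiked) still returns a plausible value.
```

This is why voting schemes in redundant systems typically compare against the median rather than the mean, as the report excerpts further down in this thread describe.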

  • by fche (36607) on Tuesday December 20, 2011 @12:22AM (#38430766)

    "I can't help wondering just how could a piece of code, which presumable didn't test its' input data for validity before acting on it, become part of a modern jet's onboard software suit?""

    How about reading the darned final report, conveniently linked in your own blurb? There was lots of validity checking. In fact, some of it was relatively recently changed, and that accidentally introduced this failure mode (the 1.2-second data spike holdover). (Also, how about someone spell-checking submissions?)

    • by inasity_rules (1110095) on Tuesday December 20, 2011 @12:33AM (#38430846) Journal

      Mod parent up. Anyhow, information from a sensor may be valid but inaccurate. I deal with these types of systems regularly (not in aircraft, but control systems in general), and it is sometimes impossible to tell without extra sensors. It's one thing to detect a "broken wire" fault, and a completely different thing to detect a 20% calibration fault, for example, so validity checking can only take you so far. It's actually impressive that the failure mode in this case caused so little damage.

      • by wvmarle (1070040) on Tuesday December 20, 2011 @01:11AM (#38431090)

        Agreed, valid but inaccurate.

        Though such an airliner will have more than one air speed sensor, no? Relying on just one sensor for such a vital piece of information would be crazy. And that makes it even more surprising to me that a single malfunctioning air speed sensor causes such a disaster. But then it's the same kind of issue that's been blamed for an Air France jet crashing into the ocean: malfunctioning sensors, in that case due to ice buildup or so IIRC, and as all sensors were of the same design this caused all of them to fail.

        Another thing: I remember that when Airbus introduced their fly-by-wire aircraft, they stressed that one of the safety features to prevent problems caused by computer software/hardware bugs, was to have five different flight computer systems built and designed independently by five different companies, using different hardware. So that if one computer has an issue causing it to malfunction, the other four computers would be able to override this. And a majority of those computers should agree with one another before an airplane control action would be undertaken.

        • by inasity_rules (1110095) on Tuesday December 20, 2011 @01:26AM (#38431178) Journal

          I'm sure they must have more than one sensor. Perhaps even more than one sensing principle is involved. The problem with the system of having multiple computers vote is that we tend to solve problems in similar ways, so if there is a logic error in one machine (as opposed to a typo), it is fairly likely to be repeated in at least two of the other machines. Some sets of conditions are very hard to predict and design for, even in the most simple systems. I often see code (when updating a system) that does not account for every possibility, because either everyone considers that combination unlikely, or nobody thought of it in the first place (until it happens, of course...). Being a perfectionist in this business is very costly in development time.

          The fact is, a complex system such as an aircraft could easily be beyond human capability to perfect the first time, or to test completely.

          • by wvmarle (1070040)

            The way to prevent typos (and with that, bugs) being copied to other systems is to make sure your systems are designed by independent companies. The chance of having the exact same bugs in two independently developed systems is really small. Make that three different systems, and set up a majority vote system to be pretty sure you've got the correct value.

            Aircraft are very complex systems indeed. Yet the results of failure are generally pretty bad, and it's hard to make an aircraft fail safe - so everything
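The majority-vote scheme described above can be sketched in a few lines. This is an illustrative toy, not Airbus's actual arrangement: each independently developed computer reports a value, and a value is acted on only if a strict majority agree:

```python
from collections import Counter

def majority_vote(values):
    """Return the value reported by a strict majority of channels, else None."""
    value, count = Counter(values).most_common(1)[0]
    if count > len(values) // 2:
        return value
    return None  # no majority: treat the reading as suspect, take no action

# Two healthy computers outvote one buggy one:
vote = majority_vote([305, 305, 9999])

# Three-way disagreement yields no actionable value:
no_vote = majority_vote([305, 310, 9999])
```

Note the caveat raised in the reply below: voting only helps against independent faults. A shared logic error, or a wrong functional specification implemented faithfully three times, wins the vote.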

            • That only covers mistakes like typos. The bugs independent systems do not cover are logical and conceptual errors. If the functional specification is wrong to start with, all the versions will have similar or identical issues. Fail safe is a tricky thing in an aircraft. Systems and sensors will fail. No matter how many or how well designed. If two out of three air speed sensors fail for example, then there will be big problems. The point being that there is no 100% safety.

        • Re: (Score:3, Informative)

          by jrumney (197329)

          But then it's the same kind of issue that's been blamed on an Air France jet crashing into the ocean - malfunctioning sensors, in that case ice buildup or so iirc, and as all sensors were of the same design this caused all of them to fail.

          The official report for that came out a week or so ago. The only effect that the malfunctioning sensors had in that case was to put the copilots back in control of the plane so they could proceed to attempt to climb above the limits of the aircraft, and continue to pull ba

    • by Animats (122034) on Tuesday December 20, 2011 @01:50AM (#38431322) Homepage

      How about reading the darned final report. [atsb.gov.au]

      I highly recommend that. It's a good read. This was not a sensor problem. The problem actually occurred in the message output queue of one of the CPUs, and resulted in sending data with the label for one data item with the data from another. The same hardware unit had demonstrated similar symptoms two years earlier, but the problem could not be replicated. This time, they tried really hard to induce the problem, with everything from power noise to neutron bombardment, and were unable to do so.

      There are several thousand identical hardware units in use, and one of the others demonstrated a similar problem, once. No other unit has ever demonstrated this problem. The investigators are still puzzled. The unit which produced the errors has been tested extensively and the problem cannot be reproduced. They considered 24 different failure causes and eliminated all of them. It wasn't a stuck bit. It wasn't program memory corruption. (The code gets a CRC check every few seconds.) The code in ROM was what it was supposed to be. Thousands of other units run exactly the same software. It wasn't a single flipped bit. It wasn't a memory timing error. It wasn't a software fault. It looked like half of one 32-bit word was combined with half of another 32-bit word during queue assembly, on at least some occasions. But there are errors not explained by that.

      Very frustrating.

      • by Anonymous Coward on Tuesday December 20, 2011 @02:33AM (#38431586)

        Posting anon because I moderated.

        I had a very similar problem once with firmware on a TI DSP. The symptom was that a Peltier element for controlling laser temperature would sometimes freak out and start burning so hot that the solder melted. After some debugging, it turned out that somewhere between the EEPROM holding the setpoint and the A/D converter, the setpoint value got corrupted.

        The cause turned out to be a 32-bit variable that was uninitialized, but always set to 0 by the stack initialization code.
        Only the first 16 bits were filled in, because that was the width of the value stored in the EEPROM. The programming bug was that the other 16 bits were left as-is. More than 99% of the time, this was not a problem. But if a specific interrupt happened at exactly the wrong moment during initialization of the stack variable, that variable was filled with garbage from an interrupt register value. Since the calculations for the setpoint used the entire 32 bits (it was integer math), it came out with a ridiculously high setpoint.

        Having had to debug that, I know how hard it can be if your bug depends on what is going on inside the CPU, or is related to interrupts.
        There may be a window of less than a microsecond for this bug to happen, so reproduction could be nigh on impossible.
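The bug described above can be re-enacted in a few lines. A hypothetical sketch (Python standing in for the DSP firmware): the low 16 bits of a 32-bit setpoint come from EEPROM, while the high 16 bits keep whatever happened to be in the stack slot:

```python
def load_setpoint(eeprom_value, stale_high_bits):
    """Only the low 16 bits are written from EEPROM; the high 16 bits
    are left holding whatever was in the slot beforehand."""
    return ((stale_high_bits & 0xFFFF) << 16) | (eeprom_value & 0xFFFF)

# Normally the stack slot was zeroed, so the setpoint is sane:
ok = load_setpoint(0x0400, 0x0000)   # 1024

# If an interrupt fired at exactly the wrong instant, the high half
# held register garbage, and integer math on the full 32 bits then
# produced an absurdly high setpoint:
bad = load_setpoint(0x0400, 0x7FFF)  # ~2.1 billion
```

The >99% benign rate falls out naturally: the bug is invisible unless the interrupt lands inside the tiny initialization window.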

      • by perpenso (1613749) on Tuesday December 20, 2011 @02:57AM (#38431700)

        It looked like half of one 32-bit word was combined with half of another 32-bit word during queue assembly on at least some occasions. But there are errors not explained by that.

        This is why I like fuzzing: sending random and/or corrupted data to software to evaluate its robustness and sensitivity to corrupted inputs. For a project like this I would like to send simulated inputs from regression tests, and recorded data from actual flights, to the software while fuzzing each playback; repeat. Let a system sit in the corner running such tests 24/7.

        In theory some permutation of the data should eventually resemble what you describe.
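A minimal harness in that spirit might look like the following sketch. `process_frame` is a hypothetical stand-in for the system under test; the harness replays recorded frames while flipping random bits and counts unexpected failures:

```python
import random
import struct

def process_frame(frame):
    """Toy consumer: unpack two floats and reject implausible values."""
    a, b = struct.unpack("<ff", frame)
    if not (abs(a) < 1e6 and abs(b) < 1e6):  # NaN/inf also fail this check
        raise ValueError("rejected corrupted input")
    return a + b

def fuzz(frames, rounds=1000, seed=0):
    """Replay recorded frames with one random bit flipped each round."""
    rng = random.Random(seed)
    crashes = 0
    for _ in range(rounds):
        frame = bytearray(rng.choice(frames))
        frame[rng.randrange(len(frame))] ^= 1 << rng.randrange(8)
        try:
            process_frame(bytes(frame))
        except ValueError:
            pass              # input correctly rejected: desired behavior
        except Exception:
            crashes += 1      # robustness bug: unexpected failure mode
    return crashes

recorded = [struct.pack("<ff", 123.4, 56.7)]  # stand-in for flight data
bugs = fuzz(recorded)
```

Real fuzzing campaigns would also mutate timing and message ordering, not just payload bits, which matters for the queue-assembly fault described above.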

        • by MartinSchou (1360093) on Tuesday December 20, 2011 @06:20AM (#38432516)

          In theory some permutation of the data should eventually resemble what you describe.

          True ... but you may not ever have enough time to hit all the corner cases.

          If it's a single 32-bit word that can cause the issue, then yes, you can go through every single permutation fairly quickly. There are only 4,294,967,296 of them - nothing that a computer can't handle.

          Suppose for a moment that the issue is caused not by one single faulty piece of data, but by two right after each other. Essentially, a 64-bit word causes the issue. Now we're looking at 18,446,744,073,709,551,616 permutations. Quite a bit more, but not impossible to test.

          Now suppose that the first 64-bit word doesn't cause the fault on its own, but "simply" causes an instability in the software. That instability will be triggered by another specific 64-bit word. Now we're looking at 3.40282367 x 10^38 permutations.

          Now, keep in mind that at this point, we're really looking at a fairly simple error triggered by two pieces of data. One sets it up, the other causes the fault.

          Now let's make it slightly more complex.

          The actual issue is caused by two different error conditions happening at once. If they are similar as above, we're now looking at, essentially, a 256-bit word. That's 1.15792089 x 10^77 permutations.

          In comparison, the world's fastest supercomputer can do 10.51 petaflops, which is 10.51 x 10^15 operations per second, and it would take that computer 0.409 microseconds to go through all permutations of a 32-bit word. About 30 minutes for a 64-bit word. 10^15 years for a 128-bit word and 10^53 years for a 256-bit word.

          Yes, you can test every single permutation, if the problem is small enough. But the problem with most software is that it really isn't small.

          Even if we are only talking about 32-bit words causing the issue, will it happen every time that single word is issued, or do you need specific conditions? How is that condition created? As soon as the issue becomes even slightly complex, it becomes essentially impossible to test for.
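The arithmetic above is easy to check. A quick sketch, taking the cited 10.51 petaflops at face value and charging one operation per candidate input:

```python
OPS_PER_SEC = 10.51e15            # cited fastest supercomputer, ops/second
SECONDS_PER_YEAR = 365.25 * 24 * 3600

def brute_force_seconds(bits):
    """Time to enumerate every n-bit input at one input per operation."""
    return (2 ** bits) / OPS_PER_SEC

t32 = brute_force_seconds(32)                       # ~0.41 microseconds
t64_min = brute_force_seconds(64) / 60              # ~29 minutes
t128_yr = brute_force_seconds(128) / SECONDS_PER_YEAR  # ~1e15 years
t256_yr = brute_force_seconds(256) / SECONDS_PER_YEAR  # ~3.5e53 years
```

The numbers in the comment check out: exhaustive enumeration falls off a cliff somewhere between 64 and 128 bits of state.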

  • by holophrastic (221104) on Tuesday December 20, 2011 @12:24AM (#38430774)

    we're going to see a huge change in programming methods coming pretty soon. Today, A.I. is still math and computer based. The problem is that data, input, and all of the algorithms you're going to write can result in a plane nose-diving -- even though no human being has ever chosen to nose-dive under any scenario in a commercial flight.

    Why was an algorithm written that could do something that no one has ever wanted to do?

    The shift is going to be when psychology takes over A.I. from the math geeks. It'll be the first time that math becomes entirely useless because the scenarios will be 90% exceptions. It'll also be the first time that psychology becomes truly beneficial -- and it'll be the direct result of centuries of black-box science.

    That's when the programming changes to "should we take a nose-dive? has anyone ever solved anything with a nose-dive? are we a fighter jet in a dog fight like they were?" Instead of what is it now: "what are the odds that we should be in a nose-dive? well, nothing else seems better."

    • by RightwingNutjob (1302813) on Tuesday December 20, 2011 @12:28AM (#38430806)

      Instead of what is it now: "what are the odds that we should be in a nose-dive? well, nothing else seems better."

      Probably more like, "the sensor spec sheet says it's right 99.99999% of the time. may as well assume it's right all the time".

      The devil almost surely lives on a set of zero measure.

      • by holophrastic (221104) on Tuesday December 20, 2011 @12:37AM (#38430872)

        yup. all the while forgetting that while the altimeter shows altitude, it rarely actually measures distance to the ground; it measures air pressure, and then assumes an awful lot.

        • by wvmarle (1070040)

          Interesting one indeed. Altitude could be a tough thing to measure.

          For starters: what is one's current altitude? What is your reference point? The ground level at that point? Changes quickly when passing over mountainous terrain. Or the height compared to sea level? Which is also tricky, as the earth's gravitational field is not uniform and sea level is far from a perfect flattened sphere around the Earth's centre.

          And how about GPS based altitude measurements? That's easily accurate to within a few meters, less than the size

    • by jamesh (87723) on Tuesday December 20, 2011 @01:00AM (#38431022)

      A better use of psychology will be to examine the heads of anyone who wants to throw maths out of the window and engage psychologists when designing AI algorithms.

    • we're going to see a huge change in programming methods coming pretty soon. Today, A.I. is still math and computer based. The problem is that data, input, and all of the algorithms you're going to write can result in a plane nose-diving -- even though no human being has ever chosen to nose-dive under any scenario in a commercial flight.

      There are some humans alive today who have wisely done so to the point of causing injuries to recover from stalls real and imagined.

        absolutely. and out of all who have, look at what it took for them to choose to do so. and look at how many times it's happened. it takes a huge decision for a pilot to decide to do it. it's not a single reading from a single instrument.

        i'd say that there's no single malfunctioning device that could get a pilot to do that. in fact, I don't think any incorrect information could do it. the only malfunction to make it happen would probably need to be in the pilot.

    • even though no human being has ever chosen to nose-dive under any scenario in a commercial flight. Why was an algorithm written that could do something that no one has ever wanted to do?

      Is that something you are saying from knowledge or just making up? I was under the impression that getting the nose pointed down was a fairly 'normal' thing for a pilot to do when faced with a stalling plane. Indeed, keeping the nose up [flightglobal.com] can be precisely the wrong thing to do.

      • lowering the nose, yes, absolutely. nose-dive, no. the kind of thing that injures passengers is not standard anything.

    • by Sarten-X (1102295)

      That's assuming that the computer knows what a "nose-dive" even is, or why it's (usually) a bad thing. It would have to know every problem, every tactic, and every risk, and nothing would actually be safer, though the program would be far more complex.

      Instead, the "psychological" program thinks "We're going a lot slower than we should for this altitude. Oh no! We're going to stall, and it's only by sheer luck that we haven't already! Why are we this high, anyway? The pilot told me to go this high, but mayb

    • Basing A.I. on psychology will be only a stop-gap measure, on the way to the true solution to this sort of problem: basing A.I. on evolutionary anthropology. You see, both the crew and the passengers can be modeled as a tribe, trying to adapt their stable-trajectory-based culture to changing conditions, namely a nose dive. As more and more air tribes experience such disruptions to their familiar environment, you will find that some develop better coping strategies than others. After a number of generation

  • by junglebeast (1497399) on Tuesday December 20, 2011 @12:30AM (#38430826)

    I can't help wondering just how could a piece of code, which presumable didn't test its' input data for validity before acting on it, become part of a modern jet's onboard software suit?"
    ---

    I'm surprised there are people who think that we have the technology to program computers to make decisions about how to control things like airplanes better than a human being.

    Computers excel at solving mathematical problems with definitive inputs and outputs, but our attempts to translate the problem of controlling an airplane, or an organism, into a simple circuit...will necessarily be limiting.

    They can only test that the computer program will behave as expected, but there is no test to prove that the behavior we attempted to implement is actually a "good" way to behave under all circumstances.

    • by jklovanc (1603149)

      Take a look at this [wikipedia.org] incident. The autopilot did everything right, but lack of action, poor decision making and disorientation by the pilots caused a 747 to roll out of control.
      The pilots did the following things wrong:
      1. Failed to descend to the correct altitude before attempting an engine restart.
      2. Failed to notice the extreme inputs the autopilot was using that did not correct the roll (the pilot should have used some rudder to help the autopilot).
      3. Became fixated on the engine issue when he should have l

  • What? (Score:5, Informative)

    by Spikeles (972972) on Tuesday December 20, 2011 @12:37AM (#38430870)

    "I can't help wondering just how could a piece of code, which presumable didn't test its' input data for validity before acting on it, become part of a modern jet's onboard software suit?"" - pdcull

    What are you, some kind of person that doesn't read the actual articles or documents? Oh wait... this is Slashdot. Here, let me copy-paste some text for you:

    If any of the three values deviated from the median by more than a predetermined threshold for more than 1 second, then the FCPC rejected the relevant ADR for the remainder of the flight.

    The FCPC compared the three ADIRUs’ values of each parameter for consistency. If any of the values differed from the median (middle) value by more than a threshold amount for longer than a set period of time, then the FCPC rejected the relevant part of the associated ADIRU (that is, ADR or IR) for the remainder of the flight.

    So there you go, there actually really was validity checking performed. Multiple times per second in fact, by three separate, redundant systems. Unfortunately all 3 systems had the bug. Here is the concise summary for you:

    The FCPC’s AOA algorithm could not effectively manage a scenario where there were multiple spikes such that one triggered a memorisation period and another was present 1.2 seconds later. The problem was that, if a 1.2-second memorisation period was triggered, the FCPCs accepted the next values of AOA 1 and AOA 2 after the end of the memorisation period as valid. In other words, the algorithm did not effectively handle the transition from the end of a memorisation period back to the normal operating mode when a second data spike was present.
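The flaw in that summary can be sketched as follows. This is a simplified reconstruction from the report's wording, not the actual FCPC code, and the threshold value is illustrative; the key point is that the first sample accepted after a memorisation period expires was not re-checked against the median:

```python
HOLD_S = 1.2       # memorisation period described in the report
THRESHOLD = 10.0   # illustrative deviation threshold, degrees

class AoaFilter:
    """Simplified single-channel sketch of the spike-handling logic."""
    def __init__(self):
        self.hold_until = None
        self.last_good = 0.0

    def step(self, t, value, median):
        if self.hold_until is not None:
            if t < self.hold_until:
                return self.last_good      # inside hold: use pre-spike value
            self.hold_until = None
            self.last_good = value         # BUG: accepted without re-check
            return value
        if abs(value - median) > THRESHOLD:
            self.hold_until = t + HOLD_S   # spike: start memorisation period
            return self.last_good
        self.last_good = value
        return value

f = AoaFilter()
f.step(0.0, 2.0, 2.0)            # normal data: passes through
f.step(0.1, 50.0, 2.0)           # first spike: held at the old value
leaked = f.step(1.4, 50.0, 2.0)  # second spike ~1.2 s later: accepted
```

So the validity check existed and worked for an isolated spike; it was the transition out of the memorisation period, with a second spike present, that the algorithm mishandled.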

  • by kawabago (551139) on Tuesday December 20, 2011 @12:38AM (#38430882)
    Airbus poached engineers from Toyota!
  • by oneblokeinoz (2520668) on Tuesday December 20, 2011 @02:00AM (#38431388)
    DISCLAIMER: I hate air travel, but do it most weeks.

    I have worked in and around the safety critical software industry for over 20 years. The level of testing and certification that the flight control software for a commercial aircraft is subjected to far exceeds any other industry I'm familiar with. (I'm willing to be educated on nuclear power control software however.)

    The actual problem on the Qantas jet was a latent defect that was exposed by a software upgrade to another system. So the bug was there for a long time and I'm sure there are still others waiting to be found. But this doesn't stop me getting on a jet at least twice a week.

    As a software professional and a nervous flyer, do problems with the aircraft software scare me? No, not really. What scares me is the airline outsourcing maintenance to the lowest bidder in China, the pilots not getting enough break time, the idiotic military pilot who ignores airspace protocol, and the lack of English language skills in air traffic controllers and cockpit crew across the region where I fly (English is the international standard for Air Traffic Control).

    A good friend is a senior training captain on A330s, and in all the stories he tells, software is barely mentioned. What gets priority in the war stories is the human factors and general equipment issues - dead nav aids, dodgy radios, stupid military pilots. One software story was an Airbus A320 losing 2 1/2 out of 3 screens immediately after takeoff from the old Hong Kong airport. The instructions on how to clear the alarm condition and perform a reset were on the "dead" bottom half of one of the screens.

    A great example of software doing its job is TCAS - the Traffic Collision Avoidance System (http://en.wikipedia.org/wiki/Traffic_collision_avoidance_system). To quote my friend: "If it had lips, he'd kiss it". It's saved his life, and the lives of hundreds of passengers, at least twice. Both times through basic human error on the part of the pilot of the other aircraft.

    One final thought: on average, about 1000 people die in commercial aviation incidents each year worldwide (source: aviation-safety.net). In the USA, over 30,000 people die in vehicle accidents every year.
  • by angel'o'sphere (80593) on Tuesday December 20, 2011 @04:39AM (#38432100) Homepage Journal

    injuring 12 people seriously and causing 39 to be taken to the hospital.

    That is why you keep your safety belt fastened.
    If you don't like the feeling, loosen it a bit, but keep it closed.
    I really wonder why people keep taking such pointless risks and unbuckle the seat belt right after takeoff.

  • by Just Brew It! (636086) on Tuesday December 20, 2011 @09:37AM (#38434172)

    If you saw the procedures required to get airworthiness certification from the FAA for a critical piece of software, you would shake your head in disbelief. It is almost all about ensuring that every line of code is traceable to (and tested against) a formal requirement somewhere. In spite of greatly increasing software development costs (due to the additional documentation and audit trails required), the procedures do amazingly little to ensure that the requirements or code are actually any good, or that sound software engineering principles are employed. It does not surprise me that GIGO situations occasionally arise -- it is perfectly plausible that a system could meet the certification criteria but shit's still busted because the formal requirements didn't completely capture what needed to happen.

    The cost of compliance can also warp the process. A co-worker once told me a story about an incident that happened years ago at a former employer of his. A software system with several significant bugs was allowed to continue flying because the broken version had already received its FAA airworthiness certification. A new version which corrected the bugs had been developed, but getting the new version through the airworthiness certification process again would've been too costly; so the broken version was allowed to continue flying.

    Look up "DO-178B" sometime if you're curious...
