Bug Australia Software Technology

Software Bug Caused Qantas Airbus A330 To Nose-Dive 603

Posted by Unknown Lamer
from the bugs-on-a-plane dept.
pdcull writes "According to Stuff.co.nz, the Australian Transport Safety Bureau found that a software bug was responsible for a Qantas Airbus A330 nose-diving twice while at cruising altitude, injuring 12 people seriously and causing 39 to be taken to the hospital. The event, which happened three years ago, was found to be caused by an airspeed sensor malfunction, linked to a bug in an algorithm which 'translated the sensors' data into actions, where the flight control computer could put the plane into a nosedive using bad data from just one sensor.' A software update was installed in November 2009, and the ATSB concluded that 'as a result of this redesign, passengers, crew and operators can be confident that the same type of accident will not reoccur.' I can't help wondering just how a piece of code, which presumably didn't test its input data for validity before acting on it, could become part of a modern jet's onboard software suite?"
  • by engun (1234934) on Tuesday December 20, 2011 @01:36AM (#38430862)
    Your post is full of FUD. A million people die annually because of human drivers. A driverless car killing half that many would still be an improvement.
    www.un.org/ar/roadsafety/pdf/roadsafetyreport.pdf
  • What? (Score:5, Informative)

    by Spikeles (972972) on Tuesday December 20, 2011 @01:37AM (#38430870)

    "I can't help wondering just how a piece of code, which presumably didn't test its input data for validity before acting on it, could become part of a modern jet's onboard software suite?" - pdcull

    What are you, some kind of person who doesn't read the actual articles or documents? Oh wait, this is Slashdot. Here, let me copy and paste some text for you:

    If any of the three values deviated from the median by more than a predetermined threshold for more than 1 second, then the FCPC rejected the relevant ADR for the remainder of the flight.

    The FCPC compared the three ADIRUs’ values of each parameter for consistency. If any of the values differed from the median (middle) value by more than a threshold amount for longer than a set period of time, then the FCPC rejected the relevant part of the associated ADIRU (that is, ADR or IR) for the remainder of the flight.

    So there you go, there actually really was validity checking performed. Multiple times per second in fact, by three separate, redundant systems. Unfortunately all 3 systems had the bug. Here is the concise summary for you:

    The FCPC’s AOA algorithm could not effectively manage a scenario where there were multiple spikes such that one triggered a memorisation period and another was present 1.2 seconds later. The problem was that, if a 1.2-second memorisation period was triggered, the FCPCs accepted the next values of AOA 1 and AOA 2 after the end of the memorisation period as valid. In other words, the algorithm did not effectively handle the transition from the end of a memorisation period back to the normal operating mode when a second data spike was present.
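
    For anyone who wants to see the shape of the check quoted above, here is a minimal Python sketch of median voting with a persistence threshold. The class name `MedianVoter`, the parameters `threshold` and `persist_samples`, and all the numbers are illustrative assumptions, not the FCPC's actual code or values. It also deliberately leaves out the 1.2-second memorisation period, which is where the real bug lived.

```python
from statistics import median


class MedianVoter:
    """Rejects a source whose value stays far from the median for too long.

    Hypothetical illustration only; thresholds and timing are made up.
    """

    def __init__(self, n_sources=3, threshold=1.0, persist_samples=3):
        self.threshold = threshold              # max allowed distance from median
        self.persist_samples = persist_samples  # consecutive bad samples before rejection
        self.deviant_count = [0] * n_sources
        self.rejected = [False] * n_sources

    def update(self, values):
        """Feed one sample per source; return the median of non-rejected sources."""
        active = [v for v, r in zip(values, self.rejected) if not r]
        mid = median(active)
        for i, v in enumerate(values):
            if self.rejected[i]:
                continue  # once rejected, a source stays rejected
            if abs(v - mid) > self.threshold:
                self.deviant_count[i] += 1
                if self.deviant_count[i] >= self.persist_samples:
                    self.rejected[i] = True
            else:
                self.deviant_count[i] = 0  # deviation must be continuous to count
        return mid


voter = MedianVoter()
# Source 1 (the middle column) starts spiking from the second sample onwards.
samples = [(2.0, 2.1, 2.0), (2.0, 50.0, 2.1), (2.1, 50.0, 2.0), (2.0, 50.0, 2.0)]
outputs = [voter.update(s) for s in samples]
```

    With these made-up inputs the spiking source never drags the voted value away from the two healthy ones, and it is marked as rejected once the deviation has persisted for three samples in a row.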

  • Re:Bad software (Score:5, Informative)

    by RealGene (1025017) on Tuesday December 20, 2011 @02:03AM (#38431044)

    My favorite is when we slammed a $20 million NASA/ESA probe into the surface of Mars at high speed because some engineer forgot to convert mph into kph (or vice-versa).

    No, it was when two different pieces of software were used to calculate thrust. The spacecraft software calculated thrust correctly in newton-seconds.
    The ground software calculated thrust in pounds force-seconds. This was contrary to the software interface specification, which called out newton-seconds.
    The result was that the ground-calculated trajectory was more than 20 kilometers too close to the surface.
    The engineers didn't "forget to convert", they failed to read and understand the specifications.

  • by jrumney (197329) on Tuesday December 20, 2011 @02:47AM (#38431310) Homepage

    But then it's the same kind of issue that's been blamed on an Air France jet crashing into the ocean - malfunctioning sensors, in that case ice buildup or so iirc, and as all sensors were of the same design this caused all of them to fail.

    The official report for that came out a week or so ago. The only effect that the malfunctioning sensors had in that case was to put the copilots back in control of the plane so they could proceed to attempt to climb above the limits of the aircraft, and continue to pull back on the stick all the way down to sea level after the stall warning started blaring.

  • by Animats (122034) on Tuesday December 20, 2011 @02:50AM (#38431322) Homepage

    How about reading the darned final report. [atsb.gov.au]

    I highly recommend that. It's a good read. This was not a sensor problem. The problem actually occurred in the message output queue of one of the CPUs, and resulted in sending data with the label for one data item with the data from another. The same hardware unit had demonstrated similar symptoms two years earlier, but the problem could not be replicated. This time, they tried really hard to induce the problem, with everything from power noise to neutron bombardment, and were unable to do so.

    There are several thousand identical hardware units in use, and one of the others demonstrated a similar problem, once. No other unit has ever demonstrated this problem. The investigators are still puzzled. The unit which produced the errors has been tested extensively and the problem cannot be reproduced. They considered 24 different failure causes and eliminated all of them. It wasn't a stuck bit. It wasn't program memory corruption. (The code gets a CRC check every few seconds.) The code in ROM was what it was supposed to be. Thousands of other units run exactly the same software. It wasn't a single flipped bit. It wasn't a memory timing error. It wasn't a software fault. It looked like half of one 32-bit word was combined with half of another 32-bit word during queue assembly on at least some occasions. But there are errors not explained by that.

    Very frustrating.
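
    To make the "half of one word combined with half of another" symptom concrete, here is a small Python sketch. The 16-bit label / 16-bit data layout, the label codes and the raw values are all hypothetical, chosen only to illustrate the idea; this is not the unit's real message format.

```python
def pack_word(label, data):
    """Pack a 16-bit label and a 16-bit data field into one 32-bit word.

    Hypothetical layout for illustration only: data in the high half,
    label in the low half.
    """
    return ((data & 0xFFFF) << 16) | (label & 0xFFFF)


def unpack_word(word):
    """Return (label, data) from a packed 32-bit word."""
    return word & 0xFFFF, (word >> 16) & 0xFFFF


def splice(word_a, word_b):
    """Combine the label half of word_a with the data half of word_b."""
    return (word_b & 0xFFFF0000) | (word_a & 0x0000FFFF)


LABEL_AOA, LABEL_ALTITUDE = 0x0001, 0x0002   # made-up label codes
aoa_word = pack_word(LABEL_AOA, 21)          # made-up raw values
alt_word = pack_word(LABEL_ALTITUDE, 37012)

# The faulty word carries the AOA label but the altitude payload.
bad_word = splice(aoa_word, alt_word)
label, data = unpack_word(bad_word)
```

    A receiver that trusts the label would read the altitude payload as an angle of attack, which is the kind of wildly wrong but well-formed value a downstream consistency check then has to catch.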

  • by oneblokeinoz (2520668) on Tuesday December 20, 2011 @03:00AM (#38431388)
    DISCLAIMER: I hate air travel, but do it most weeks.

    I have worked in and around the safety critical software industry for over 20 years. The level of testing and certification that the flight control software for a commercial aircraft is subjected to far exceeds any other industry I'm familiar with. (I'm willing to be educated on nuclear power control software however.)

    The actual problem on the Qantas jet was a latent defect that was exposed by a software upgrade to another system. So the bug was there for a long time and I'm sure there are still others waiting to be found. But this doesn't stop me getting on a jet at least twice a week.

    As a software professional and nervous flyer, do problems with the aircraft software scare me? No, not really. What scares me is the airline outsourcing maintenance to the lowest bidder in China, the pilots not getting enough break time, the idiotic military pilot who ignores airspace protocol, and the lack of English language skills in air traffic controllers and cockpit crew across the region where I fly (English is the international standard for Air Traffic Control).

    A good friend is a senior training captain on A330s, and in all the stories he tells software is barely mentioned. What gets priority in the war stories is the human factors and general equipment issues - dead nav aids, dodgy radios, stupid military pilots. One software story was an Airbus A320 losing 2 1/2 out of 3 screens immediately after takeoff from the old Hong Kong airport. The instructions on how to clear the alarm condition and perform a reset were on the "dead" bottom half of one of the screens.

    A great example of software doing its job is TCAS - the Traffic Collision Avoidance System (http://en.wikipedia.org/wiki/Traffic_collision_avoidance_system). To quote my friend, "If it had lips, I'd kiss it". It's saved his life, and the lives of hundreds of passengers, at least twice. Both times through basic human error on the part of the pilot of the other aircraft.

    One final thought - on average about 1000 people die in commercial aviation incidents each year worldwide (source: aviation-safety.net). In the USA, over 30,000 people die in vehicle accidents every year.
  • Re:Bad software (Score:3, Informative)

    by RealGene (1025017) on Tuesday December 20, 2011 @03:19AM (#38431496)

    They were just using a different set of units for the internal calculation, and then got bit by precision problems in the conversion. Basically, it was a rounding error. Theoretically, the details of the math could be fairly arbitrary as a "black box API." They just needed an infinite number of bits and it would have worked fine..

    No, a value produced in lbf-s was being interpreted as N-s. This was not rounding error; all ground calculations were off by a factor of > 4.
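
    The "factor of > 4" is just the pound-force to newton conversion. A quick Python sketch of the arithmetic, where the impulse value of 10.0 is a made-up number and the variable names are mine, not anything from the mission software:

```python
# 1 pound-force expressed in newtons (0.45359237 kg * 9.80665 m/s^2).
LBF_TO_NEWTON = 4.4482216152605


def lbf_s_to_n_s(impulse_lbf_s):
    """Convert an impulse from pound-force seconds to newton-seconds."""
    return impulse_lbf_s * LBF_TO_NEWTON


impulse_lbf_s = 10.0                        # hypothetical value, in lbf-s
correct_n_s = lbf_s_to_n_s(impulse_lbf_s)   # what the interface spec called for
misread_n_s = impulse_lbf_s                 # same number taken as if already in N-s

# How far off every calculation is when the conversion is skipped.
error_factor = correct_n_s / misread_n_s
```

    No amount of extra precision helps here: the number itself is fine, it is simply labelled with the wrong unit, so the error is a constant multiplier of about 4.45 rather than anything that looks like rounding.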

  • by syousef (465911) on Tuesday December 20, 2011 @03:50AM (#38431682) Journal

    I wouldn't be calling other people stupid. Altimeters are also used to maintain aircraft separation around busy airports, avoid bad weather etc. Your assertion that everything other than not hitting the ground is a use that is "just for fun" is ridiculous.

  • by Anonymous Coward on Tuesday December 20, 2011 @04:35AM (#38431826)

    In the AF447 accident, the junior copilot panicked under the sudden workload and kept pulling back on the control stick until the plane was beyond any hope. Neither of the copilots nor the captain (who was absent from the flight deck when the problems started) figured this out until the plane had stalled irrecoverably into a 60 degree angle-of-attack stall.

  • Re:Boeing vs Airbus (Score:5, Informative)

    by mjwx (966435) on Tuesday December 20, 2011 @05:05AM (#38431954)

    Interesting here would be some statistics. How many Boeings have come into serious trouble, and how many Airbuses?

    Besides the GP's point about Airbus pilots being unable to override the computer being complete and utter bollocks (Airbuses still have an analogue (electronic) actuator control in them), there have been a few near misses which, had it not been possible to take manual control, would have resulted in a crash, such as the JetBlue landing at LAX in 2005.

    On the other hand, there have been incidents with Boeing aircraft which, it is believed, would have been prevented by automated systems, such as AA flight 965 (Colombia 1995), where if the airbrake had been automatically retracted the pilot would have been able to climb away safely.

    Here is a good post on the subject. [askcaptainlim.com] According to the ATSB, who conducted the investigation, there have only been 3 such incidents in 128 million hours of A330 operation as of 2008. That is a damn good rate of failure, wouldn't you say? Pilot error is the cause of approx 48% of all accidents, Airbus or Boeing. Modern aircraft are getting safer all the time; they see more mechanics and engineers in a week than your car will see in its entire lifetime. Everything is checked and double checked, and anything suspicious gets replaced. I never think I'm in danger stepping onto an Airbus or Boeing aircraft.

    The whole Airbus vs Boeing argument is a dick pulling contest between biased pilots. It's like an Xbox/PS3 fanboy war. Utterly senseless to third party observers (and bronzed fingered PC gamers). Now amongst the 25 worst airlines you have a 1 in 850,000 chance of dying, and I don't fly any of those airlines (1 in 9.2 million for the 25 best), hence my practice of congratulating myself at the check-in counter as I've survived the most dangerous part of air travel, the drive to the airport. Compared to our road toll, our air toll is minuscule.

  • by Richard_at_work (517087) <richardprice.gmail@com> on Tuesday December 20, 2011 @05:53AM (#38432156)

    What a load of uninformed bullshit - Airbus has several levels of computer control, called laws, one of which is Direct Law which passes all inputs directly to the control surfaces. And if that isn't enough, they have mechanical backup controls for all surfaces on the flight deck, so even with a completely dead computer the aircraft is still flyable.

    You sir, are talking complete shit, but that seems to be normal when someone wants to put Boeing on a pedestal over Airbus.

    Let's go over some of your "mistakes"...

    The 787 isn't Boeing's first FBW aircraft; they have had one flying since the mid 1990s with the 777. The 787's system is an evolution of the 777's.

    AF447 didn't crash because of a computer problem, it crashed because of poor crew relationships in the cockpit - three pilots in that cockpit and not one was interested in what the others were doing. They didn't run basic check lists, they ignored other information, and the pilot flying did completely the wrong thing - the situation was completely survivable if they had carried out the correct procedures, except they didn't. The crash wasn't caused by the computer, it was caused by the pilot taking a stable aircraft and stalling it badly when nothing about the computer error forced him to do that.

  • by michelcolman (1208008) on Tuesday December 20, 2011 @06:34AM (#38432338)
    Except that the designers of the software didn't take all possible situations into account. For example, any Fly By Wire Airbus will automatically pitch up if speed increases too far above the maximum airspeed, even when flown manually. This may be a good idea when the airplane is diving (the most likely cause for overspeed), but not when it's straight and level with other traffic immediately above! This has already made several Airbus planes in heavy turbulence suddenly start to climb violently due to a sudden change in airspeed or temperature and overriding the pilot's MANUAL inputs while he's trying to avoid flying into other traffic! That's insane, and it's only one of many reasons why I can't wait to get off the Airbus fleet onto a more sensibly designed plane. (I'm currently an A320 pilot).
  • by michelcolman (1208008) on Tuesday December 20, 2011 @09:21AM (#38433296)

    If I had written VMO or MMO instead of "maximum airspeed", you wouldn't have understood what I wrote. Airplanes do have a maximum airspeed (airspeed being the speed relative to the air, as opposed to ground speed). Go too far above VMO, and the plane starts buffeting (a kind of vibration). Go a bit further, and you may lose control completely due to high speed stall, mach tuck, control reversal, etc...

    Airbuses do indeed have autothrottles, but engines react rather slowly so, while indeed reducing thrust, the flight control systems pull the nose up as well. They did so in one recent incident in my current company, and this had already happened before in several other companies. In one case, there was another plane 1000 feet above and the pilots managed to stop the climb after 700 feet.

    There are many possible reasons for a sudden increase in airspeed. Most of the time, it's due to a change in wind. If a 100 knot tailwind suddenly drops to 70 knots, you've just gained 30 knots of airspeed. But the true airspeed doesn't even have to change: in a recent incident in my company, the outside temperature changed by more than 10 degrees in a very short time which increased the mach number above MMO (because the speed of sound changes with temperature). The autopilot immediately disconnected and the flight control computers started a rather violent climb which the pilots could only recover from after climbing more than 500 feet.
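
    For the non-pilots, here is a rough Python sketch of why a temperature change alone can push the Mach number up while true airspeed stays put. It uses the standard ideal-gas formula for the speed of sound, a = sqrt(gamma * R * T). The true airspeed and the two temperatures are made-up numbers, and I'm assuming the temperature dropped, since that is the direction that raises Mach.

```python
import math

GAMMA = 1.4          # ratio of specific heats for air
R_AIR = 287.05       # specific gas constant for dry air, J/(kg*K)
KT_TO_MS = 0.514444  # knots to metres per second


def speed_of_sound(temp_kelvin):
    """Speed of sound in air in m/s (ideal-gas approximation)."""
    return math.sqrt(GAMMA * R_AIR * temp_kelvin)


def mach(tas_knots, temp_celsius):
    """Mach number for a given true airspeed and outside air temperature."""
    return tas_knots * KT_TO_MS / speed_of_sound(temp_celsius + 273.15)


tas = 470.0                  # hypothetical true airspeed in knots, held constant
m_warm = mach(tas, -45.0)    # hypothetical outside temperature before the change
m_cold = mach(tas, -55.0)    # 10 degrees colder, same true airspeed

mach_increase = m_cold - m_warm
```

    With these assumed numbers the Mach number rises by a bit under 0.02 without the plane going any faster through the air, which is plenty to cross a limit if you were cruising close to it.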

    So, you say you're a rocket surgeon? What kind of operations have you performed on them?

  • by Richard_at_work (517087) <richardprice.gmail@com> on Tuesday December 20, 2011 @12:49PM (#38436186)

    Sure - the flight in question was British Midland Flight 92.

    The situation I mentioned came about because the pilots shut down the right engine due to vibration and the presence of smoke in the cabin. Up to the 737NG line, cabin air was only taken from the right engine, so engine vibration plus smoke in the cabin suggested the right engine was at issue. However, in this case it was the left engine which had actually failed.

    While the left engine was failing, the autothrottle automatically adjusted the fuel flow into it in order to maintain the thrust levels from it - this had the effect of causing an asymmetric thrust selection between the left and right engines - however, the throttle lever actuator only selected a physical position for the right hand engine at its lower thrust level, meaning the autothrottle was actually selecting a higher value which was not indicated in the positions of the throttles (the two throttle levers are linked by one actuator when under autothrottle control, thus they can actually only show the thrust level of one of the engines - in this case, the right engine).

    When the pilots turned off the autothrottle to power down the right engine, the left engine's selected thrust returned to that of the physical position of its thrust lever - which had the effect of reducing the vibration to the point where the flight crew thought they had indeed turned off the correct engine when they shut down the right engine.

    When they were on approach to East Midlands Airport, they increased thrust on the left engine, which caused it to fail completely, and the aircraft crashed short of the runway.

    This system was highlighted in the crash report, along with a number of other issues with the 737NG design - Boeing did in fact have to ground a large number of aircraft before the solution was deployed to the delivered fleet. It was not the sole cause of the crash, but it was something that was heavily highlighted in the chain of events.
