Airbus A350 Software Bug Forces Airlines To Turn Planes Off and On Every 149 Hours (theregister.co.uk)
An anonymous reader quotes a report from The Register: Some models of Airbus A350 airliners still need to be hard rebooted after exactly 149 hours, despite warnings from the EU Aviation Safety Agency (EASA) first issued two years ago. In a mandatory airworthiness directive (AD) reissued earlier this week, EASA urged operators to turn their A350s off and on again to prevent "partial or total loss of some avionics systems or functions." The revised AD, effective from tomorrow (26 July), exempts only those new A350-941s which have had modified software pre-loaded on the production line. For all other A350-941s, operators need to completely power the airliner down before it reaches 149 hours of continuous power-on time.
Concerningly, the original 2017 AD was brought about by "in-service events where a loss of communication occurred between some avionics systems and avionics network" (sic). The impact of the failures ranged from "redundancy loss" to "complete loss on a specific function hosted on common remote data concentrator and core processing input/output modules." In layman's English, this means that prior to 2017, at least some A350s flying passengers were suffering unexplained failures of potentially flight-critical digital systems.
Huh.... (Score:5, Funny)
Who knew Airbus was running Windows 95?
Re:Huh.... (Score:5, Funny)
Re:Huh.... (Score:5, Funny)
I left it sitting at the blue screen for 148 hours.
Re:Huh.... (Score:5, Funny)
Re:Huh.... (Score:5, Insightful)
Re: (Score:2)
True, it's also something commonly used in all highly resilient platforms.
You design and develop the best you can, but despite the best quality tests you will always have some issues, so you build failure into your design and end up with a way to return to a known state after a number of transactions (a reboot, in this case). After years in the software industry, that's surely the only way to get the results you expect. And if you design for this reboot, all is fine; it's not even visible to the end user.
Re:Huh.... (Score:5, Insightful)
Rebooting, under the name "rejuvenation", is actually a standard technique for maintaining the integrity of high-assurance systems
Using a "reboot" to ensure things are as you think they are is one thing. Being forced to reboot because your shit will break if you don't is another. They are NOT equivalent. One is nice but not necessary, while the other is absolutely necessary or it will be guaranteed to fail.
Totally different and nowhere NEAR the same thing.
Re: (Score:2)
Actually, they are. Let's say you have a system where an integrated component degrades by 10% every 24 hours, or at least a fixed 8-12% degradation in a 24-hour time period (lots of evaluation and modelling elided). This means you can run it for about a week before you experience a fault. Your fault isolation and recovery process then is to specify a rejuvenation interval of, say, 48 hours to deal with it and you're fine. The risk is mitigated and you move on.
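The arithmetic behind that rejuvenation interval can be sketched roughly as follows. This is a minimal illustration, assuming degradation accumulates linearly and that a fault occurs only once it reaches 100%; the function names and the 100% threshold are hypothetical and not part of the comment above.

```python
def hours_to_fault(rate_pct_per_day: float, fault_threshold_pct: float = 100.0) -> float:
    """Hours of continuous operation before degradation reaches the threshold.

    Assumes linear accumulation (an illustrative simplification).
    """
    return fault_threshold_pct / rate_pct_per_day * 24.0


def safety_margin(rejuvenation_interval_h: float, worst_rate_pct_per_day: float) -> float:
    """Ratio of worst-case time-to-fault to the chosen rejuvenation interval."""
    return hours_to_fault(worst_rate_pct_per_day) / rejuvenation_interval_h


if __name__ == "__main__":
    # The 8-12% per day band quoted in the comment.
    for rate in (8.0, 10.0, 12.0):
        print(f"{rate:4.1f}%/day: fault after {hours_to_fault(rate) / 24:.1f} days")
    # A 48-hour interval against the worst case of 12% per day.
    print(f"margin at 48 h: {safety_margin(48.0, 12.0):.1f}x")
```

Under these assumptions the worst case (12% per day) faults after roughly 8.3 days, which matches the "about a week" estimate, and a 48-hour interval leaves a margin of about four.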
Rebooting because $warm_fuzzies is voodoo. K
Re: (Score:2, Offtopic)
149 hours is 536,400 seconds. That exceeds the 512k of RAM.
Re:Huh.... (Score:5, Informative)
FWIW: 149 hours is roughly how long an unsigned 32-bit counter incrementing at 8 kHz takes to overflow.
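That figure is easy to check. The sketch below only verifies the arithmetic; whether the A350 actually uses such a counter is speculation in this thread, and the constant and function names are illustrative.

```python
COUNTER_BITS = 32   # unsigned 32-bit counter, as suggested in the comment
TICK_HZ = 8_000     # 8 kHz tick rate, as suggested in the comment


def rollover_seconds(bits: int, tick_hz: int) -> float:
    """Seconds until an unsigned counter of the given width wraps to zero."""
    return (2 ** bits) / tick_hz


if __name__ == "__main__":
    seconds = rollover_seconds(COUNTER_BITS, TICK_HZ)
    hours, remainder = divmod(seconds, 3600)
    print(f"{seconds:.3f} s = {int(hours)} h + {remainder:.3f} s")
```

Because 8000 = 2^3 x 1000, 2^32 ticks at 8 kHz is the same duration as 2^29 milliseconds, so the two overflow theories in this thread point at the same number.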
it IS windows 95... (Score:1)
Re:Huh.... (Score:4, Informative)
Re: (Score:3)
Did anybody ever trigger that bug ? My Windows 95/98 was crashing 4 times a day, never mind having it on for 49 days.
Re: (Score:2)
Did anybody ever trigger that bug ? My Windows 95/98 was crashing 4 times a day, never mind having it on for 49 days.
I ran into it all the time at work since we had ad hoc servers running Windows 95/98.
Re: (Score:2)
>Did anybody ever trigger that bug ?
Not during testing before it was released...
Even once it was out, it took a while before anyone put together that hard upper limit of 50 days for systems that *did* stay up...
hawk
Re: (Score:2)
I managed to see it in action, on a PC that did nothing but run a scanner that was used only sporadically. When I figured out that it was close to the 49.7-day mark, I made sure I was there to watch it go down in flames. I was expecting to see it bluescreen at the right moment, but alas nothing happened! I then tried the mouse, and as soon as I clicked on something it went BSOD.
I've also managed to get a Windows Vista system all the way to the 497 day bug that doesn't actually crash the computer, but end
Re: (Score:2)
I think the longest uptime I ever got from 95/98 was a little over 3 days. I'd turn off the computer at night because if you left it on, it was a 5
LOL (Score:1)
Still running Win95 I see. What a joke.
Re:Who knew? (Score:4, Funny)
If it was running Linux, the pilots would all be bragging about their uptimes.
Re: (Score:2)
While writing a bash script to fly to the destination.
....more geek stuff (Score:2)
And writing a self-modifying Perl one-liner to handle the speaker announcement.
And complaining about their meals not being perfectly cooked and suspecting this is due to the microwave running a systemd-based Linux distro.
Re: (Score:2)
Damn nerds.
Re: (Score:2)
If it was running Linux, the pilots would all be bragging about their uptimes.
737 Max pilots would not be.
Huh (Score:2, Insightful)
Maybe they should ground them all until they actually fix the problem.
What is with these airplane manufacturers and their seemingly blasé attitude towards flight safety?
Re: (Score:1)
Did you expect regulations and government agencies to protect you?
Re: (Score:1)
lolz right, "don't forget to reboot the thing every 8th day or you may fall out of the sky"
yup, PROTECTION
what they're protecting is SHAREHOLDER VALUE
Re: (Score:1)
No plane flight lasts 8 days. Just reboot before takeoff. Add it to the checklist.
Re: (Score:2)
No plane flight lasts 8 days. Just reboot before takeoff. Add it to the checklist.
Setting uptime records doesn't seem to be something really positive in this environment anyway. I don't want my plane to be spending any nines in the air.
Re: (Score:1)
What have jumbo jets got to do with this? Please, stay out of conversations you don't understand.
Re: (Score:1)
I am sure all the families with loved ones buried by the airlines are cheering you on!
If the bug was recent, perhaps you would have a leg to stand on, but according to the article bugs like these have been around for multiple years.
There is a difference between protecting someone and warning them to not walk over a cliff. Protection takes action and enforcement, not words, bulletins, warnings, and alerts.
Even the Max had to crash twice before they did anything. I would say your faith in government is misplaced.
Re: (Score:2)
Infinitely more so than the companies themselves.
Just a hair before 2^29 milliseconds (Score:1)
2^29 milliseconds is 149 hours and 470.912 seconds. Perhaps an overflow?
Re: (Score:2)
Almost certainly. Chances are device 1 is sending data to device 2, and each packet has a timestamp that must be strictly increasing. Device 1 generates that timestamp by using a counter that overflows, so it starts sending out packets with timestamps around 0, 1, 2, ... etc again. Device 2 then says, "hey, I've already received packet 4 billion, so I should ignore packet 1" and suddenly device 1 is being ignored by the network.
Ways around include: Increasing the size of the timestamp (2^64 milliseconds i
Re: Just a hair before 2^29 milliseconds (Score:2)
The correct approach is to have a protocol with multiple message types, one of which would be a message indicating a serial number reset. That's absolutely standard practice in high volume message systems. So standard that it's unlikely that out-of-sequence messages are even related to the problem.
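The failure mode speculated about in this sub-thread can be reproduced with a toy model. The sketch below is not the A350's protocol (nobody here knows what that looks like) and it is not the reset-message approach proposed above; it contrasts a naive "strictly increasing" check with wrap-aware comparison in the style of serial number arithmetic (as in RFC 1982), assuming a 32-bit counter. All names are hypothetical.

```python
MOD = 2 ** 32    # assumed 32-bit timestamp counter
HALF = 2 ** 31   # half the counter range, used by the wrap-aware check


def naive_is_newer(ts: int, last: int) -> bool:
    """Strictly-increasing check; breaks when the counter wraps to zero."""
    return ts > last


def serial_is_newer(ts: int, last: int) -> bool:
    """Wrap-aware check: ts is newer if it is ahead by less than half the range."""
    diff = (ts - last) % MOD
    return 0 < diff < HALF


def simulate(is_newer, start: int, count: int) -> int:
    """Send `count` packets with consecutive timestamps; return how many are accepted."""
    last = start
    ts = start
    accepted = 0
    for _ in range(count):
        ts = (ts + 1) % MOD          # sender's counter wraps at 2**32
        if is_newer(ts, last):       # receiver's acceptance rule
            last = ts
            accepted += 1
    return accepted


if __name__ == "__main__":
    start = MOD - 3  # three ticks before the wrap
    print("naive accepted: ", simulate(naive_is_newer, start, 6))
    print("serial accepted:", simulate(serial_is_newer, start, 6))
```

With the naive rule the receiver accepts the two packets sent before the wrap and then ignores the sender; with the wrap-aware rule all six are accepted. The wrap-aware rule trades this for a different limitation: it cannot tell a wrap from a genuinely stale packet that is more than half the counter range old.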
Re: Still not as bad (Score:1)
Well, the 787 Dreamliner, a completely new plane rather than a death trap built by reusing an old airframe and sticking new nacelles on it like the 737 Max 8, had a similar problem.
It had to be turned off and back on again every 248 days.
So it seems the FAA is not doing a good job of certifying planes....
https://it.slashdot.org/story/15/05/02/1240222/long-uptime-makes-boeing-787-lose-electrical-power
IT Crowd (Score:5, Funny)
Have you tried turning your plane off and on again?
Re:IT Crowd (Score:5, Interesting)
I was once a pax on a CRJ (aka barbie-jet) and as we were preparing to head onto the runway, the pilot comes onto the intercom and goes "so folks, we're getting something unexpected from our flight computers. We're going to reboot the jet and see what happens." so they power cycled the jet, and off we went. Didn't give a whole lot of confidence.
Re: (Score:3)
Hah! Tesla owners reboot their cars while driving! Press and hold both steering wheel "nipples" and wait for the screen to turn off.
Re: (Score:1)
I have one (Model 3), and have done this (reboot while driving), it's freaky when the screen goes black and all the sounds stop.
BUT, it's only the entertainment and instruments that go dark. All the driving functions still operate normally; brakes, steering, headlights, brake lights, turn signals, etc.
If you want to do a full power off reset (yes, it's a thing) you have to be parked.
Re: (Score:2)
...oh, hi Mom... uh.. uh hu.... yeah.. have you tried turning it on and off again?
Needs more (Score:2)
Re:Needs less (Score:2)
That's Comforting (Score:2)
Nice. Now I have to also try not to worry about if they rebooted the plane in the last 149 hours the next time I fly.
--
I believe that tomorrow is another day and I believe in miracles. -- Audrey Hepburn
Well, it's a plane (Score:5, Funny)
You don't really want 100% uptime anyway.
Re: (Score:3)
You don't really want 100% uptime anyway.
Up time... I see what you did.
Re: (Score:2)
On the other hand, you don't want rapid unplanned downtime either...
It's TIME FOR MORE QA with aircraft software (Score:2)
It's TIME FOR MORE QA with aircraft software.
Maybe even some independent QA mandated by the FAA.
and how bad will the 2038 bug be? (Score:2, Interesting)
and how bad will the 2038 bug be?
Re: (Score:2)
and how bad will the 2038 bug be?
**WILL**?? What makes you think this isn't an early indication of it?!?
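For reference, the 2038 problem is a different kind of overflow from the uptime counters discussed elsewhere in this thread: it affects systems that store time as a signed 32-bit count of seconds since the Unix epoch. A minimal sketch of where that count runs out (the function names are illustrative):

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)
MAX_INT32 = 2 ** 31 - 1   # largest value of a signed 32-bit integer
MIN_INT32 = -(2 ** 31)    # value a two's-complement counter wraps to


def last_representable_moment() -> datetime:
    """Latest UTC time a signed 32-bit seconds-since-epoch value can hold."""
    return EPOCH + timedelta(seconds=MAX_INT32)


def moment_after_wrap() -> datetime:
    """UTC time the value represents one second later, after wrapping negative."""
    return EPOCH + timedelta(seconds=MIN_INT32)


if __name__ == "__main__":
    print("last good second:", last_representable_moment().isoformat())
    print("after wraparound:", moment_after_wrap().isoformat())
```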
2^29 milliseconds? Boeing still in the 32 bit era? (Score:2)
Re:2^29 milliseconds? Boeing still in the 32 bit er (Score:5, Informative)
1. The plane in question is manufactured by Airbus, not Boeing.
2. Of course they are using 32 bit. It is an embedded system; most processors are 32-bit ARMs or 8-bit chips.
3. The plane in question started manufacture in 2010 (Wikipedia), and the subsystem design would have preceded this by years. Arm didn't release its 64-bit architecture until 2011.
4. A 32-bit count of milliseconds overflows after about 49.7 days, a long way from 149 hours. 149 hours does correspond to a 32-bit counter and an 8 kHz clock, though.
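The uptime limits mentioned across this thread can be compared side by side. In the sketch below, the 1 kHz and 8 kHz rows come from the comments; the 100 Hz tick rates used for the 497-day and 248-day rows are assumptions chosen because they reproduce those figures, not confirmed details of any of the systems named.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Counter:
    label: str
    bits: int      # number of value bits before the counter wraps
    tick_hz: int   # increments per second

    def rollover_seconds(self) -> float:
        """Seconds of uptime before the counter wraps."""
        return (2 ** self.bits) / self.tick_hz


COUNTERS = [
    Counter("32-bit, 1 kHz (milliseconds)", 32, 1_000),
    Counter("32-bit, 8 kHz", 32, 8_000),
    Counter("32-bit, 100 Hz (assumed)", 32, 100),
    Counter("signed 32-bit, 100 Hz (assumed)", 31, 100),
]


def describe(counter: Counter) -> str:
    """Format the rollover period in both hours and days."""
    seconds = counter.rollover_seconds()
    return f"{counter.label:34s} {seconds / 3600:10.2f} h {seconds / 86400:8.2f} days"


if __name__ == "__main__":
    for counter in COUNTERS:
        print(describe(counter))
```

The rows work out to roughly 49.7 days, 149.1 hours, 497.1 days and 248.6 days respectively.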
Re: (Score:2)
Free Solution: Run this code (Score:4, Funny)
#!/bin/bash
line="echo * */149 * * * /sbin/reboot"
(crontab -u root -l; echo "$line" ) | crontab -u root -
Re: (Score:1, Insightful)
My solution:
#!/bin/bash
line="echo * */149 * * * /sbin/reboot"
(crontab -u root -l; echo "$line" ) | crontab -u root -
That will never be approved by FAA. The software in a plane has to be reliable, which is why it has to meet some rather strict requirements. A plane is only allowed to start programs and allocate or free memory while on the ground. We can't have a plane crash because malloc failed. You use "dangerous" commands like echo, which is actually not allowed.
If you are puzzled as to why you can't use echo on a plane due to safety, but you can fly with a plane, which crashes after 149 hours, then that's the airline ind
Re: (Score:3)
That will never be approved by FAA. The software in a plane has to be reliable, which is why it has to meet some rather strict requirements. A plane is only allowed to start programs and allocate or free memory while on the ground. We can't have a plane crash because malloc failed. You use "dangerous" commands like echo, which is actually not allowed.
Could it be more obvious that I was joking??
The solution is obvious: Fix the goddamned code.
Re: (Score:1)
Before/After (Score:2)
> Rebooted after exactly 149 hours
> ...rebooted before it reaches 149 hours
I really, really wish people would stop doing that.
goes to show (Score:2)
this just goes to show that no matter where software is being used, it's all rubbish.
it's funny because i have written a lot of programs/tools and of course they also contain bugs, mostly corner cases, but that's no excuse because when you hit such a case you've got problems.
the few times this happens, my boss pulls out this speech about how much better software is developed for cars & airplanes and that we should try to mimic the same quality, etc.
as far as i can see, the quality isn't much better th
Article omits key facts - FUD (Score:3, Interesting)
The original Airworthiness Directive (linked in the article) references a Service Bulletin which defines the necessary updates.
Basically, a patch has been available since 14 August 2018. The Directive no longer applies as soon as an airline installs those patches. Seems like Boeing is trying to spread FUD...
Quotes from the Directive:
Re: (Score:2)