Windows Upgrade, FAA Error Cause LAX Shutdown
fname writes "The recent shutdown of LAX due to an FAA radio outage was apparently caused by a Windows 2000 integration flaw, possibly related to an old Windows 95 bug. An article at the LA Times claims that the outage was caused by human error, as the system will automatically shut down after 49.7 days (related to this Windows 95 flaw?), and a technician didn't reboot the system monthly as he should have. This happened after an upgrade from Unix to Windows. I don't think blame should be assigned to the technician who missed the task; rather, it seems a gross oversight for the FAA to guarantee that such a critical system will crash after only one missed maintenance task. Who's really at fault?"
A hit for the other team... (Score:4, Interesting)
This mess was big enough that there's a large enough supply of blame to give some to everybody involved.
- No system should require a manual reboot on a regular basis... there should at least be a script capable of accomplishing that. But somehow, one got implemented. Blame whoever bought it.
- Windows shouldn't have had a flaw that required monthly reboots. Blame Microsoft.
- Somebody should have done the reboots like they were told to. Blame that poor schmuck.
Bottom line is that everybody's at fault because had any one piece in the chain done their job properly the failure wouldn't have happened, but a cascade of mistakes led to the ball hitting the grass instead of a glove.
Re:Anyone want to clue them in to scheduled jobs? (Score:4, Interesting)
I've got an NT server that would hang after 2 weeks. I set up an at job to restart that service nightly and do not have that problem.
I've also got several linux servers that just plain run (and some NT/2000 servers as well).
That being said, rebooting sometimes does clear up many evils. We have a speakerphone (around 10 years old - no OS) that just wouldn't work one day. After looking at it, I unplugged it and plugged it back in (I rebooted it!) and it worked. No good reason, it just helps.
32 bit timer (Score:5, Interesting)
Check out this little pile of bullshit (Score:5, Interesting)
The system offers unprecedented voice quality, touch-screen technology, dynamic reconfiguration capabilities to meet changing needs, and an operational availability of 0.9999999
Okay, bullshit. If I have to reboot a server every month, .0000001 of a month is- oh, let's be generous and only count months with 31 days- about .26 seconds. That's a damned fast boot time for Win2K.
Maybe they left off a percent sign?
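For anyone who wants to check the arithmetic, here is a minimal sketch of the downtime budget implied by an availability figure. The function name and the 31-day period are just illustration choices, not anything from the Harris press release.

```python
def allowed_downtime_seconds(availability: float, period_days: float) -> float:
    """Seconds of downtime permitted over a period at a given availability."""
    period_seconds = period_days * 24 * 60 * 60
    return (1.0 - availability) * period_seconds


if __name__ == "__main__":
    # Seven nines over a generous 31-day month.
    budget = allowed_downtime_seconds(0.9999999, 31)
    print(f"{budget:.3f} seconds per 31-day month")  # prints 0.268
```

So a single monthly reboot would have to finish in roughly a quarter of a second to stay inside that number.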
We used to joke (Score:4, Interesting)
I also love the statement that the system was upgraded from UNIX to Windows. Isn't this kind of like upgrading from being in very good health but not being good looking to being somewhat good looking but suffering from cancer, AIDS and heart disease?
Flaw left unfixed for too long? (Score:2, Interesting)
Re:In a related story (Score:2, Interesting)
In the middle of the movie, the screen did the classic "blue screen of death" and rebooted with the Windows logo. There were quite a few chuckles in the aircraft when the movie was restarted and then the jokes started flying about the plane running on Microsoft Windows....(uh..oh..we're going to crash!..no wait, that's just Microsoft Windows)
Lessons from other Aviation Authorities (Score:5, Interesting)
I worked for around 5 years in Air Traffic Control projects, both in delivery of radar processing and displays and in R&D for next generation systems.
Let me give you an overview of the failure approach of just one of those systems.
1) Everything on Unix, ruggedised releases of UNIX
2) Every box must be able to FAIL ON ITS OWN
3) Every box must have a direct replacement, or replacements, which carry the SAME LOAD.
4) ZERO total system downtime allowed, partial systems failures are allowed, but core systems must keep running.
5) 5 stages of power supply failure, double mains, double generation and lastly a great big warehouse of car batteries if all else fails.
6) 4 Years of testing of FULL system before live.
This is what is normal when safety is the primary concern. What the FAA decision sounds like is a cost driven process which chose the cheapest solution that "could" meet the requirements.
The idea of a safety critical (if it fails people could die) system that requires a reboot is fine in only one case: if it can be non-operational on a regular basis, in which case the reboot should be done in EVERY non-operational window (say every week). This is therefore okay for some hospital scanners that are certified for 12 hour runs. It's not okay for a 24/7 system that controls objects flying around at 500 miles an hour.
Welcome to the US... we will be landing slightly quicker than expected.
Seen this week at various airports (Score:5, Interesting)
1. Windows-based terminal used by the public to print tickets (I think) with a "you have chosen to download a file, what do you want to do with it: save, open" or similar (I don't recall the exact wording).
2. A windows-based machine that was part of the baggage scanning setup at Chicago-O'Hare going through a scandisk process. OK, this may have been due to operators turning the machine off using the power switch, but should not such a machine use a read-only boot drive/partition?
Do you feel more secure?
You insensitive clod (Score:5, Interesting)
I have the distinct, but sadly not unusual, pleasure of watching my company execute a brilliant strategy of:
Since becoming a PHB (although I still do architecture work - thankfully), I've found that mindless, boneheaded, sweeping decisions are usually driven by some empty-suit, bean-counting, incompetent, barely literate, sh!t-for-brains sycophant who found themselves in an executive position purely by accident. We're "encouraged" to support their "strategies". Indeed...
It's a much higher order PHB. Kinda like a 4th degree black-belt, but not.
Re:Anyone want to clue them in to scheduled jobs? (Score:5, Interesting)
What if that proven system is decaying out from under you? HDs failing, memory going bad... Tell you what, can you get me new boards for an IBM RT PC? I highly doubt it.
What about "olde" mainframes running assembler code? The pool of expertise is drying up... sometimes you need to pitch the hardware.
Yea, but... (Score:3, Interesting)
Software analyst -> LA Times reporter -> TechWorld reporter.
poor guy.. (Score:2, Interesting)
Second, it took several years to find that bug because most windows machines never made it to that 49.7 days, and if they did the users just assumed it was normal, because it is considered normal for windows to "lock up", freeze or whatever.
Third, replacing unix, known for its stability, with any variant of windows (known for instability) in a system where people's lives are at stake and then having this happen, the guys at LAX who decided to do this should be fired because they just risked a lot of lives and caused massive delays for travellers. In a political situation they would have to resign.
I remember a similar story about an Aegis-class cruiser stuck out in the ocean for three days because they decided to use windows. "Yea, that will work great during a war.."
*sigh* Microsoft has good lobby power and hires a fleet of sales people to keep selling their shod-ware that really should just be kept to mom and pop living rooms.
But then, this is the opinion of a guy who works only with linux and is sitting on an uptime on an openmosix cluster-leader (that also is my dev box) that looks like this:
19:03:06 up 319 days, 5:20, 3 users, load average: 1.28, 0.73, 0.37
eat your heart out LAX.. you got punk'd
Space Shuttle accidents and software bugs (Score:5, Interesting)
Re:Why not automate it? (Score:2, Interesting)
Re:Anyone want to clue them in to scheduled jobs? (Score:3, Interesting)
Yeah but maybe they should have replaced it with something that, you know, actually works...
I'm all for change, but I wouldn't swap my car for a brand new sparkling wheelchair, my haircut for a mullet, or my soul/self respect for a job writing VBScript. It just doesn't seem right, you know?
Re:Anyone want to clue them in to scheduled jobs? (Score:2, Interesting)
Re:Now even the submitters aren't reading the arti (Score:1, Interesting)
It's far too great a coincidence that a Windows machine should halt consistently after 40 some days, and that this same bug plagued the Windows operating system.
As you can read in the OP, he questions "could this be?", not "this is".
Suggest you pull your head out.
Uptime: From one of the article links (Score:5, Interesting)
Whoah! 7 nines uptime!
About 3 seconds of downtime per year.
Somebody is on drugs if they sold that. Somebody is on even stronger drugs if they bought that story.
"5 nines", for all intents and purposes, is as good as it gets, with "6 nines" seen as the holy grail. The top HA system I've ever dealt with (running a Telco's billing operation spanning 4 countries!) quoted a figure of 0.999996. To nobody's surprise, it did not run Windows.
Wonder how much their failure clause is going to set them back?
Not necessarily Windows' fault (Score:5, Interesting)
Re:If it's in the job description... (Score:3, Interesting)
Re:Seen this week at various airports (Score:3, Interesting)
The only drawback to XP Embedded, for my company at least, is that the Windows license costs us more than the solid-state drive that we run it from. Looking into Linux for new installations as an alternative, but it doesn't make much sense to replace strong, stable XP systems that never fail.
Re:But DON'T get into the habit of using reboot. (Score:3, Interesting)
I don't blame the OS per se... (Score:4, Interesting)
* Bill Gates and developers of Win2000 for the convoluted, kludgy API they designed for their OS
* Product managers at Harris--the crap-for-brains who actually thought changing out robust UNIX servers that weren't really THAT old with consumer-grade PCs running an unproven OS was an UPGRADE to a critical, safety related system. WHAT THE HELL WERE THEY THINKING? In one of the article links (the Harris press release), Harris touted SEVEN NINES reliability! If that was a criterion they should've NEVER considered Windows...Not even BillG himself would say Win2k could provide that sort of uptime!
* Retarded developers at Harris who used an API call that tracks milliseconds in a 32 bit integer despite the fact that bugs related to the use of said function call were WELL KNOWN by that time.
* Dough-heads at LAX and the FAA who, upon finding the error early in development, decided it was OK to rely on MANUAL MONTHLY REBOOTS as a workaround to a potentially fatal problem. They should've run the "upgraded" windows machines in parallel with the UNIX servers for much longer, and failing that they should've IMMEDIATELY restored the old UNIX servers to service as soon as the problem was discovered, and to refuse the upgrade (and revoke payment to Harris) until the problem was properly resolved (and NOT just worked around with a kludge like an email reminder to reboot, or a reboot script or a shutdown warning either).
I'm surprised that this sort of error got into such a critical system, and at the way it was handled. I would've certainly tested the new system in parallel for long enough to catch this sort of error and kept the old system around for longer as a standby (in my experience, replacements of critical systems were often tested in parallel for 3 months to a year). I also would've acted much more decisively in resolving the problem if it did slip through the cracks, given a system crash could put lives in danger.
Maybe my girlfriend's fear of flying is more justified than I thought if these are the kind of clowns we trust our safety to...
IBM product support kicks all ass. (Score:2, Interesting)
> highly doubt it.
I've actually dealt with IBM in the "we need support and replacement parts for legacy hardware" capacity before.
And yes, if you've bought IBM in a professional/enterprise capacity, you've also bought the support contract. And if you've bought the support contract (And if you didn't, you deserve to be fired. Why the hell would you pay the IBM premium except for their support?), you can get parts and expert support for damn near everything IBM's ever made; all the way back to card punches/readers, and farther I'd bet. Remember, when you buy IBM, you're buying a MTBF of thirty YEARS.
cya,
john
Downtime vs Failure (Score:5, Interesting)
It appears that in this case Windows has a flaw which they try to work around with routine maintenance during planned downtime.
In your case I would say you have planned downtime for routine maintenance to work around the need for an appropriate system to handle the work load.
I suppose what is the same between these two cases is that you both need to change your system to something that is more appropriate for the task at hand. And to be more specific in the FAA case, Windows should not be allowed for use in any application where life, limb, or property is at risk. Hmm, I suppose that may rule out just about every use.
burnin
Re:Repent, Sinners! (Score:4, Interesting)
Actually I was hit by the max 497 days uptime [jimohalloran.com] bug of Linux 2.4 (and with a desktop machine no less). The box at work did run for about 650 days, anyway well past the milestone of the halfway journey to a 2nd consecutive uptime reset. Then it was time for me to change rooms. I wasn't at the office that day and my co-worker just unplugged the box. Was I pissed or not? Yes I was.
Re:Now even the submitters aren't reading the arti (Score:3, Interesting)
The reboot was to reset the logic flaw in the MS system timer. Read my post here [slashdot.org] on it. It has affected other MS made apps on MS Windows 2000 servers. So if MS's programmers get affected by it, you can expect non-MS employed programmers to get affected too since they do not have the same level of access to the proprietary OS.
Re:Repent, Sinners! (Score:3, Interesting)
And we were small. I can only imagine what a big school with 30 to 50 thousand students would need done... Not to mention all the DOOM3 wads nowadays.
What if OSS gave them software? (Score:3, Interesting)
Or are the corporate powers that be so out of touch with reality that they wouldn't touch anything having to do with "open sores!"
Re:Anyone want to clue them in to scheduled jobs? (Score:3, Interesting)
I've got a script that pings my upstream router every 10 minutes. If it misses a ping, it waits 30 seconds and tries again. 2 missed pings, and it power cycles my DSL router, using an activehome box and an x10 appliance module.
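The retry logic described above can be sketched roughly as follows. This is not the poster's script: `ping` and `power_cycle` are hypothetical callables you would supply yourself (say, a wrapper around the system ping command and whatever drives the X10 module), and the 10-minute schedule is assumed to live in cron or a similar scheduler.

```python
import time
from typing import Callable


def watch_link(ping: Callable[[], bool],
               power_cycle: Callable[[], None],
               retry_delay: float = 30.0,
               sleep: Callable[[float], None] = time.sleep) -> bool:
    """Run one check of the upstream link.

    Returns True if the router was power cycled, False otherwise.
    """
    if ping():
        return False          # first ping answered, nothing to do
    sleep(retry_delay)        # missed one ping: wait, then try again
    if ping():
        return False          # link came back on its own
    power_cycle()             # two misses in a row: cycle the DSL router
    return True
```

Passing `sleep` in as a parameter is only there so the logic can be exercised without actually waiting 30 seconds.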
Re:Repent, Sinners! (Score:4, Interesting)
Re:Seen this week at various airports (Score:2, Interesting)
They _do_ use Duct tape and baling wire (Score:3, Interesting)
I was on the lucky team that *lost* the bidding for the replacement system; IBM's team were the poor bastards who won, and were stuck investing seven years into building an unbuildable replacement, pouring billions of dollars down the drain while being micromanaged by the FAA, who didn't know much about software design or reliability in spite of having a methodology that required producing 175 design documents over the optimistically 3-year design period.
49.7-day bug not exclusive to Windows. (Score:4, Interesting)
That said, the Windows GetTickCount() system call returns a timer value as a 32-bit count of milliseconds since the system was booted. Now, any good programmer knows better than to use GetTickCount() -- there are other, better, more robust ways to tell time in Windows -- but it would not surprise me if a newbie had made the mistake of using this system call in the LAX software, thus leading to the problems.
In other words, the Windows timer is not at fault, but it is possible that one of the programmers was confused by the convoluted Win32 API and made a programming error as a result.
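To make the kind of mistake concrete, here is a small Python simulation of a 32-bit millisecond counter like the one GetTickCount() returns. It is only a sketch: nobody outside the project knows what the LAX code actually did, and the function names here are made up for illustration. The point is that comparing raw tick values breaks near the wrap, while subtracting modulo 2^32 does not.

```python
MASK32 = 0xFFFFFFFF  # a 32-bit counter wraps back to 0 after this value


def naive_deadline_passed(now_ticks: int, deadline_ticks: int) -> bool:
    """Buggy: compares raw tick values, so it misfires across the wrap."""
    return now_ticks >= deadline_ticks


def elapsed_ms(now_ticks: int, start_ticks: int) -> int:
    """Wrap-safe: unsigned 32-bit subtraction gives the true elapsed time
    as long as less than 2**32 ms have really passed."""
    return (now_ticks - start_ticks) & MASK32


def timeout_expired(now_ticks: int, start_ticks: int, timeout_ms: int) -> bool:
    return elapsed_ms(now_ticks, start_ticks) >= timeout_ms


if __name__ == "__main__":
    start = 0xFFFFFF00                      # 256 ms before the counter wraps
    deadline = (start + 1000) & MASK32      # wraps around to a small number
    print(naive_deadline_passed(start, deadline))   # True: fires instantly
    print(timeout_expired(start, start, 1000))      # False: correct
```

In C the same trick is just subtracting two unsigned 32-bit values, which wraps the right way on its own.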
Re:Fire the Department of the Interior's IT staff. (Score:3, Interesting)
Re:It was the app, not the OS (Score:2, Interesting)
Can you imagine? (Score:2, Interesting)
Re:Repent, Sinners! (Score:3, Interesting)
Commence writing.
Commence listening to music.
Commence shutdown procedure.
It works for everything!
Usage of "Start" instead of "Commence" probably has something to do with the majority of the population wondering who was graduating when they clicked the button...
Re:2K is based on NT kernel (Score:3, Interesting)
Just like the "Y2K glitch" was a platform independent problem based upon the 2-digit-year shorthand causing logical flaws, if you store time in a 32-bit variable by the millisecond... you'll hit the hard limit after about 49.7 days which is why that number can show up in kernels other than Win9x. If there's no proper handling of that rollover, things go haywire.
One interesting bit is that Quake 1 servers had problems running for more than 49.7 days for what I assume is precisely the same reason.
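The 49.7 figure falls straight out of the counter width and the tick rate. A quick sketch (the function name is made up, and the 100 ticks per second case is my assumption for the 497-day Linux 2.4 number mentioned elsewhere in the thread):

```python
SECONDS_PER_DAY = 86400


def rollover_days(counter_bits: int, ticks_per_second: int) -> float:
    """Days until an unsigned counter of the given width wraps to zero."""
    return (2 ** counter_bits) / ticks_per_second / SECONDS_PER_DAY


if __name__ == "__main__":
    print(f"{rollover_days(32, 1000):.1f}")  # millisecond ticks: 49.7
    print(f"{rollover_days(32, 100):.1f}")   # assumed 100 Hz ticks: 497.1
```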
Re:Repent, Sinners! (Score:3, Interesting)
Uh, no, no it could not.
Scheduled Tasks in Microsoft Windows have never been reliable. Quite frequently mine have their security credentials "screwed up" somehow and stop working until I notice and "touch" them so I'm forced to re-enter a user/pwd.
I have never EVER heard of Solaris cron failing to run on time.
> and not some poor person manually initiating it every night?
It's windows, you have to have a person present to ensure that the system actually a) goes down b) comes back up as intended.
I've done a half year consulting gig and spent a month walking 5 blocks through the downtown core of San Francisco at 5am every single FUCKING morning to hit the power button on a 4 way 400 MHz $50,000 Compaq windows box at one of the biggest banks in the world. Database held holdings information on around half a trillion dollars in equities.
MS EULA Says.... (Score:1, Interesting)
So ummm, unless MS suddenly created a hardened RTOS, why the fsck is this thing even running anywhere near ATC?
I say FIRE the morons who installed it, ordered it, designed it, and sold it... Finally, FINE the hell out of the asshole company that wrote it and allowed it to be sold for that use... I'd say $150/hr PER person inconvenienced by this debacle, PLUS whatever the airlines lost (or might have earned) PLUS a punitive sanction to make it fucking hurt bad enough that they'll realize that this can't ever occur again - I'd say $15 billion would do it...
A Win95 issue presents itself in W2K? (Score:1, Interesting)
I would find it hard to believe that they were installing Win9x OR that Win2K+ was affected by this bug as I have found no current documentation pointing this bug to an installed W2K+ OS.
Blah, blah, blah.
Re:Lessons from other Aviation Authorities (Score:1, Interesting)
Making incremental changes to the existing system, such as faster hardware, would almost certainly have met the requirements and been much cheaper than the solution chosen.
I think someone just got a bad case of "shiny thing" and thousands of travellers ended up paying (fortunately not with their lives).
Hypocrites? (Score:3, Interesting)
I'm throwing stones, now - especially after reading this incredibly long and geeky thread about shutting down your OS variants. God bless you for having multiple ways of shutting down/halting/suspending/restarting your computer in user/superuser/megauser/whosyourdaddyuser modes, but shame on you for being a stickler on MS's decision to place a Shutdown option on the "Start" menu when you can't even agree on how to shut your own damned computers down!
It's hypocritical, pharisaical, and parasitical (I like alliterations, even when they're not in context...makes me feel like Don King) to bring up such an argument as "Please press the Start button to shut down (stop) the computer". I'm not saying that "Start" is the most incredible choice for a button, but it makes sense. If you are shutting down your computer, you START THE SHUTDOWN PROCESS.
Old OS/2 Bug, Not Windows 95 (Score:2, Interesting)
IBM ran into the problem quicker, as OS/2 was adopted for various critical things like Automated Teller Machines (ATMs), while Windows NT was mostly used for simple file servers. As a result, the problem was fixed in OS/2 about 2 years before Microsoft got around to fixing the problem in Windows.
Considering that I remember this patch existing for Windows NT and 2000 back in 1999, it is disheartening that the FAA did not feel it necessary to upgrade to something as simple and critical as Service Pack 2 or 3.
Re:Repent, Sinners! (Score:3, Interesting)
I guess when you grow up and get out of high school, you will find that your linux box running as a DSL router is not a good example of a production server.
Yeah they did that to the Linux boxes here, because they didn't know better. Now, with real Linux experts, our Linuxen are not rebooted or taken down for routine maintenance. And no we aren't talking about "DSL Routers". We are talking about systems that process email to the tune of a million messages per server per day.
Critical? You bet it is. Merrill Lynch, HP, APL, and many others. Planned downtime for "regular maintenance"? Nope. The only time we plan downtime is for hardware replacement/upgrade and kernel upgrades, the occasional (rare) server moves, and full data center shutdowns to perform data center failover verification.
I guess when you grow up and get out of community college, you'll find that running a dormitory quake server is not a good example of a business critical production server.