Windows Upgrade, FAA Error Cause LAX Shutdown (862 comments)

Posted by michael on Tuesday September 21, 2004 @05:48PM from the first-woodpecker-to-come-along dept.

fname writes "The recent shutdown of LAX due to an FAA radio outage was apparently caused by a Windows 2000 integration flaw, possibly related to an old Windows 95 bug. An article at the LA Times claims that the outage was caused by human error, as the system will automatically shut down after 49.7 days (related to this Windows 95 flaw?), and a technician didn't reboot the system monthly as he should have. This happened after an upgrade from Unix to Windows. I don't think blame should be assigned to the technician who missed the task; rather, it seems a gross oversight for the FAA to guarantee that such a critical system will crash after only one missed maintenance task. Who's really at fault?"

This discussion has been archived. No new comments can be posted.

A hit for the other team... (Score:4, Interesting)

by LostCluster ( 625375 ) * writes: on Tuesday September 21, 2004 @05:52PM (#10313344)

When a ball drops on a baseball field at the midpoint between two positions, it's scored a "hit" for the opposition rather than an "error" against either player. Still, a hit for the other side is a bad thing for the entire team.

This mess was big enough that there's a large enough supply of blame to give some to everybody involved.

- No system should require a manual reboot on a regular basis... there should at least be a script capable of accomplishing that. But somehow, one got implemented. Blame whoever bought it.
- Windows shouldn't have had a flaw that required monthly reboots. Blame Microsoft.
- Somebody should have done the reboots like they were told to. Blame that poor schmuck.

Bottom line is that everybody's at fault, because had any one piece in the chain done its job properly the failure wouldn't have happened, but a cascade of mistakes led to the ball hitting the grass instead of a glove.

Re:Anyone want to clue them in to scheduled jobs? (Score:4, Interesting)

by TykeClone ( 668449 ) writes: <TykeClone@gmail.com> on Tuesday September 21, 2004 @05:54PM (#10313378) Homepage Journal

at sucks. Very, very much.
I've got an NT server that would hang after 2 weeks. I set up an at job to restart that service nightly and do not have that problem.
I've also got several linux servers that just plain run (and some NT/2000 servers as well).
That being said, rebooting sometimes does clear up many evils. We have a speakerphone (around 10 years old - no OS) that just wouldn't work one day. After looking at it, I unplugged it and plugged it back in (I rebooted it!) and it worked. No good reason, it just helps.

32 bit timer (Score:5, Interesting)

by charnov ( 183495 ) writes: on Tuesday September 21, 2004 @05:57PM (#10313419) Homepage Journal

This old error was from the use of a 32 bit 1 ms increment timer (comes out to 49.7 days until rollover). AFAIK, this was fixed in Win2k and above when the timer got bumped to 64 bit. Maybe whoever set up LAX was using some ancient legacy middleware that used the old timer. This is just bizarre. In both locations that I have worked the last three years, none of the Win2k or Win2k3 servers went down ever. Sounds like bad consultants.
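
The 49.7-day figure can be checked with a few lines of arithmetic. This is only a sketch of the rollover math for an unsigned counter that ticks once per millisecond; the function name and parameters are made up for illustration and say nothing about how the LAX software was actually written.

```python
# Milliseconds in one day.
MS_PER_DAY = 24 * 60 * 60 * 1000


def rollover_days(counter_bits: int = 32, tick_ms: int = 1) -> float:
    """Days until an unsigned counter of the given width wraps back to zero."""
    return (2 ** counter_bits) * tick_ms / MS_PER_DAY


if __name__ == "__main__":
    # A 32-bit counter at 1 ms per tick wraps after about 49.7 days.
    print(f"32-bit: {rollover_days(32):.1f} days")
    # The same counter widened to 64 bits will not wrap in any realistic uptime.
    print(f"64-bit: {rollover_days(64):.3g} days")
```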

Check out this little pile of bullshit (Score:5, Interesting)

by Trailer Trash ( 60756 ) writes: on Tuesday September 21, 2004 @05:58PM (#10313431) Homepage

The system offers unprecedented voice quality, touch-screen technology, dynamic reconfiguration capabilities to meet changing needs, and an operational availability of 0.9999999

Okay, bullshit. If I have to reboot a server every month, .0000001 of a month is- oh, let's be generous and only count months with 31 days- about .26 seconds. That's a damned fast boot time for Win2K.

Maybe they left off a percent sign?
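
For anyone who wants to redo the arithmetic above, here is a small sketch. It assumes the quoted 0.9999999 is a fraction of total time and that the period is one 31-day month; the function name is hypothetical.

```python
def downtime_budget_seconds(availability: float, period_days: float) -> float:
    """Seconds of downtime allowed in a period at the given availability."""
    period_seconds = period_days * 24 * 60 * 60
    return (1.0 - availability) * period_seconds


if __name__ == "__main__":
    budget = downtime_budget_seconds(0.9999999, 31)
    # Roughly a quarter of a second per 31-day month.
    print(f"allowed downtime per month: {budget:.3f} s")
```

Any reboot that takes longer than that budget already blows the quoted figure for the month, which is the point being made.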

We used to joke (Score:4, Interesting)

by multiplexo ( 27356 ) * writes: on Tuesday September 21, 2004 @05:59PM (#10313440) Journal

that no one would ever run into the 49.7 day bug on a Windows system because the chances of having that much uptime were slim to none. Having a system where you know that things are broken and you have to reboot it every 30 days to keep it from breaking down is a bad thing, deploying such a system into a production environment is even worse (but it's been done, I don't know how many times I wrote cron jobs to kill bad pieces of software and restart them) but deploying such a system in an environment where lives are at stake is completely inexcusable, regardless of whether or not it is closed or open source. This is similar to having a circuit in your house that overheats because occasionally too much load is placed on it. The idiot solution is to reset the breaker when it trips, the correct solution is to put in a bigger circuit that can handle the peak load. This vendor provided the idiot solution to this problem and should be punished for it, this never should have been deployed, I can only hope that they won't blame the technician for failing to do something that he wouldn't have had to do if the system had been designed properly.
I also love the statement that the system was upgraded from UNIX to Windows. Isn't this kind of like upgrading from being in very good health but not being good looking to being somewhat good looking but suffering from cancer, AIDS and heart disease?

Flaw left unfixed for too long? (Score:2, Interesting)

by Astro-pilot ( 765980 ) writes: on Tuesday September 21, 2004 @06:01PM (#10313463)

Was the flaw left unfixed for too long because they did not have access to the source code? Or was it because it was too expensive? If this is such a critical system that it can cause loss of life (on a massive scale, no less), the root cause should have been fixed, rather than the workaround. I remember reading somewhere that this flaw has now been fixed. Smells like a managerial issue within the FAA, not just a technician problem. Remember NASA and the space shuttles?

Re:In a related story (Score:2, Interesting)

by databank ( 165049 ) writes: on Tuesday September 21, 2004 @06:01PM (#10313465)

Actually there's a lot of truth to that..I once flew in an airliner overseas which had the tv screens built into the back of the seat in front of me.

In the middle of the movie, the screen did the classic "blue screen of death" and rebooted with the Windows logo. There were quite a few chuckles in the aircraft when the movie was restarted and then the jokes started flying about the plane running on Microsoft Windows....(uh..oh..we're going to crash!..no wait, that's just Microsoft Windows)

Lessons from other Aviation Authorities (Score:5, Interesting)

by MosesJones ( 55544 ) writes: on Tuesday September 21, 2004 @06:06PM (#10313523) Homepage

I worked for around 5 years in Air Traffic Control projects, both in delivery of radar processing and displays and in R&D for next generation systems.

Let me give you an overview of the failure approach of just one of those systems.

1) Everything on Unix, ruggedised releases of UNIX

2) Every box must be able to FAIL ON ITS OWN

3) Every box must have a direct replacement, or replacements, which carry the SAME LOAD.

4) ZERO total system downtime allowed, partial systems failures are allowed, but core systems must keep running.

5) 5 stages of power supply failure, double mains, double generation and lastly a great big warehouse of car batteries if all else fails.

6) 4 Years of testing of FULL system before live.

This is what is normal when safety is the primary concern. What the FAA decision sounds like is a cost driven process which chose the cheapest solution that "could" meet the requirements.

The idea of a safety critical (if it fails people could die) system that requires a reboot is fine in only one case... if it can be non-operational on a regular basis, in which case it should be done EVERY non-operational window (say every week); this is therefore okay for some hospital scanners that are certified for 12 hour runs. It's not okay for a 24/7 system that controls objects flying around at 500 miles an hour.

Welcome to the US... we will be landing slightly quicker than expected.

Seen this week at various airports (Score:5, Interesting)

by whoever57 ( 658626 ) writes: on Tuesday September 21, 2004 @06:07PM (#10313536) Journal

This week, while flying, I saw:
1. Windows-based terminal used by the public to print tickets (I think) with a "you have chosen to download a file, what do you want to do with it: save, open" or similar (I don't recall the exact wording).

2. A windows-based machine that was part of the baggage scanning setup at Chicago-O'Hare going through a scandisk process. OK, this may have been due to operators turning the machine off using the power switch, but should not such a machine use a read-only boot drive/partition?

Do you feel more secure?

You insensitive clod (Score:5, Interesting)

by rutledjw ( 447990 ) writes: on Tuesday September 21, 2004 @06:09PM (#10313562) Homepage
As a PHB, I resemble that remark! Clearly you do not appreciate the fine art which is combining management and technical decision-making. Neither does my parent corp.
I have the distinct, but sadly not unusual, pleasure of watching my company execute a brilliant strategy of:
1. Outsourcing Data Center Operations (systems that used to be down for seconds a year are now down for days and in some cases weeks per year)
2. Outsource development to India (which has been a mess I won't use the foul language to describe) _AND_
3. Squeeze remaining people to make up for items 1 and 2!
Since becoming a PHB (although I still do architecture work - thankfully), I've found that mindless, boneheaded, sweeping decisions are usually driven by some empty-suit, bean-counting, incompetent, barely literate, sh!t-for-brains sycophant who found themselves in an executive position purely by accident. We're "encouraged" to support their "strategies". Indeed...
It's a much higher order PHB. Kinda like a 4th degree black-belt, but not.
Re:Anyone want to clue them in to scheduled jobs? (Score:5, Interesting)

by mekkab ( 133181 ) writes: on Tuesday September 21, 2004 @06:10PM (#10313568) Homepage Journal

It's obviously lunacy for any company to replace a proven system, which has given years of reliable service with some piece of trash that crashes if left running for over a month

What if that proven system is decaying out from under you? HD's failing, memory going bad... Tell you what, can you get me new boards for an IBM RT pc? I highly doubt it.

What about "olde" mainframes running assembler code? The pool of expertise is drying up... sometimes you need to pitch the hardware.

Yea, but... (Score:3, Interesting)

by HaeMaker ( 221642 ) writes: on Tuesday September 21, 2004 @06:13PM (#10313592) Homepage

That information had been filtered at least three times, can't count on that either...

Software analyst -> LA Times reporter -> TechWorld reporter.

poor guy.. (Score:2, Interesting)

by joeldg ( 518249 ) writes: on Tuesday September 21, 2004 @06:14PM (#10313603) Homepage

Having to shut down a system to maintain its uptime is first a ridiculous idea.

Second, it took several years to find that bug because most windows machines never made it to that 49.7 days and if they did the users just assumed it was the normal because it is considered normal for windows to "lock up", freeze or whatever.

Third, replacing unix, known for its stability, with any variant of windows (known for instability) in a system where people's lives are at stake and then having this happen, the guys at LAX who decided to do this should be fired because they just risked a lot of lives and caused massive delays for travellers. In a political situation they would have to resign.

I remember a similar story about an Aegis-class cruiser stuck out in the ocean for three days because they decided to use windows. "Yea, that will work great during a war.."

*sigh* Microsoft has good lobby power and hires a fleet of sales people to keep selling their shod-ware that really should just be kept to mom and pop living rooms.

But then, this is the opinion of a guy who works only with linux and is sitting on an uptime on an openmosix cluster-leader (that also is my dev box) that looks like this:
19:03:06 up 319 days, 5:20, 3 users, load average: 1.28, 0.73, 0.37

eat your heart out LAX.. you got punk'd

Space Shuttle accidents and software bugs (Score:5, Interesting)

by BlueUnderwear ( 73957 ) writes: on Tuesday September 21, 2004 @06:17PM (#10313629)

Was at JAOO today, and on the closing panel discussion for the Test-Driven Development track, Mr Kevlin Henney was praising NASA's rigorous software testing procedures. He was so proud of them that he let out a "and in both space shuttle crashes, software was not to blame". Well, this may be correct if he was thinking only about the flight software... but there [theinquirer.net] is other [usatoday.com] software [kiwiblog.co.nz] than what rides in the shuttle itself...

Re:Why not automate it? (Score:2, Interesting)

by bstone ( 145356 ) writes: on Tuesday September 21, 2004 @06:18PM (#10313637)

I don't see the logic in a system being so critical to be working 24/7 that they force it to crash if the maintenance is missed. Does anyone else see a problem with this logic?

Re:Anyone want to clue them in to scheduled jobs? (Score:3, Interesting)

by FyRE666 ( 263011 ) * writes: on Tuesday September 21, 2004 @06:23PM (#10313673) Homepage

What about "olde" mainframes running assembler code? The pool of expertise is drying up... sometimes you need to pitch the hardware.

Yeah but maybe they should have replaced it with something that, you know, actually works...

I'm all for change, but I wouldn't swap my car for a brand new sparkling wheelchair, my haircut for a mullet, or my soul/self respect for a job writing VBScript. It just doesn't seem right, you know?

Re:Anyone want to clue them in to scheduled jobs? (Score:2, Interesting)

by LifesABeach ( 234436 ) writes: on Tuesday September 21, 2004 @06:27PM (#10313712) Homepage

Well, I guess I've seen a first here. The system was 'upgraded' to Windows 2000? The manager that made that decision has done more than any staff member at Bin-Laden University for the Scrambled of Brains.

Re:Now even the submitters aren't reading the arti (Score:1, Interesting)

by Anonymous Coward writes: on Tuesday September 21, 2004 @06:27PM (#10313714)

No, the OP is using something called "inference". In fact, I am inferring that the OP was inferring that the journalist reporting the article either doesn't understand the 32 bit rollover problem or does not want to report all the details required to describe the 32 bit rollover problem.

It's far too great a coincidence that a Windows machine should halt consistently after 40 some days, and that this same bug plagued the Windows operating system.

As you can read in the OP, he questions "could this be?", not "this is".

Suggest you pull your head out.

Uptime: From one of the article links (Score:5, Interesting)

by Mateito ( 746185 ) writes: on Tuesday September 21, 2004 @06:28PM (#10313727) Homepage

The system offers unprecedented voice quality, touch-screen technology, dynamic reconfiguration capabilities to meet changing needs, and an operational availability of 0.9999999.
Whoah! 7 nines uptime!
About 3 seconds of downtime per year.
Somebody is on drugs if they sold that. Somebody is on even stronger drugs if they bought that story.
"5 nines", for all intents and purposes, is as good as it gets, with "6 nines" seen as the holy grail. The top HA system I've ever dealt with (running a Telco's billing operation spanning 4 countries!) quoted a figure of 0.999996. To nobody's surprise, it did not run Windows.
Wonder how much their failure clause is going to set them back?
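
As a rough check on what a string of nines allows over a year, here is a small sketch. It assumes a 365-day year and reads "N nines" as an availability of 1 - 10^-N; the function name is invented for the example. By this arithmetic seven nines comes to only about 3 seconds a year, and the 0.999996 system to a couple of minutes.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # 31,536,000


def yearly_downtime_seconds(nines: int) -> float:
    """Allowed downtime per year for an availability of 1 - 10**-nines."""
    return SECONDS_PER_YEAR * 10.0 ** -nines


if __name__ == "__main__":
    for n in (5, 6, 7):
        print(f"{n} nines: {yearly_downtime_seconds(n):.2f} s/year")
    # The 0.999996 figure quoted for the telco billing system:
    print(f"0.999996: {SECONDS_PER_YEAR * (1 - 0.999996):.0f} s/year")
```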

Not necessarily Windows' fault (Score:5, Interesting)

by DunbarTheInept ( 764 ) writes: on Tuesday September 21, 2004 @06:29PM (#10313732) Homepage

While I hate MS as much as the next guy, this might not really be directly their fault. Unix systems are often installed with the instruction that they get reboots regularly. Often there is a problem that is caused by application code not the OS. If you have a memory leak in an application that runs and stays up all the time, it's going to cause the system to get horribly unusable in the long run regardless of whether it's UNIX or Windows. While a reboot might be overkill when it was just one application misbehaving, a reboot is a guaranteed way to kill and reset the responsible program no matter which one it is. At a previous place of employment we told the customer to do monthly reboots mainly because we didn't trust *our own* code to be that perfect.

Re:If it's in the job description... (Score:3, Interesting)

by pfleming ( 683342 ) writes: on Tuesday September 21, 2004 @06:32PM (#10313758) Homepage Journal

How many guns have you seen that fire on a monthly basis unless you 'prevented' it?

Re:Seen this week at various airports (Score:3, Interesting)

by Anonymous Coward writes: on Tuesday September 21, 2004 @06:33PM (#10313767)

It probably should. My company uses XP Embedded for a few systems, and doesn't have any software-related problems on them. Ever. The only problems we have are when people snap off antennae that we use for the wireless connections, or something similar. There's no reason that they shouldn't be using something like this to scan baggage. It sounds like someone at O'Hare didn't do their homework.

The only drawback to XP Embedded, for my company at least, is that the Windows license costs us more than the solid-state drive that we run it from. Looking into Linux for new installations as an alternative, but it doesn't make much sense to replace strong, stable XP systems that never fail.

Re:But DON'T get into the habit of using reboot. (Score:3, Interesting)

by drinkypoo ( 153816 ) writes: <drink@hyperlogos.org> on Tuesday September 21, 2004 @07:02PM (#10314054) Homepage Journal

The funny thing is that halt used to halt the system RIGHT GODDAMN NOW on most Unixes, and famously on Xenix. They called it haltsys and you typed sync twice before running it. The second one was just to give the system time to sync while your fingers were moving. Most Xenix systems didn't have much of a buffer (I had Xenix on a 286 with 1MB RAM, but the 386 product was of course much more popular) but they don't have much of a filesystem either. Anyway other elderly Unixes and Unix derivatives are simple like that too. Halt just halts, it doesn't stroke you first.

I don't blame the OS per se... (Score:4, Interesting)

by WebCowboy ( 196209 ) writes: on Tuesday September 21, 2004 @07:37PM (#10314336)

...but I blame a lot of people for carelessness and incompetence (except for the actual techie that forgot to reboot last month--that is an honest mistake).

* Bill Gates and developers of Win2000 for the convoluted, kludgy API they designed for their OS

* Product managers at Harris--the crap-for-brains who actually thought changing out robust UNIX servers that weren't really THAT old with consumer-grade PCs running an unproven OS was an UPGRADE to a critical, safety related system. WHAT THE HELL WERE THEY THINKING? In one of the article links (the Harris press release), Harris touted SEVEN NINES reliability! If that was a criteria they should've NEVER considered Windows...Not even BillG himself would say Win2k could provide that sort of uptime!

* Retarded developers at Harris who used an API call that tracks milliseconds in a 32 bit integer despite the fact that bugs related to the use of said function call were WELL KNOWN by that time.

* Dough-heads at LAX and the FAA who, upon finding the error early in development, decided it was OK to rely on MANUAL MONTHLY REBOOTS as a workaround to a potentially fatal problem. They should've run the "upgraded" windows machines in parallel with the UNIX servers for much longer, and failing that they should've IMMEDIATELY restored the old UNIX servers to service as soon as the problem was discovered, and to refuse the upgrade (and revoke payment to Harris) until the problem was properly resolved (and NOT just worked around with a kludge like an email reminder to reboot, or a reboot script or a shutdown warning either).

I'm surprised that this sort of error got into such a critical system, and at the way it was handled. I would've certainly tested the new system in parallel for long enough to catch this sort of error and kept the old system around for longer as a standby (in my experience, replacements of critical systems were often tested in parallel for 3 months to a year). I also would've acted much more decisively in resolving the problem if it did slip through the cracks, given a system crash could put lives in danger.

Maybe my girlfriend's fear of flying is more justified than I thought if these are the kind of clowns we trust our safety to...

IBM product support kicks all ass. (Score:2, Interesting)

by SvnLyrBrto ( 62138 ) writes: on Tuesday September 21, 2004 @07:49PM (#10314438)

> Tell you what, can you get me new boards for an IBM RT pc? I
> highly doubt it.

I've actually dealt with IBM in the "we need support and replacement parts for legacy hardware" capacity before.

And yes, if you've bought IBM in a professional/enterprise capacity, you've also bought the support contract. And if you've bought the support contract (And if you didn't, you deserve to be fired. Why the hell would you pay the IBM premium except for their support?), you can get parts and expert support for damn near everything IBM's ever made; all the way back to card punches/readers, and farther I'd bet. Remember, when you buy IBM, you're buying a MTBF of thirty YEARS.

cya,
john

Downtime vs Failure (Score:5, Interesting)

by burnin1965 ( 535071 ) writes: on Tuesday September 21, 2004 @07:55PM (#10314484) Homepage

I'm not sure exactly what downtime for routine maintenance on an AIX system running DBase has to do with a Windows bug that causes a system failure. However, in response, there is a difference between planned downtime where a service is made unavailable while planned routine maintenance is performed and planned downtime or an unplanned failure due to a flaw in the system.

It appears that in this case Windows has a flaw which they try to work around with routine maintenance during planned downtime.

In your case I would say you have planned downtime for routine maintenance to work around the need for an appropriate system to handle the work load.

I suppose what is the same between these two cases is that you both need to change your system to something that is more appropriate for the task at hand. And to be more specific in the FAA case, Windows should not be allowed for use in any application where life, limb, or property is at risk. Hmm, I suppose that may rule out just about every use. :P

burnin

Re:Repent, Sinners! (Score:4, Interesting)

by Turmio ( 29215 ) writes: on Tuesday September 21, 2004 @08:21PM (#10314680) Homepage

You have to love a system that requires downtime as part of uptime. How many Linux users have this problem?
Actually I was hit by the max 497 days uptime [jimohalloran.com] bug of Linux 2.4 (and with a desktop machine no less). The box at work did run for about 650 days, but anyway well after the milestone of the halfway journey to a 2nd consecutive uptime reset. Then it was time for me to change rooms. I wasn't at the office that day and my co-worker just unplugged the box. Was I pissed or not? Yes I was.
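
The 497-day figure falls out of the same kind of wraparound as the Windows one. The sketch below assumes a 32-bit tick counter advancing 100 times per second (the usual rate on x86 builds of 2.4, but treat that as an assumption); the function name is made up. It shows what the uptime display would report once the counter has wrapped.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def displayed_uptime_days(true_days: float, hz: int = 100, bits: int = 32) -> float:
    """Uptime as reported by a wrapping tick counter, in days."""
    true_ticks = int(true_days * SECONDS_PER_DAY * hz)
    wrapped_ticks = true_ticks % (2 ** bits)  # the counter silently starts over
    return wrapped_ticks / hz / SECONDS_PER_DAY


if __name__ == "__main__":
    # The counter wraps after 2**32 ticks, i.e. about 497.1 days at 100 Hz.
    print(f"wrap point: {2 ** 32 / 100 / SECONDS_PER_DAY:.1f} days")
    # A machine that has really been up about 650 days reports roughly 153.
    print(f"650 real days shows as {displayed_uptime_days(650):.1f} days")
```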

Re:Now even the submitters aren't reading the arti (Score:3, Interesting)

by AstroDrabb ( 534369 ) writes: on Tuesday September 21, 2004 @08:25PM (#10314712)

The shutdown is not a crash but a scheduled event to bring the servers down to flush data.

That is MS PHB speak to "assure" other PHB's that it was not MS's fault. What _modern_ server OS needs to reboot to flush freakin data! Why do you think technical details are never released in these types of press releases?
The reboot was to reset the logic flaw in the MS system timer. Read my post here [slashdot.org] on it. It has affected other MS made apps on MS Windows 2000 servers. So if MS's programmers get affected by it, you can expect non-MS employed programmers to get affected too since they do not have the same level of access to the proprietary OS.

Re:Repent, Sinners! (Score:3, Interesting)

by valkraider ( 611225 ) writes: on Tuesday September 21, 2004 @08:53PM (#10314896) Journal

Our college did batch runs for all sorts of stuff. We only had about 3000 students, but between faculty and staff it worked out to around 5000 people in various systems. Things had to run to calculate and process dorm room phone bills, cafeteria plans, accounts payable and receivable, invoices, transcripts, and DOOM wads...

And we were small. I can only imagine what a big school with 30 to 50 thousand students would need done... Not to mention all the DOOM3 wads nowadays. ;)

What if OSS gave them software? (Score:3, Interesting)

by Mustang Matt ( 133426 ) writes: on Tuesday September 21, 2004 @09:43PM (#10315204)

What would happen if a group of people out of the goodness of their hearts wrote them a new system that truly did everything they needed. Would they adopt it?

Or are the corporate powers that be so out of touch with reality that they wouldn't touch anything having to do with "open sores!"

Re:Anyone want to clue them in to scheduled jobs? (Score:3, Interesting)

by DarkVader ( 121278 ) writes: on Tuesday September 21, 2004 @09:53PM (#10315258)

A nightly reboot seems like a sledgehammer approach to me.

I've got a script that pings my upstream router every 10 minutes. If it misses a ping, it waits 30 seconds and tries again. 2 missed pings, and it power cycles my DSL router, using an activehome box and an x10 appliance module.
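
The logic described above could be sketched roughly like this. It is not the poster's actual script: `ping` and `power_cycle` are hypothetical callables standing in for a real ping check and for whatever drives the X10 module, and the 10-minute schedule is assumed to come from cron or a similar scheduler.

```python
import time
from typing import Callable


def watchdog_pass(
    ping: Callable[[], bool],
    power_cycle: Callable[[], None],
    retry_delay_s: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Run one check; return True if the router was power-cycled."""
    if ping():
        return False          # first ping answered, nothing to do
    sleep(retry_delay_s)      # one miss: wait, then try again
    if ping():
        return False          # recovered on its own
    power_cycle()             # two misses in a row: cycle the power
    return True
```

Passing the ping and power-cycle actions in as arguments keeps the decision logic separate from the hardware, so it can be exercised without a router or an X10 box attached.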

Re:Repent, Sinners! (Score:4, Interesting)

by dgatwood ( 11270 ) writes: on Tuesday September 21, 2004 @10:06PM (#10315338) Homepage Journal

As another link in this discussion noted, an unpatched Win2k does, in fact, require a reboot every 49.7 days because a certain process goes nuts and eats 60% of the CPU. The fact that MS has a patch for the problem does not mean that the problem does not exist.

Re:Seen this week at various airports (Score:2, Interesting)

by Anonymous Coward writes: on Tuesday September 21, 2004 @10:17PM (#10315395)

Perfect timing for this comment. I was in the airport yesterday (Detroit). The screens over the metal detectors/ carryon xray machines do nothing except tell you whether the lane is open (a large arrow) or closed (a large X). 4 of the lanes had some sort of Windows error message. Apparently they couldn't handle the workload.

They _do_ use Duct tape and baling wire (Score:3, Interesting)

by billstewart ( 78916 ) writes: on Tuesday September 21, 2004 @10:34PM (#10315482) Journal

Back when I was working on ARTCC replacement in the late 80s, during the daytime they were running the "modern" 1960s IBM System 360/90 system, which was an ugly undocumented unmaintainable hack job written mostly in JOVIAL. For about four hours a night, they'd run the backup system EDARC, which was a 1970s "Enhanced" version of the 1950s "DARC" radar controller. There were all sorts of parts you couldn't get back in the 1980s - IBM had stopped making the "Serpentine" cable connector, for instance.
I was on the lucky team that *lost* the bidding for the replacement system; IBM's team were the poor bastards who won, and were stuck investing seven years into building an unbuildable replacement, pouring billions of dollars down the drain while being micromanaged by the FAA, who didn't know much about software design or reliability in spite of having a methodology that required producing 175 design documents over the optimistically 3-year design period.

49.7-day bug not exclusive to Windows. (Score:4, Interesting)

by Temporal ( 96070 ) writes: on Tuesday September 21, 2004 @11:23PM (#10315789) Journal

It may seem suspicious that the max uptime of the LAX system is the same as the max uptime of a Windows 95 box... until you realize that 49.7 days is 2^32 milliseconds. If you have a piece of software that counts milliseconds using a 32-bit integer, it will inevitably roll over after 49.7 days and -- unless designed to compensate for it -- will probably crash. Windows 95 is certainly not the only piece of software that counts milliseconds in a 32-bit integer.

That said, the Windows GetTickCount() system call returns a timer value as a 32-bit count of milliseconds since the system was booted. Now, any good programmer knows better than to use GetTickCount() -- there are other, better, more robust ways to tell time in Windows -- but it would not surprise me if a newbie had made the mistake of using this system call in the LAX software, thus leading to the problems.

In other words, the Windows timer is not at fault, but it is possible that one of the programmers was confused by the convoluted Win32 API and made a programming error as a result.
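
To make "designed to compensate for it" concrete, here is a sketch of the usual trick: compute elapsed time with unsigned 32-bit subtraction instead of comparing raw tick values. Python integers do not wrap, so the mask below emulates 32-bit arithmetic; the function names are invented, and nothing here is a claim about what the LAX code actually did.

```python
MASK32 = 0xFFFFFFFF  # emulate unsigned 32-bit arithmetic


def elapsed_ms(start_tick: int, now_tick: int) -> int:
    """Elapsed milliseconds between two 32-bit tick readings.

    Correct across a single rollover, as long as the real interval
    is shorter than 2**32 ms (about 49.7 days).
    """
    return (now_tick - start_tick) & MASK32


def naive_timed_out(start_tick: int, now_tick: int, timeout_ms: int) -> bool:
    """Buggy: compares raw tick values, so it breaks when the counter wraps."""
    return now_tick >= start_tick + timeout_ms


def safe_timed_out(start_tick: int, now_tick: int, timeout_ms: int) -> bool:
    """Wrap-safe: compares the elapsed interval instead."""
    return elapsed_ms(start_tick, now_tick) >= timeout_ms


if __name__ == "__main__":
    start = 0xFFFFFF00   # 256 ms before the counter wraps
    now = 0x00000100     # 256 ms after it wrapped, so 512 ms later
    print(naive_timed_out(start, now, 500))  # False: the timeout never fires
    print(safe_timed_out(start, now, 500))   # True
```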

Re:Fire the Department of the Interior's IT staff. (Score:3, Interesting)

by tbogart ( 802762 ) writes: <tjbogart33@gmail.com> on Tuesday September 21, 2004 @11:41PM (#10315895)

Looking at the www.faa.gov home page, it says "Department of Transportation". However, having been a systems engineer and administrator in a couple of stints at one of the DOI Bureaus ... you don't want to know.

Re:It was the app, not the OS (Score:2, Interesting)

by tbogart ( 802762 ) writes: <tjbogart33@gmail.com> on Tuesday September 21, 2004 @11:53PM (#10315943)

Just curious - but how does being a pilot give you more insight into the system? I would particularly like to see the "memory allocation error that retains some of the old tracking on the system". That would be quite amazing in itself.

Can you imagine? (Score:2, Interesting)

by dickens ( 31040 ) writes: on Wednesday September 22, 2004 @12:00AM (#10315979) Homepage

Can you imagine knowing about this problem, putting it into production and not riding your MS rep like a pony until it was verified fixed ? ...with any other vendor.. sheesh.. but I guess it doesn't work that way with MS - even for the FAA.

Re:Repent, Sinners! (Score:3, Interesting)

by adamfranco ( 600246 ) writes: <adam@adamfrPERIODanco.com minus punct> on Wednesday September 22, 2004 @12:16AM (#10316044) Homepage

I personally like "Commence".

Commence writing.
Commence listening to music.
Commence shutdown procedure.

It works for everything!

Usage of "Start" instead of "Commence" probably has something to do with the majority of the population wondering who was graduating when they clicked the button...

Re:2K is based on NT kernel (Score:3, Interesting)

by omicronish ( 750174 ) writes: on Wednesday September 22, 2004 @12:27AM (#10316086)

Just like the "Y2K glitch" was a platform-independent problem based upon the 2-digit-year shorthand causing logical flaws, if you store time in a 32-bit variable by the millisecond... you'll hit the hard limit after about 49.7 days, which is why that number can show up in kernels other than Win9x. If there's no proper handling of that rollover, things go haywire.

One interesting bit is that Quake 1 servers had problems running for more than 49.7 days for what I assume is precisely the same reason.
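The 49.7-day figure can be checked with a little arithmetic. The sketch below (the function name is made up for illustration) computes how long an unsigned counter lasts before rolling over at a given tick rate. It shows that 49.7 days corresponds to a 32-bit counter ticking once per millisecond, whereas a microsecond tick would roll over in a little over an hour.

```python
# Sketch: how long does an unsigned counter last before it rolls over?
# rollover_days is a hypothetical helper, not part of any real API.

SECONDS_PER_DAY = 24 * 60 * 60


def rollover_days(bits, ticks_per_second):
    """Days until a counter of the given width wraps back to zero."""
    seconds = (2 ** bits) / ticks_per_second
    return seconds / SECONDS_PER_DAY


ms_days = rollover_days(32, 1_000)                    # millisecond ticks
us_minutes = rollover_days(32, 1_000_000) * 24 * 60   # microsecond ticks

print(f"32-bit millisecond counter wraps after {ms_days:.1f} days")
print(f"32-bit microsecond counter wraps after {us_minutes:.1f} minutes")
```

Any program that keeps a 32-bit millisecond uptime counter, on any operating system, will see the same rollover, which fits the point that the number is not specific to Win9x.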

Re:Repent, Sinners! (Score:3, Interesting)

by ckedge ( 192996 ) writes: on Wednesday September 22, 2004 @01:02AM (#10316208) Journal

> a) This could easily have been done as a scheduled task in Windows 2000.

Uh, no, no it could not.

Scheduled Tasks in Microsoft Windows have never been reliable. Quite frequently mine have their security credentials "screwed up" somehow and stop working until I notice and "touch" them so I'm forced to re-enter a user/pwd.

I have never EVER heard of Solaris cron failing to run on time.

> and not some poor person manually initiating it every night?

It's Windows; you have to have a person present to ensure that the system actually a) goes down and b) comes back up as intended.

I've done a half-year consulting gig and spent a month walking 5 blocks through the downtown core of San Francisco at 5am every single FUCKING morning to hit the power button on a 4-way 400 MHz $50,000 Compaq Windows box at one of the biggest banks in the world. The database held holdings information on around half a trillion dollars in equities.

MS EULA Says.... (Score:1, Interesting)

by Anonymous Coward writes: on Wednesday September 22, 2004 @01:30AM (#10316319)

Hmmm, I seem to recall glancing at the MS EULA one time (when they were printed on the disk envelope, which I never opened - it came that way - honest), and the thing said in part that it wasn't to be used in life-safety operations, for running a nuke plant, air traffic control, or other real-time operations...

So ummm, unless MS suddenly created a hardened RTOS, why the fsck is this thing even running anywhere near ATC?

I say FIRE the morons who installed it, ordered it, designed it, and sold it... Finally, FINE the hell out of the asshole company that wrote it and allowed it to be sold for that use... I'd say $150/hr PER person inconvenienced by this debacle, PLUS whatever the airlines lost (or might have earned) PLUS a punitive sanction to make it fucking hurt bad enough that they'll realize that this can't ever occur again - I'd say $15 billion would do it...

A Win95 issue presents itself in W2K? (Score:1, Interesting)

by Anonymous Coward writes: on Wednesday September 22, 2004 @02:42AM (#10316565)

Sooooo. They were converting to Windows, eh? Do we really think they were installing Win95 anytime recently to force this bug upon themselves?

I would find it hard to believe that they were installing Win9x OR that Win2K+ was affected by this bug, as I have found no current documentation tying this bug to an installed W2K+ OS.

Blah, blah, blah.

Re:Lessions from other Aviation Authorities (Score:1, Interesting)

by Anonymous Coward writes: on Wednesday September 22, 2004 @02:45AM (#10316579)

The FAA decision sounds like a cost-driven process that chose the cheapest solution that "could" meet the requirements.

Making incremental changes to the existing system, such as faster hardware, would almost certainly have met the requirements and been much cheaper than the solution chosen.

I think someone just got a bad case of "shiny thing" and thousands of travellers ended up paying (fortunately not with their lives).

Hypocrites? (Score:3, Interesting)

by coronaride ( 222264 ) writes: <coronaride@yahPO ... om minus painter> on Wednesday September 22, 2004 @02:58AM (#10316617)

This is not addressed to the parent, but is for everyone who responded to the parent -

I'm throwing stones, now - especially after reading this incredibly long and geeky thread about shutting down your OS variants. God bless you for having multiple ways of shutting down/halting/suspending/restarting your computer in user/superuser/megauser/whosyourdaddyuser modes, but shame on you for being a stickler on MS's decision to place a Shutdown option on the "Start" menu when you can't even agree on how to shut your own damned computers down!

It's hypocritical, pharisaical, and parasitical (I like alliterations, even when they're not in context...makes me feel like Don King) to bring up such an argument as "Please press the Start button to shut down (stop) the computer". I'm not saying that "Start" is the most incredible choice for a button, but it makes sense. If you are shutting down your computer, you START THE SHUTDOWN PROCESS.

Old OS/2 Bug, Not Windows 95 (Score:2, Interesting)

by JohnThreePound ( 513218 ) writes: on Wednesday September 22, 2004 @07:13AM (#10317236)

As I recall, since Windows 2000/NT was once the same product as IBM OS/2 (remember Microsoft OS/2, anybody?), this bug originated from the OS/2 side of the codebase.

IBM ran into the problem sooner, as OS/2 was adopted for various critical things like Automated Teller Machines (ATMs), while Windows NT was mostly used for simple file servers. As a result, the problem was fixed in OS/2 about 2 years before Microsoft got around to fixing it in Windows.

Considering that I remember this patch existing for Windows NT and 2000 back in 1999, it is disheartening that the FAA did not feel it necessary to upgrade to something as simple and critical as Service Pack 2 or 3.

Re:Repent, Sinners! (Score:3, Interesting)

by Shadowlore ( 10860 ) writes: on Wednesday September 22, 2004 @07:51AM (#10317352) Journal

Well, in the past 10 years I have had a number of clients who have had Linux, Unix, Windows, and Mac systems that were critical to their day to day routine and they did nightly/weekly/monthly reboots as part of their maintenance.

I guess when you grow up and get out of high school, you will find that your linux box running as a DSL router is not a good example of a production server.

Yeah, they did that to the Linux boxes here, because they didn't know better. Now, with real Linux experts, our Linuxen are not rebooted or taken down for routine maintenance. And no, we aren't talking about "DSL Routers". We are talking about systems that process email to the tune of a million messages per server per day.

Critical? You bet it is. Merrill Lynch, HP, APL, and many others. Planned downtime for "regular maintenance"? Nope. The only time we plan downtime is for hardware replacement/upgrade and kernel upgrades, the occasional (rare) server moves, and full data center shutdowns to perform data center failover verification.

I guess when you grow up and get out of community college, you'll find that running a dormitory quake server is not a good example of a business critical production server. /pointed sarcasm.

