Bug Operating Systems Software United States Windows

Windows Upgrade, FAA Error Cause LAX Shutdown 862

Posted by michael
from the first-woodpecker-to-come-along dept.
fname writes "The recent shutdown of LAX due to an FAA radio outage was apparently caused by a Windows 2000 integration flaw, possibly related to an old Windows 95 bug. An article at the LA Times claims that the outage was caused by human error, as the system will automatically shut down after 49.7 days (related to this Windows 95 flaw?), and a technician didn't reboot the system monthly as he should have. This happened after an upgrade from Unix to Windows. I don't think blame should be assigned to the technician who missed the task; rather, it seems a gross oversight for the FAA to guarantee that such a critical system will crash after only one missed maintenance task. Who's really at fault?"
  • Repent, Sinners! (Score:5, Insightful)

    by mfh (56) on Tuesday September 21, 2004 @05:49PM (#10313316) Journal
    The recent shutdown of LAX due to an FAA radio outage was apparently caused by a Windows 2000 integration flaw, possibly related to an old Windows 95 bug.

    Okay... a Win95 bug leads to the LAX shutdown because the *same* bug was later found in Win2k? Yup, closed source is the answer, Mr. Gates. I hereby repent my sins of Open Source Freedom and agree that security by obscurity is the answer! /sarcasm

    a technician didn't reboot the system monthly as he should have

    You have to love a system that requires downtime as part of uptime. How many Linux users have this problem? (Please press the Start button to shut down (stop) the computer.)
    • by LostCluster (625375) *
      I've seen AIX-based database systems that require an overnight downtime to do reindexing, since non-SQL formats like DBase have always been a little funky when they start having to deal with million-record tables. It's amazing how ugly legacy databases can be compared to today's tech.
      • by BlueUnderwear (73957) on Tuesday September 21, 2004 @06:17PM (#10313629)
        Was at JAOO today, and on the closing panel discussion for the Test-Driven Development track, Mr Kevlin Henney was praising NASA's rigorous software testing procedures. He was so proud of them that he let out a "and in both space shuttle crashes, software was not to blame". Well, this may be correct if he was thinking only about the flight software... but there [theinquirer.net] is other [usatoday.com] software [kiwiblog.co.nz] than what rides in the shuttle itself...

      • Downtime vs Failure (Score:5, Interesting)

        by burnin1965 (535071) on Tuesday September 21, 2004 @07:55PM (#10314484) Homepage
        I'm not sure exactly what downtime for routine maintenance on an AIX system running DBase has to do with a Windows bug that causes a system failure. However, in response, there is a difference between planned downtime, where a service is made unavailable while routine maintenance is performed, and downtime (planned or unplanned) that exists only because of a flaw in the system.

        It appears that in this case Windows has a flaw which they try to work around with routine maintenance during planned downtime.

        In your case I would say you have planned downtime for routine maintenance to work around the need for an appropriate system to handle the work load.

        I suppose what is the same between these two cases is that you both need to change your system to something that is more appropriate for the task at hand. And to be more specific in the FAA case, Windows should not be allowed for use in any application where life, limb, or property is at risk. Hmm, I suppose that may rule out just about every use. :P

        burnin
      • Re:Repent, Sinners! (Score:4, Informative)

        by multipartmixed (163409) * on Tuesday September 21, 2004 @09:06PM (#10314969) Homepage
        > since non-SQL formats like DBase have always been
        > a little funky when they start having to deal
        > with million-record tables.

        Oh, yes, SQL the magic bullet. I have a database problem! No matter what it is, I can solve it by migrating to a database system which uses SQL!

        > It's amazing how ugly legacy databases can be
        > compared to today's tech.

        Yes, today's tech! SQL, the magic bullet! Why, we should use Oracle! It's SQL and thus must be modern! It's only been around since 1979!

        Wait!

        1979 was a long time ago.

        Oh, dear?

        Could it be that Oracle is not modern tech? But, how could it not be? It uses SQL, the magic bullet!

        Hint: query language and scalability are not related.
        Hint II: RDBMS is no magic bullet, either.
    • by Da Twink Daddy (807110) <bss03@volumehost.net> on Tuesday September 21, 2004 @05:57PM (#10313420) Homepage

      You have to love a system that requires downtime as part of uptime. How many Linux users have this problem? (Please press the Start button to shut down (stop) the computer.)

      Sure,

      init 6
      doesn't sound like it should start (initialize) anything...
      • Re:Repent, Sinners! (Score:3, Informative)

        by Phillup (317168)
        doesn't sound like it should start (initialize) anything

        So... it should not initialize (begin) run level 6?
        • by claar (126368) on Tuesday September 21, 2004 @06:29PM (#10313738)
          Bah, what a cop out. If "we" won't accept criticisms similar to our own, we have no right to criticize in the first place..

          Yes, init 6 is counter-intuitive. I remember that it actually did confuse me a bit the first time I heard of it. Does that mean we need to remove or change it? Nah, let 'em use `shutdown -r` or `alias restart="init 6"`. But just don't be an apologist for Linux, it just makes "us" look hypocritical.
    • by (H)elix1 (231155) <slashdot.helix@nOSPaM.gmail.com> on Tuesday September 21, 2004 @06:08PM (#10313546) Homepage Journal
      You have to love a system that requires downtime as part of uptime. How many Linux users have this problem?

      All right, I cannot throw the first stone here. I can raise my hand as an AIX C programmer back in the day...

      We inherited a huge ball of spaghetti wire, nasty stuff that had memory leaks. Rather than taking the time to fix it, the powers that be determined it was better to keep working on new features rather than hash out the issues. At first it happened once a quarter, then once a month, and as time ticked by a weekly 'fix' to recycle the server. Lord knows I added to the mix as well, as they picked 'cheap' and 'build it fast' (not to be confused with running fast), skipping 'do it right' entirely. That is how it happens... stuff gets rushed before its time. OSS is more immune than the typical commercial gig, but anytime a deadline comes without enough time to finish, something is going to give. Downtime is just duct tape.
    • by pchan- (118053) on Tuesday September 21, 2004 @06:21PM (#10313661) Journal
      where do you want to go today?

      dear microsoft,

      the above question was posed in a line of your advertisements. well, after spending an hour and a half on a plane on the runway in oakland, and another hour on the runway in l.a. (sunday night), i think i have the answer. i want to go home. sounds like a simple enough request, or so i thought.

      but here is what i really want: i would like you (microsoft, inc.), to stop selling your products to mission critical and infrastructure operations until such a time as they are ready to do so. when my desktop computer at work crashes (admittedly a rare occurrence nowadays), i am inconvenienced. when hundreds of thousands of travellers in airports across the world are delayed because one of the busiest airports in the world is shut down due to a 10 year old known bug in your operating systems that has not been fixed, that is simply not acceptable. i realize that buyers of software and IT systems are easily suckered or bribed into using your systems, that is why i am appealing directly to you. please exit this market before we are forced to legislate you out.

      thanks,
      pc
    • by Awptimus Prime (695459) on Tuesday September 21, 2004 @07:42PM (#10314380)
      You have to love a system that requires downtime as part of uptime. How many Linux users have this problem? (Please press the Start button to shut down (stop) the computer.)

      Well, in the past 10 years I have had a number of clients who have had Linux, Unix, Windows, and Mac systems that were critical to their day to day routine and they did nightly/weekly/monthly reboots as part of their maintenance.

      I guess when you grow up and get out of high school, you will find that your linux box running as a DSL router is not a good example of a production server.
    • Re:Repent, Sinners! (Score:4, Interesting)

      by Turmio (29215) on Tuesday September 21, 2004 @08:21PM (#10314680) Homepage
      You have to love a system that requires downtime as part of uptime. How many Linux users have this problem?
      Actually I was hit by the max 497 days uptime [jimohalloran.com] bug of Linux 2.4 (and with a desktop machine no less). The box at work had actually been running for about 650 days, well past the first uptime reset and on its way to a second one. Then it was time for me to change rooms. I wasn't at the office that day and my co-worker just unplugged the box. Was I pissed or not? Yes I was.
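For anyone wondering where 497 days comes from, here is a rough sketch of the arithmetic. It assumes a 32-bit tick counter incremented 100 times per second, which was the usual setup for Linux 2.4 on x86; the function names are made up for illustration.

```python
# Assumption: 32-bit tick counter, 100 ticks per second (typical Linux 2.4 on x86).
HZ = 100
COUNTER_BITS = 32
SECONDS_PER_DAY = 86_400


def days_until_wrap(bits: int = COUNTER_BITS, hz: int = HZ) -> float:
    """Days of uptime before a counter of this width wraps to zero."""
    return (2 ** bits) / hz / SECONDS_PER_DAY


def reported_uptime_days(true_days: float, bits: int = COUNTER_BITS, hz: int = HZ) -> float:
    """Uptime that would be derived from the wrapped counter."""
    ticks = int(true_days * SECONDS_PER_DAY * hz)
    wrapped_ticks = ticks % (2 ** bits)
    return wrapped_ticks / hz / SECONDS_PER_DAY


wrap_days = days_until_wrap()
shown_after_650 = reported_uptime_days(650)
print(f"counter wraps after {wrap_days:.1f} days")
print(f"a box up for 650 days would report about {shown_after_650:.1f} days")
```

Nothing crashes here; the reported uptime just starts over, which is why the bug mostly annoyed people who were proud of their uptime.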
    • by Bush Pig (175019) on Tuesday September 21, 2004 @11:09PM (#10315700)
      I'm still having a bit of trouble with the notion that moving from UNIX to Windows was regarded as an upgrade.

  • by FyRE666 (263011) * on Tuesday September 21, 2004 @05:50PM (#10313324) Homepage
    It's obviously lunacy for any company to replace a proven system, which has given years of reliable service with some piece of trash that crashes if left running for over a month. That said, I was under the impression that a simple "at" job could be used on a Windows machine to run a script periodically (at is similar to cron, except far less capable, of course). Such a script could, if I'm not mistaken, be used to reboot the machine. One would think this would be an ideal way to hide the problem very nicely.

    We use a similar system to reboot all of our NT servers every weekend to help prevent crashes during the week (doesn't work of course, but still).
    • by TykeClone (668449) <TykeClone@gmail.com> on Tuesday September 21, 2004 @05:54PM (#10313378) Homepage Journal
      at sucks. Very, very much.

      I've got an NT server that would hang after 2 weeks. I set up an at job to restart that service nightly and do not have that problem.

      I've also got several linux servers that just plain run (and some NT/2000 servers as well).

      That being said, rebooting sometimes does clear up many evils. We have a speakerphone (around 10 years old - no OS) that just wouldn't work one day. After looking at it, I unplugged it and plugged it back in (I rebooted it!) and it worked. No good reason, it just helps.

    • by dbottaro (302069) on Tuesday September 21, 2004 @05:58PM (#10313432) Homepage Journal

      Agreed. A well written AT script something like this: Each M T W Th R S Su 12:45 AM shutdown /l /r /y /c

      Would do the trick... We have used that exact script for YEARS to nightly reboot a troublesome NT4 BDC at a remote location.

      While we knew that this was not a great solution, no one needed to access the server at that time of night. Any right minded IT person should be able to see the flaw in the FAA's logic.

    • by mekkab (133181) on Tuesday September 21, 2004 @06:10PM (#10313568) Homepage Journal
      It's obviously lunacy for any company to replace a proven system, which has given years of reliable service with some piece of trash that crashes if left running for over a month

      What if that proven system is decaying out from under you? HD's failing, memory going bad... Tell you what, can you get me new boards for an IBM RT pc? I highly doubt it.

      What about "olde" mainframes running assembler code? The pool of expertise is drying up... sometimes you need to pitch the hardware.
    • by Ann Elk (668880) on Tuesday September 21, 2004 @06:11PM (#10313578)
      It's obviously lunacy for any company to replace a proven system, which has given years of reliable service...

      It's obvious you have never toured an ARTCC (Air Route Traffic Control Center). The system that is being replaced was barely hanging together by voodoo and chicken wire. It was designed back in the 60's to handle maybe 1/10th the current capacity. It is in dire need of replacement.

      That said, I'm not convinced Windows (or Linux for that matter) is an appropriate OS for an application that practically defines the phrase "mission critical".

      • by Anonymous Coward on Tuesday September 21, 2004 @07:26PM (#10314240)
        I used to write aviation message handling systems. We migrated from Tru64 (now extinct) to Linux and have had much better performance, maintainability, hardware support, and reliability.

        Of course, the code leap from Tru64 to Linux is quite small, which is the biggest reason why Linux was chosen.

        Aviation expects 99.9999% uptime with absolutely no message loss, and we would achieve that with hot-standbys and MySQL mirroring. All circuits were split and would simultaneously enter both servers. Only the primary server would route the message.

        No, we didn't require the customer to reboot. The system could run for years at a time.

        Putting mission critical applications on Windows 95 is just plain stupid.
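The hot-standby arrangement described in the comment above (every circuit feeds both servers, only the primary routes) can be sketched roughly like this. It is a toy model under stated assumptions, not that poster's actual system: the class and method names are hypothetical, and the MySQL mirroring is not modeled at all.

```python
from dataclasses import dataclass, field


@dataclass
class Server:
    """Hypothetical message-handling server."""
    name: str
    alive: bool = True
    received: list = field(default_factory=list)
    routed: list = field(default_factory=list)

    def receive(self, msg: str) -> None:
        # A dead server simply misses the message.
        if self.alive:
            self.received.append(msg)

    def route(self, msg: str) -> None:
        self.routed.append(msg)


class MirroredPair:
    """Split every incoming message to both servers; only the primary routes."""

    def __init__(self, primary: Server, standby: Server) -> None:
        self.primary = primary
        self.standby = standby

    def deliver(self, msg: str) -> None:
        # The circuit is split, so both servers see the message.
        self.primary.receive(msg)
        self.standby.receive(msg)
        if not self.primary.alive:
            if not self.standby.alive:
                raise RuntimeError("both servers down")
            # Fail over: the standby already holds the message, so nothing is lost.
            self.primary, self.standby = self.standby, self.primary
        self.primary.route(msg)


a, b = Server("a"), Server("b")
pair = MirroredPair(a, b)
pair.deliver("m1")
a.alive = False          # simulate the primary dying
pair.deliver("m2")
print(a.routed, b.routed)
```

The point of the design is that a failed primary costs a role swap, not a reboot and not a lost message.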
    • by Dun Malg (230075) on Tuesday September 21, 2004 @06:37PM (#10313807) Homepage
      Such a script could, if I'm not mistaken, be used to reboot the machine. One would think this would be an ideal way to hide the problem very nicely.

      For a real-time application like air traffic control, you really can't automate reboots like that. You need someone standing there to say "crap! crap! crap!" and take the necessary actions when the system decides it doesn't want to reboot properly.*

      *even if they don't know what to do, they can at least shout "crap!", which is more than a system stuck at the BIOS screen with an "elbow parity error" can say.

  • by jcr (53032) <jcr.mac@com> on Tuesday September 21, 2004 @05:50PM (#10313327) Journal
    Don't use this stuff in mission-critical applications.

    -jcr
    • "This stuff" being all of IT. HDs will fail within 5-7 years no matter what OS you put on them...

      Good IT is so hard to pull off because you have to convince people that events that strike once every few years have to be prepared for otherwise a disruption in service will occur.
  • "Upgrade"? (Score:5, Funny)

    by thelenm (213782) <mthelen@gmail.BOHRcom minus physicist> on Tuesday September 21, 2004 @05:50PM (#10313329) Homepage Journal
    "Upgrade" from Unix to Windows, eh. You keep using that word. I do not think it means what you think it means.
  • by Samir Gupta (623651) on Tuesday September 21, 2004 @05:51PM (#10313339) Homepage
    This is not an attack on Microsoft.

    But most off-the-shelf software has disclaimers expressly stating it is not to be used in mission critical situations. Eg:

    "technology is not fault tolerant and is not designed, manufactured, or intended for use or resale as on-line control equipment in hazardous environments requiring fail-safe performance, such as in the operation of nuclear facilities, aircraft navigation or communication systems, air traffic control, direct life support machines, or weapons systems, in which the failure of Java technology could lead directly to death, personal injury, or severe physical or environmental damage."

  • What?! (Score:5, Funny)

    by ottergoose (770022) on Tuesday September 21, 2004 @05:51PM (#10313340) Homepage
    I thought switching to Windows from *nix saved time, money, and hassle! Haven't you guys seen those banner ads here?
  • by LostCluster (625375) * on Tuesday September 21, 2004 @05:52PM (#10313344)
    When a ball drops on a baseball field at the midpoint between two positions, it's scored a "hit" for the opposition rather than an "error" against either player. Still, a hit for the other side is a bad thing for the entire team.

    This mess was big enough that there's a large enough supply of blame to give some to everybody involved.

    - No system should require a manual reboot on a regular basis... there should at least be a script capable of accomplishing that. But somehow, one got implemented. Blame whoever bought it.
    - Windows shouldn't have had a flaw that required monthly reboots. Blame Microsoft.
    - Somebody should have done the reboots like they were told to. Blame that poor schmuck.

    Bottom line is that everybody's at fault because had any one piece in the chain done their job properly the failure wouldn't have happened, but a cascade of mistakes led to the ball hitting the grass instead of a glove.
    • by PPGMD (679725) on Tuesday September 21, 2004 @05:59PM (#10313436) Journal
      The Patriot missile system had a similar problem. Its timing broke down after a period of time without a reboot (it was a much shorter cycle, either one day or one week).

      Microsoft isn't the only one to have issues like that. But it has been patched and there should have been more than enough time for the FAA to test and deploy the patch on the few legacy machines running Windows 95.

      I simply blame the FAA for wasting money away every year, billions are sunk into the system, but rarely does anything come out of it, Lockheed can deploy a complete new system to every airport for the amount of money that is being dumped into the old TRACONs and towers for MX.

    • Bottom line is that everybody's at fault because had any one piece in the chain done their job properly the failure wouldn't have happened, but a cascade of mistakes led to the ball hitting the grass instead of a glove.

      An error is scored against a player if the player is determined to have been negligent in their position according to the rules. If someone hits a line drive right past the first baseman, it's still a hit. If the first baseman catches it, then drops it instead of making a tag, it's a

      • If multiple players are negligent, then multiple errors are scored. We've all seen "blooper" videos where there are cascading errors; one guy drops a catch, throws it to the next guy who drops it in turn, etc.

        Only one error can be scored per base advanced by the runner, and if the runner took first by a "hit" before the errant throw, then there is only one "error" for his advancement to second. If two players crash into each other and the ball drops, it's usually a hit because it's hard to say either woul
  • Heh (Score:4, Insightful)

    by GypC (7592) on Tuesday September 21, 2004 @05:52PM (#10313350) Homepage Journal

    upgrade from Unix to Windows

    AKA, "The PHB Special"

    Of course, the guy who was supposed to reboot the box will get all the blame. Shit rolls downhill.

    • Re:Heh (Score:5, Funny)

      by Nuclear Elephant (700938) on Tuesday September 21, 2004 @05:55PM (#10313390) Homepage
      It's an upgrade because it helps to create thousands of jobs for full-time system power cycling engineers.
    • Re:Heh (Score:5, Informative)

      by Michael Woodhams (112247) on Tuesday September 21, 2004 @07:42PM (#10314381) Journal
      There is a rather more extreme case of this with the FAA - when first deployed, the cargo doors of the DC-10 were unsafe, with a failure mode that was likely to make the plane uncontrollable in flight.

      This occurred in flight, and through luck (which allowed some degree of control) and extraordinary airmanship, the plane was landed safely. (This is known as "The Windsor Incident.")

      McDonnell-Douglas didn't want to do a proper redesign of the door mechanism, and the FAA head was a 'companies know best' political appointee, so the result was McD added little windows to the door so that the guy closing the door could look to see it had all engaged properly. (This was over vigorous opposition by the NTSB, who recognized the inadequacy of the fix.)

      The situation: A single failure (not looking, or looking but not noticing an unsafe condition) by a non-safety trained close to minimum wage employee could cause the deaths of hundreds of people.

      Result: over 300 dead when a Turkish Airlines DC-10 crashed near Paris. The guy who closed the door hadn't even been told he was supposed to check the little windows.

      Safety critical systems must be tolerant of human error. If a single omission by a human leads to a hazardous situation, this is primarily the fault of the system, not the human.
  • by Billy Donahue (29642) on Tuesday September 21, 2004 @05:53PM (#10313357)

    To the rescue!
    http://www.nbc.com/LAX/ [nbc.com]
  • by Anonymous Coward on Tuesday September 21, 2004 @05:53PM (#10313364)
    "This happened after an upgrade from Unix to Windows."

    That's the funniest thing I heard all day. Windows is an upgrade from unix. I almost choked on my coffee.
  • humans rule (Score:4, Insightful)

    by Doc Ruby (173196) on Tuesday September 21, 2004 @05:53PM (#10313367) Homepage Journal
    It is human error: those bugs didn't write themselves. Nor did the operations protocol that required "rebooting LAX" every 49.69(!) days. Nor did the upgrade procedure that ignored that bottleneck. Nor did the upgrade decision that moved from Unix to Windows. Those were all human errors, as was the decision to keep a job at LAX that would face blame for shutting down the airport (or risking lives) if the reboot was missed, or unsuccessful.

    "Not I," says the referee,
    "Don't point your finger at me.
    I could've stopped it in the eighth
    An' maybe kept him from his fate,
    But the crowd would've booed, I'm sure,
    At not gettin' their money's worth.
    It's too bad he had to go,
    But there was a pressure on me too, you know.
    It wasn't me that made him fall.
    No, you can't blame me at all."
    - Bob Dylan, "Who Killed Davey Moore?" [bobdylan.com]
  • by overbom (461949) <overbom@nospaM.yahoo.com> on Tuesday September 21, 2004 @05:54PM (#10313368)
    sleep 4294080
    shutdown /s
  • Why 49.7 days? (Score:5, Informative)

    by FirstTimeCaller (521493) on Tuesday September 21, 2004 @05:56PM (#10313392)

    Because there are 4294080000 milliseconds in that time period. That is within about 15 minutes of the 4294967296 (2^32) milliseconds it takes to cause a roll-over when using a 32 bit counter (and yes, 49.7 is an approximate value).

    Very few Win95 systems ever made it that long without a reboot... but you would've thought that it would've been fixed by Windows 2000.
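A quick check of the arithmetic above, as a sketch. The only assumption is the one already stated in the comment: an unsigned 32-bit counter that ticks once per millisecond.

```python
MS_PER_DAY = 24 * 60 * 60 * 1000          # 86,400,000 ms in a day

wrap_ms = 2 ** 32                         # an unsigned 32-bit counter rolls over here
wrap_days = wrap_ms / MS_PER_DAY          # the "49.7 days" figure, unrounded

rounded_ms = round(49.7 * MS_PER_DAY)     # milliseconds in exactly 49.7 days
gap_minutes = (wrap_ms - rounded_ms) / 60_000

print(f"rollover after {wrap_ms} ms = {wrap_days:.4f} days")
print(f"49.7 days is {rounded_ms} ms, {gap_minutes:.1f} minutes short of rollover")
```

So 49.7 is the rounded figure; the counter actually wraps a little under 15 minutes later.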

  • by rasafras (637995) <tamas&pha,jhu,edu> on Tuesday September 21, 2004 @05:56PM (#10313393) Homepage
    ...keep in mind that we have established numerous times that windows is not suitable for systems that need reliability and stability. It is not the operating system's fault that this happened, it is the FAA's for choosing to use it instead of considering the better alternatives. If you get run over on a bicycle while riding on the highway, don't blame the bike.
    Quick addition: it seems that the fault does not belong entirely to windows, but rather a combination of the software running on it and the system architecture.

    With that said, Windows could stand to improve a lot. It has too many bugs, too many flaws, and so on. And it definitely does not have a stable, secure, reliable base. So don't expect it to.
  • 32 bit timer (Score:5, Interesting)

    by charnov (183495) on Tuesday September 21, 2004 @05:57PM (#10313419) Homepage Journal
    This old error was from the use of a 32 bit 1 ms increment timer (comes out to 49.7 days until rollover). AFAIK, this was fixed in Win2k and above when the timer got bumped to 64 bit. Maybe whoever set up LAX was using some ancient legacy middleware that used the old timer. This is just bizarre. In both locations that I have worked the last three years, none of the Win2k or Win2k3 servers went down ever. Sounds like bad consultants.
    • Re:32 bit timer (Score:5, Informative)

      by Draknor (745036) on Tuesday September 21, 2004 @06:06PM (#10313530) Homepage
      Parent is right - it's not a bug in Windows itself, but rather a piece of software running on Windows - from (one of the) FAs:

      Richard Riggs, an advisor to the technicians union, said the FAA - the American aviation regulator - had been planning to fix the program for some time. "They should have done it before they fielded the system," he said.

      (emphasis added)
  • by Trailer Trash (60756) on Tuesday September 21, 2004 @05:58PM (#10313431) Homepage

    The system offers unprecedented voice quality, touch-screen technology, dynamic reconfiguration capabilities to meet changing needs, and an operational availability of 0.9999999

    Okay, bullshit. If I have to reboot a server every month, .0000001 of a month is- oh, let's be generous and only count months with 31 days- about .26 seconds. That's a damned fast boot time for Win2K.

    Maybe they left off a percent sign?
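The back-of-the-envelope number above checks out. Here is a sketch of the calculation, assuming (as the comment does) a 31-day month and reading 0.9999999 as the fraction of time the system must be up.

```python
AVAILABILITY = 0.9999999                  # the "seven nines" figure from the press release
SECONDS_PER_DAY = 86_400

seconds_in_month = 31 * SECONDS_PER_DAY
seconds_in_year = 365 * SECONDS_PER_DAY

# Downtime budget is the unavailable fraction times the length of the period.
budget_per_month = (1 - AVAILABILITY) * seconds_in_month
budget_per_year = (1 - AVAILABILITY) * seconds_in_year

print(f"allowed downtime per 31-day month: {budget_per_month:.3f} s")
print(f"allowed downtime per year: {budget_per_year:.2f} s")
```

Either way, a single monthly reboot blows the whole budget unless a redundant box carries the load while it happens.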

  • We used to joke (Score:4, Interesting)

    by multiplexo (27356) * on Tuesday September 21, 2004 @05:59PM (#10313440) Journal
    that no one would ever run into the 49.7 day bug on a Windows system because the chances of having that much uptime were slim to none. Having a system where you know that things are broken and you have to reboot it every 30 days to keep it from breaking down is a bad thing, deploying such a system into a production environment is even worse (but it's been done, I don't know how many times I wrote cron jobs to kill bad pieces of software and restart them) but deploying such a system in an environment where lives are at stake is completely inexcusable, regardless of whether or not it is closed or open source. This is similar to having a circuit in your house that overheats because occasionally too much load is placed on it. The idiot solution is to reset the breaker when it trips, the correct solution is to put in a bigger circuit that can handle the peak load. This vendor provided the idiot solution to this problem and should be punished for it, this never should have been deployed, I can only hope that they won't blame the technician for failing to do something that he wouldn't have had to do if the system had been designed properly.

    I also love the statement that the system was upgraded from UNIX to Windows. Isn't this kind of like upgrading from being in very good health but not being good looking to being somewhat good looking but suffering from cancer, AIDS and heart disease?

  • 49.7 days (Score:5, Funny)

    by k4_pacific (736911) <{moc.oohay} {ta} {cificap_4k}> on Tuesday September 21, 2004 @05:59PM (#10313443) Homepage Journal
    I remember back when that bug was announced. Seems it was at least a couple of years after Windows 95 had been out. I guess they had to work through a lot of other bugs to get Windows 95 to make it long enough for this bug to occur.
  • by Ann Elk (668880) on Tuesday September 21, 2004 @06:03PM (#10313493)

    OK, I know it's a violation of /. policy to actually read a referenced article. My bad. But, according to the software.silicon.com article:

    Richard Riggs, an advisor to the technicians union, said the FAA - the American aviation regulator - had been planning to fix the program for some time. "They should have done it before they fielded the system," he said.

    This sounds to me like more of a problem with the application, not the OS. The "system" crashed after 49.7 days, which is about 4 million seconds, which is about 4 billion milliseconds, which is (obviously) MAX_ULONG. I suspect the application is using a ULONG to store a timeout value and got pissed-off when it rolled over.

    • by tyler_larson (558763) on Tuesday September 21, 2004 @07:05PM (#10314085) Homepage
      This sounds to me like more of a problem with the application, not the OS.

      Three words:

      GetTickCount()

      Returns the number of milliseconds since the machine was last booted.

      From reading the article, one would surmise that this function is used to assign a timestamp to a particular flight plan or other record. After the machine has been running for 49.7 days, the GetTickCount() function rolls over to zero, which could cause a whole plethora of problems. Almost certainly those problems would include things like corruption of data, lost records, old records showing up as new, application crashes, and, of course, swarms of locusts. The only fix is to reboot.

      The developers cleverly noticed the potential disaster before it crashed any planes, and as a workaround, instituted a policy requiring the servers to be rebooted at monthly intervals. Failure to do so would result in the calamities described above.

      So while the problem wasn't the old Win95 bug, it was the same crappy windows API that caused both. The POSIX-compliant gettimeofday() function uses a 64-bit structure and does not suffer from the same flaw, and can be relied upon for at least the next 30 years or so (which isn't amazing, but it's a lot better than 50 days).

      Note that the FAA insists that they're currently implementing a better solution than "reboot every month". Better hurry, guys, you've only got 47.3 days left.
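To make the rollover problem concrete, here is a small simulation. It is plain Python standing in for a 32-bit millisecond counter like the one GetTickCount() returns, not a call into the Win32 API, and whether the FAA application actually used tick counts this way is the poster's guess, not something the article confirms. The helper names are made up.

```python
MASK_32 = 0xFFFFFFFF


def tick_count(ms_since_boot: int) -> int:
    """Simulate a 32-bit millisecond counter that wraps to zero."""
    return ms_since_boot & MASK_32


def naive_is_newer(stamp_a: int, stamp_b: int) -> bool:
    """Broken across a wrap: compares raw counter values."""
    return stamp_a > stamp_b


def elapsed_ms(start: int, now: int) -> int:
    """Wrap-safe elapsed time, valid while the true interval is under 2**32 ms."""
    return (now - start) & MASK_32


before = tick_count(2 ** 32 - 500)   # stamped 500 ms before the rollover
after = tick_count(2 ** 32 + 250)    # stamped 250 ms after the rollover

print(naive_is_newer(after, before))  # the later record looks older
print(elapsed_ms(before, after))      # modular subtraction still gets the interval
```

Modular subtraction only helps for measuring intervals shorter than one wrap period. If the raw value is stored as an absolute timestamp, as the comment speculates, the real fix is a wider counter or wall-clock time.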

    • by WebCowboy (196209) on Tuesday September 21, 2004 @07:37PM (#10314336)
      ...but I blame a lot of people for carelessness and incompetence (except for the actual techie that forgot to reboot last month--that is an honest mistake).

      * Bill Gates and developers of Win2000 for the convoluted, kludgy API they designed for their OS

      * Product managers at Harris--the crap-for-brains who actually thought changing out robust UNIX servers that weren't really THAT old with consumer-grade PCs running an unproven OS was an UPGRADE to a critical, safety related system. WHAT THE HELL WERE THEY THINKING? In one of the article links (the Harris press release), Harris touted SEVEN NINES reliability! If that was a criterion they should've NEVER considered Windows...Not even BillG himself would say Win2k could provide that sort of uptime!

      * Retarded developers at Harris who used an API call that tracks milliseconds in a 32 bit integer despite the fact that bugs related to the use of said function call were WELL KNOWN by that time.

      * Dough-heads at LAX and the FAA who, upon finding the error early in development, decided it was OK to rely on MANUAL MONTHLY REBOOTS as a workaround to a potentially fatal problem. They should've run the "upgraded" windows machines in parallel with the UNIX servers for much longer, and failing that they should've IMMEDIATELY restored the old UNIX servers to service as soon as the problem was discovered, and to refuse the upgrade (and revoke payment to Harris) until the problem was properly resolved (and NOT just worked around with a kludge like an email reminder to reboot, or a reboot script or a shutdown warning either).

      I'm surprised that this sort of error got into such a critical system, and at the way it was handled. I would've certainly tested the new system in parallel for long enough to catch this sort of error and kept the old system around for longer as a standby (in my experience, replacements of critical systems were often tested in parallel for 3 months to a year). I also would've acted much more decisively in resolving the problem if it did slip through the cracks, given a system crash could put lives in danger.

      Maybe my girlfriend's fear of flying is more justified than I thought if these are the kind of clowns we trust our safety to...
  • by akiy (56302) on Tuesday September 21, 2004 @06:05PM (#10313516) Homepage
    I believe the 49.7 days of uptime for a Windows 95 box is a new record, shattering the previous record in Norway of 27.9 days back on January through February of 2001. Congratulations!
  • by MosesJones (55544) on Tuesday September 21, 2004 @06:06PM (#10313523) Homepage

    I worked for around 5 years in Air Traffic Control projects, both in delivery of radar processing and displays and in R&D for next generation systems.

    Let me give you an overview of the failure approach of just one of those systems.

    1) Everything on Unix, ruggedised releases of UNIX

    2) Every box must be able to FAIL ON ITS OWN

    3) Every box must have a direct replacement, or replacements, which carry the SAME LOAD.

    4) ZERO total system downtime allowed, partial systems failures are allowed, but core systems must keep running.

    5) 5 stages of power supply failure, double mains, double generation and lastly a great big warehouse of car batteries if all else fails.

    6) 4 Years of testing of FULL system before live.

    This is what is normal when safety is the primary concern. What the FAA decision sounds like is a cost driven process which chose the cheapest solution that "could" meet the requirements.

    The idea of a safety-critical (if it fails people could die) system that requires a reboot is fine in only one case: if it can be non-operational on a regular basis, in which case the reboot should be done in EVERY non-operational window (say every week). This is therefore okay for some hospital scanners that are certified for 12-hour runs. It's not okay for a 24/7 system that controls objects flying around at 500 miles an hour.

    Welcome to the US... we will be landing slightly quicker than expected.
  • by whoever57 (658626) on Tuesday September 21, 2004 @06:07PM (#10313536) Journal
    This week, while flying, I saw:
    1. Windows-based terminal used by the public to print tickets (I think) with a "you have chosen to download a file, what do you want to do with it: save, open" or similar (I don't recall the exact wording).

    2. A Windows-based machine that was part of the baggage scanning setup at Chicago-O'Hare going through a scandisk process. OK, this may have been due to operators turning the machine off using the power switch, but shouldn't such a machine use a read-only boot drive/partition?

    Do you feel more secure?
  • by art123 (309756) on Tuesday September 21, 2004 @06:08PM (#10313550)
    There is no such thing as a Windows 2000 49.7 day bug that causes an OS problem.

    The problem here is the software made by Harris does not handle a rollover of the GetTickCount() function turning back to 0. This function counts the number of milliseconds since the OS was last booted so it should be obvious to anybody that the returned unsigned 4 byte integer cannot go on forever.

    So the badly written Harris software has this bug, and their solution (which was really not that bad of a workaround) was to manually reboot the system every 30 days, but as a fail-safe, they had a scheduled task to do a reboot on the 49th day just in case. The 49th day came because of procedural error.

    There is nothing Microsoft could do to prevent this.
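
    To make the rollover concrete, here is a minimal sketch in Python that simulates a 32-bit millisecond counter. It is not the Harris code, which nobody here has seen; the function names are made up for illustration, and the only thing taken from the real API is that GetTickCount() returns an unsigned 32-bit millisecond count.

```python
# Hypothetical simulation of a 32-bit millisecond tick counter.
# None of these names come from the real system; they are illustrative only.
TICK_MASK = 0xFFFFFFFF  # an unsigned 32-bit counter wraps modulo 2**32


def naive_elapsed(start_tick, now_tick):
    """Plain subtraction: the result goes hugely negative once the counter wraps."""
    return now_tick - start_tick


def wrap_safe_elapsed(start_tick, now_tick):
    """Subtract modulo 2**32, the way unsigned 32-bit arithmetic behaves in C."""
    return (now_tick - start_tick) & TICK_MASK


# Start timing 1000 ms before the wrap, then read the counter 3000 ms later.
start = 2**32 - 1000
now = (start + 3000) & TICK_MASK  # the counter has wrapped around to 2000

print(naive_elapsed(start, now))      # negative "elapsed time"
print(wrap_safe_elapsed(start, now))  # 3000
```

    The masked subtraction only gives the right answer while the real interval is shorter than 2**32 ms, so it suits short timeouts, not measuring total uptime.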
  • by plopez (54068) on Tuesday September 21, 2004 @06:13PM (#10313598) Journal
    http://msdn.microsoft.com/library/default.asp?url=/library/en-us/sysinfo/base/gettickcount.asp

    Sounds like whoever wrote the software/OS module they were relying on used this gem. I hereby dub whosoever was so silly as to do this a 'code monkey, first class'.
  • by Eric Seppanen (79060) on Tuesday September 21, 2004 @06:15PM (#10313611)
    Headline:
    Microsoft server crash nearly causes 800-plane pile-up
    failure to restart system caused data overload

    giant advertisement:

    Make a name for yourself with Windows Server System
    I'm thinking that maybe "the guy that almost crashed a bunch of planes" is not the name they were looking for.

    (I'm not making this up- that's really the ad I'm seeing.)

  • by Mateito (746185) on Tuesday September 21, 2004 @06:28PM (#10313727) Homepage
    The system offers unprecedented voice quality, touch-screen technology, dynamic reconfiguration capabilities to meet changing needs, and an operational availability of 0.9999999.

    Whoah! 7 nines uptime!

    About 3 seconds of downtime per year.

    Somebody is on drugs if they sold that. Somebody is on even stronger drugs if they bought that story.

    "5 nines", for all intents and purposes, is as good as it gets, with "6 nines" seen as the holy grail. The top HA system I've ever dealt with (running a Telco's billing operation spanning 4 countries!) quoted a figure of 0.999996. To nobody's surprise, it did not run Windows.
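
    For reference, the arithmetic behind "N nines" can be sketched like this. It is a back-of-the-envelope calculation that assumes a 365-day year and treats availability as a simple fraction of uptime; the helper name is made up for illustration.

```python
# Back-of-the-envelope downtime budget for a given availability figure.
# Assumes a 365-day year; the function name is illustrative only.
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # 31,536,000


def downtime_seconds_per_year(availability):
    """Allowed downtime per year, in seconds, for an availability fraction."""
    return (1.0 - availability) * SECONDS_PER_YEAR


figures = [
    ("5 nines", 0.99999),
    ("6 nines", 0.999999),
    ("7 nines", 0.9999999),
    ("telco billing system", 0.999996),
]
for label, availability in figures:
    print(f"{label}: {downtime_seconds_per_year(availability):.1f} s/year")
```

    Under those assumptions, five nines allows a little over five minutes a year, six nines about half a minute, and seven nines only a few seconds.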

    Wonder how much their failure clause is going to set them back?

  • by DunbarTheInept (764) on Tuesday September 21, 2004 @06:29PM (#10313732) Homepage
    While I hate MS as much as the next guy, this might not really be directly their fault. Unix systems are often installed with the instruction that they get rebooted regularly. Often the problem is caused by application code, not the OS. If you have a memory leak in an application that runs and stays up all the time, it's going to make the system horribly unusable in the long run regardless of whether it's UNIX or Windows. While a reboot might be overkill when it was just one application misbehaving, a reboot is a guaranteed way to kill and reset the responsible program no matter which one it is. At a previous place of employment we told the customer to do monthly reboots mainly because we didn't trust *our own* code to be that perfect.
  • by craXORjack (726120) on Tuesday September 21, 2004 @06:48PM (#10313930)
    Ladies and Gentlemen, at this time the Captain would like to ask you to remain seated with your seatbelt firmly fastened. However, if there are any computer technicians flying with us today, especially if they know what to do when a 'Fatal Exception has occurred at 0029:C02FDEC6', would that person please come forward to the cabin immediately?
  • What failed? (Score:5, Insightful)

    by AK Marc (707885) on Tuesday September 21, 2004 @06:50PM (#10313948)
    A system was deployed in which the application (not the OS) failed after a finite time, and it was deployed by people who knew it was faulty. An under-trained technician failed to reboot the server as scheduled. There was a backup, which we don't have details on. It failed to work as well.

    I don't see what the OS has to do with this. It could have been written for *NIX, OS/2, or any other OS. The lessons are two:
    Don't deploy flawed software.
    Make sure redundant systems work.

    As an aside, since we don't know what the backup was, we could hypothetically say that it was the UNIX system that previously was primary that was relegated to backup duty. In that case, it would be a failure of Windows and UNIX at the same time. So, is it that UNIX sucks and is worthless for any important systems, or is it that the people that screwed this up would have screwed up something, no matter what OS they were working with?
  • by Teahouse (267087) on Tuesday September 21, 2004 @07:01PM (#10314038)
    Pilot here, and this has been a well-known peccadillo of the tracking system for SoCal Approach for a few years. It's an application problem that came into being after an upgrade of the application, not the OS. It's a memory allocation error that retains some of the old tracking on the system; thus, the whole box needs to be rebooted every 45 days or the memory overloads and crashes the OS. Look guys, I'm a Linux user and all, but let's not run around blaming M$ for problems with buggy software apps.

  • by techsoldaten (309296) on Tuesday September 21, 2004 @07:09PM (#10314113) Journal
    Since we are being technical about the answer, does this mean Microsoft or the software vendor qualifies as a terrorist organization?

    Consider the fact that an entire airport was shut down, lives were disrupted, major economic harm was caused our airlines as a result of flights not getting out on time. LAX is a major hub that connects travelers throughout the country, it is conceivable traffic patterns throughout the U.S. were put out by this problem.

    Think of it like a car bomb that went off without anyone dying, and you see my point.

    M
  • by Dr.Dubious DDQ (11968) on Tuesday September 21, 2004 @07:55PM (#10314489) Homepage

    The FAA is under the auspices of the US Department of the Interior, aren't they? You know, the same department that was ordered by a court to take ALL of their systems off line because they were apparently unable to secure them [washingtonpost.com]? TWICE [geotimes.org]? (No, wait, the latter link says THREE times, most recently March 2004...!)

    Is there some secret plot to make them look bad, or is the Department of the Interior riddled with incompetence? I certainly don't feel real secure about the safety of our airlines right now - and it's got nothing to do with "terrorists"...

    (Not to say that terrorism isn't a real concern, but I'm somewhat less worried that their intentional plots will slip through observation by the authorities than "accidental" screwed up software being deployed by the FAA...)

  • by jayhawk88 (160512) <jayhawk88@gmail.com> on Tuesday September 21, 2004 @08:12PM (#10314610)
    I don't think blame should be assigned to the technician who missed the task...

    Boss: OK Tech, it's your job to see to it this computer is rebooted monthly.
    Tech: Will do Boss!
    *Time Passes, System Crashes*
    Boss: The system crashed, why is that?
    Tech: Well, it's because I didn't reboot the system like I should have.
    Boss: Oh well, I guess it's not your fault, obviously I failed to realize maximum security synergy in my systems.

    Wherever the submitter works, I wanna get a job there!
  • Maintenance. (Score:4, Insightful)

    by Zebra_X (13249) on Tuesday September 21, 2004 @09:41PM (#10315195)
    it seems a gross oversight for the FAA to guarantee that such a critical system will crash after only one missed maintenance task.

    Would you feel this way if the airplane that you were flying in missed its engine overhaul time, the engine failed catastrophically and your plane crashed?

    Critical System + Maintenance = Must Be Done.

    The system was designed and setup in a particular manner. In fact, the reboot rule was added to the design of the system, so that this very thing would not happen.

    Whoever's job it was to reboot the machine is at fault for not maintaining the system properly.

    Whether the procedure of rebooting a machine every month is inane is a different discussion.
  • by Temporal (96070) on Tuesday September 21, 2004 @11:23PM (#10315789) Journal
    It may seem suspicious that the max uptime of the LAX system is the same as the max uptime of a Windows 95 box... until you realize that 49.7 days is 2^32 milliseconds. If you have a piece of software that counts milliseconds using a 32-bit integer, it will inevitably roll over after 49.7 days and -- unless designed to compensate for it -- will probably crash. Windows 95 is certainly not the only piece of software that counts milliseconds in a 32-bit integer.

    That said, the Windows GetTickCount() system call returns a timer value as a 32-bit count of milliseconds since the system was booted. Now, any good programmer knows better than to use GetTickCount() -- there are other, better, more robust ways to tell time in Windows -- but it would not surprise me if a newbie had made the mistake of using this system call in the LAX software, thus leading to the problems.

    In other words, the Windows timer is not at fault, but it is possible that one of the programmers was confused by the convoluted Win32 API and made a programming error as a result.
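
    The 49.7-day figure is easy to check. Here is a small sketch of the arithmetic; the helper name is made up, and the 64-bit line is only there to show how much headroom a wider counter would give.

```python
# How long a millisecond counter of a given width runs before wrapping to zero.
# The function name is illustrative only.
MS_PER_DAY = 24 * 60 * 60 * 1000  # 86,400,000


def days_until_wrap(bits):
    """Days until an unsigned counter of `bits` bits, counting ms, rolls over."""
    return 2**bits / MS_PER_DAY


days32 = days_until_wrap(32)
years64 = days_until_wrap(64) / 365.25

print(f"32-bit counter wraps after {days32:.2f} days")         # 49.71 days
print(f"64-bit counter wraps after about {years64:.3g} years")
```

    So any program that keeps milliseconds in an unsigned 32-bit integer hits the same wall at the same time, whatever OS it runs on.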
