
Windows Upgrade, FAA Error Cause LAX Shutdown

fname writes "The recent shutdown of LAX due to an FAA radio outage was apparently caused by a Windows 2000 integration flaw, possibly related to an old Windows 95 bug. An article at the LA Times claims that the outage was caused by human error, as the system will automatically shut down after 49.7 days (related to this Windows 95 flaw?), and a technician didn't reboot the system monthly as he should have. This happened after an upgrade from Unix to Windows. I don't think blame should be assigned to the technician who missed the task; rather, it seems a gross oversight for the FAA to guarantee that such a critical system will crash after only one missed maintenance task. Who's really at fault?"
This discussion has been archived. No new comments can be posted.

  • by Samir Gupta ( 623651 ) on Tuesday September 21, 2004 @05:51PM (#10313339) Homepage
    This is not an attack on Microsoft.

    But most off-the-shelf software has disclaimers expressly stating that it is not to be used in mission-critical situations. E.g.:

    "technology is not fault tolerant and is not designed, manufactured, or intended for use or resale as on-line control equipment in hazardous environments requiring fail-safe performance, such as in the operation of nuclear facilities, aircraft navigation or communication systems, air traffic control, direct life support machines, or weapons systems, in which the failure of Java technology could lead directly to death, personal injury, or severe physical or environmental damage."

  • by Embedded2004 ( 789698 ) on Tuesday September 21, 2004 @05:54PM (#10313374)
    Well, if it is running Windows, and somehow someone made a mistake and decided to run it on some mission-critical system, they should reghost it as often as they can.

    Windows has an odd tendency to corrupt itself.
  • Why 49.7 days? (Score:5, Informative)

    by FirstTimeCaller ( 521493 ) on Tuesday September 21, 2004 @05:56PM (#10313392)

    Because there are 4294080000 milliseconds in that time period, just short of the 4294967296 that a 32-bit counter can hold before rolling over (and yes, 49.7 is an approximate value).

    Very few Win95 systems ever made it that long without a reboot... but you would've thought that it would've been fixed by Windows 2000.

  • by dbottaro ( 302069 ) on Tuesday September 21, 2004 @05:58PM (#10313432) Homepage Journal

    Agreed. A well-written AT script something like this: Each M T W Th R S Su 12:45 AM shutdown /l /r /y /c

    Would do the trick... We have used that exact script for YEARS to nightly reboot a troublesome NT4 BDC at a remote location.

    While we knew that this was not a great solution, no one needed to access the server at that time of night. Any right-minded IT person should be able to see the flaw in the FAA's logic.

  • by pyro101 ( 564166 ) on Tuesday September 21, 2004 @05:59PM (#10313442) Homepage
    I don't know about using Windows 95, but here at the nuclear facility where I work we use not only Java but also Windows. We have been using Windows for some time and have to use Java because that is the way Oracle is going. We have more problems with hardware issues than with the off-the-shelf software, but no matter what problems we get from any of it, we as software developers are supposed to anticipate them and prove that we can, within reason, catch the user/machine/other devices before they screw stuff up. But most of all we go through huge testing on any small addition or change to the code base; even changing the color of menus requires 10-20 signatures (you never know what else could have been added by accident).
  • Re:Why 49.7 days? (Score:5, Informative)

    by Holi ( 250190 ) on Tuesday September 21, 2004 @06:02PM (#10313481)
    This issue has nothing to do with the Win95 bug. That was just the submitter's opinion (which happens to be very wrong).
  • Re:Repent, Sinners! (Score:3, Informative)

    by Phillup ( 317168 ) on Tuesday September 21, 2004 @06:05PM (#10313519)
    doesn't sound like it should start (initialize) anything

    So... it should not initialize (begin) run level 6?
  • Re:32 bit timer (Score:5, Informative)

    by Draknor ( 745036 ) on Tuesday September 21, 2004 @06:06PM (#10313530) Homepage
    Parent is right - it's not a bug in Windows itself, but rather in a piece of software running on Windows - from (one of the) FAs:

    Richard Riggs, an advisor to the technicians union, said the FAA - the American aviation regulator - had been planning to fix the program for some time. "They should have done it before they fielded the system," he said.

    (emphasis added)
  • by art123 ( 309756 ) on Tuesday September 21, 2004 @06:08PM (#10313550)
    There is no such thing as a Windows 2000 49.7 day bug that causes an OS problem.

    The problem here is that the software made by Harris does not handle the GetTickCount() function rolling back over to 0. This function counts the number of milliseconds since the OS was last booted, so it should be obvious to anybody that the returned unsigned 4-byte integer cannot go on forever.

    So the badly written Harris software has this bug, and their solution (which was really not that bad of a workaround) was to manually reboot the system every 30 days; as a fail-safe, they had a scheduled task to do a reboot on the 49th day just in case. The 49th day came because of procedural error.

    There is nothing Microsoft could do to prevent this.
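
A hypothetical sketch of how a rollover bug of this kind can arise, simulating a 32-bit tick counter in Python. The function name and the absolute-deadline pattern are illustrative assumptions; nothing here is taken from the actual Harris code.

```python
MASK32 = 0xFFFFFFFF  # a DWORD tick count holds values 0 .. 2**32 - 1


def naive_deadline_passed(start_tick, now_tick, limit_ms):
    """Hypothetical buggy check: compares against an absolute deadline.

    The deadline is computed in 32-bit arithmetic, so it wraps to a small
    number when start_tick is close to the rollover point.
    """
    deadline = (start_tick + limit_ms) & MASK32
    return now_tick >= deadline


# Far from the rollover the check behaves as intended.
print(naive_deadline_passed(1_000_000, 1_000_500, 2000))  # False: only 500 ms elapsed

# One second before rollover the deadline wraps to 1000, so the check
# reports a timeout even though no time has elapsed at all.
start = 2**32 - 1000
print(naive_deadline_passed(start, start, 2000))  # True
```

Code like this can run correctly for weeks and only misbehave in the last moments before the counter wraps, which is consistent with a failure that shows up only after a missed reboot.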
  • by jbwolfe ( 241413 ) on Tuesday September 21, 2004 @06:08PM (#10313555) Homepage
    Hey, I submitted this two days ago. What makes it Slashdot-worthy now?
  • by Kehvarl ( 812337 ) on Tuesday September 21, 2004 @06:11PM (#10313577)
    The shutdown wasn't the problem, or more appropriately, the shutdown that would have prevented the problem was missed. But also the FAA's software probably has some issues of its own that need to be fixed.

    On a completely different subject, I move that any post containing a phrase along the lines of "This is going to get me moderated as Troll" be automatically moderated Troll. Too many of us seem to use it because it tends to lead to the opposite result.
  • Try telinit (Score:2, Informative)

    by TheScienceKid ( 611371 ) on Tuesday September 21, 2004 @06:21PM (#10313658)
    What you may not have taken the time to observe is that when you run init with a name of telinit or with a process ID other than 1, it runs in 'telinit' mode. In this mode it passes a message via /dev/initctl (a FIFO) to tell the running copy of 'init' (the process responsible for initialising services and managing them thereafter) to perform a specific action (e.g. shutdown, reboot, etc.).
  • by larien ( 5608 ) on Tuesday September 21, 2004 @06:23PM (#10313675) Homepage Journal
    Welcome to planned vs unplanned downtime; in many cases, a 10 hour outage can still give you a 100% availability if you planned that outage. What they're probably quoting is 0.0000001 unplanned downtime.

    Lies, damned lies and availability stats...

  • A few remarks (Score:3, Informative)

    by bmajik ( 96670 ) <matt@mattevans.org> on Tuesday September 21, 2004 @06:24PM (#10313691) Homepage Journal
    1) this is not a windows OS bug

    GetTickCount() will roll over. An _application_ which assumes it is a strictly increasing value will misbehave after the 40-some-odd days expire. That appears to be what is happening here.

    Note that nowhere in the article is there a distinction between the "system" and the "OS" or the "application".

    2) Regardless of where the fault is (hint: it's not in Windows), it is not unreasonable for a machine to need servicing. Aircraft engines are serviced at hour-based intervals, whether they need it or not. It's better to just tear the thing down and rebuild it than to have it tear itself apart. Software doesn't _have_ to be this way, but it sometimes is.

    Making a complete hardware -> app layer stack 100% failsafe is... tricky. For some applications, designing the system with a known restart point, i.e. a reboot of the app or the entire machine, can be more cost-effective (see the earlier paper on crash-only software design). A periodic shutdown/restart in complicated systems can be a valid operational practice.

    The fault here is twofold: one, the application/system had a known issue that is probably avoidable but, for whatever reason, still has the issue.

    Two, knowing that the issue existed, the proper maintenance was not observed, with the expected result: a failure.

    Only in America do you get away with blaming Audi for oil sludge problems when you don't change your oil every maintenance interval.

    If the system called for a 48th day restart, that's what it requires, and deviation from that has consequences. Luckily no one was hurt.

  • by LostCluster ( 625375 ) * on Tuesday September 21, 2004 @06:28PM (#10313721)
    As many others have pointed out here, it's the same bug that brought down Windows 9x reappearing.

    Just like the "Y2K glitch" was a platform-independent problem based upon the 2-digit-year shorthand causing logical flaws, if you store time in a 32-bit variable by the millisecond... you'll hit the hard limit after about 49.7 days, which is why that number can show up in kernels other than Win9x. If there's no proper handling of that rollover, things go haywire.
  • Re:Retard (Score:5, Informative)

    by Keith Russell ( 4440 ) on Tuesday September 21, 2004 @06:32PM (#10313757) Journal

    Search Microsoft's Knowledge Base for "49.7 days", and you'll find a few bugs, all of them related to storing uptime in milliseconds in an unsigned 32-bit integer. Two were reported in Windows 2000:

    That rpcss.exe issue looks like a prime suspect. The OS doesn't crash, but, given the time-sensitive nature of air traffic control data, it's quite possible that the applications running on that server would degrade to the point of failure.

    Both look like they were found, or at least entered into the KB, after the release of Windows 2000 Service Pack 4 (Nov. 2003), and hotfixes are available for both.

    Note to Microsoft (or anyone else storing milliseconds, for that matter): unsigned 64-bit int! Instead of having to reboot every 49.7 days, you'll have to reboot every 213,503,982,334 days, give or take a leap-second.
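
The 64-bit figure is easy to check. A minimal sketch in Python; `rollover_days` is a hypothetical helper name, and it simply assumes the counter ticks once per millisecond.

```python
MS_PER_DAY = 24 * 60 * 60 * 1000  # 86,400,000 milliseconds in a day


def rollover_days(bits):
    """Whole days before an unsigned millisecond counter of this width wraps."""
    return 2**bits // MS_PER_DAY


print(rollover_days(32))  # 49
print(rollover_days(64))  # 213503982334
```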

  • Re:Why 49.7 days? (Score:5, Informative)

    by AK Marc ( 707885 ) on Tuesday September 21, 2004 @06:33PM (#10313768)
    and yes, 49.7 is an approximate value

    The exact value is 49 and 59,929/84,375 days, or 49 days, 17 hours, 2 minutes, and 47.296 seconds (exact).
    Hey, news for nerds, what did you expect...
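
For anyone who wants to verify the exact figure, a short sketch in Python that breaks 2^32 milliseconds into days, hours, minutes and seconds. The variable names are arbitrary.

```python
from fractions import Fraction

MS_PER_DAY = 86_400_000
total_ms = 2**32  # number of distinct values in a 32-bit millisecond counter

days, rem_ms = divmod(total_ms, MS_PER_DAY)
day_fraction = Fraction(rem_ms, MS_PER_DAY)  # leftover part of a day, reduced
hours, rem_ms = divmod(rem_ms, 3_600_000)
minutes, rem_ms = divmod(rem_ms, 60_000)
seconds = rem_ms / 1000

print(days, day_fraction)       # 49 59929/84375
print(hours, minutes, seconds)  # 17 2 47.296
```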
  • Re:32 bit timer (Score:3, Informative)

    by djwolf ( 6102 ) on Tuesday September 21, 2004 @06:34PM (#10313773)
    The timer has not been widened to 64 bits; it hasn't been changed for the sake of API compatibility. Microsoft does give you some warning though:

    GetTickCount

    The GetTickCount function retrieves the number of milliseconds that have elapsed since the system was started. It is limited to the resolution of the system timer. To obtain the system timer resolution, use the GetSystemTimeAdjustment function.

    DWORD GetTickCount(void);

    Parameters
    This function has no parameters.
    Return Values
    The return value is the number of milliseconds that have elapsed since the system was started.

    Remarks
    The elapsed time is stored as a DWORD value. Therefore, the time will wrap around to zero if the system is run continuously for 49.7 days.

    If you need a higher resolution timer, use a multimedia timer or a high-resolution timer.

    To obtain the time elapsed since the computer was started, retrieve the System Up Time counter in the performance data in the registry key HKEY_PERFORMANCE_DATA. The value returned is an 8-byte value. For more information, see Performance Monitoring.

    Example Code
    The following example demonstrates how to use this function to wait for a time interval to pass. Due to the nature of unsigned arithmetic, this code works correctly if the return value wraps one time. If the difference between the two calls to GetTickCount is more than 49.7 days, the return value could wrap more than one time and this code will not work; use the system time instead.

    DWORD dwStart = GetTickCount();

    // Stop if this has taken too long
    if( GetTickCount() - dwStart >= TIMELIMIT )
        Cancel();
    Note that TIMELIMIT is defined as the time interval of interest to the application, in milliseconds.

    Requirements
    Client: Requires Windows XP, Windows 2000 Professional, Windows NT Workstation, Windows Me, Windows 98, or Windows 95.
    Server: Requires Windows Server 2003, Windows 2000 Server, or Windows NT Server.
    Header: Declared in Winbase.h; include Windows.h.
    Library: Use Kernel32.lib.
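
The unsigned-subtraction behaviour described in the quoted remarks can be simulated outside Windows. A minimal sketch in Python, where masking to 32 bits stands in for DWORD arithmetic; `elapsed_ms` is a hypothetical helper name, not part of any Windows API.

```python
MASK32 = 0xFFFFFFFF
WRAP = 2**32  # number of distinct tick values before the counter restarts


def elapsed_ms(start_tick, now_tick):
    """Elapsed milliseconds, computed the way unsigned 32-bit subtraction would."""
    return (now_tick - start_tick) & MASK32


start = WRAP - 1000            # one second before the counter wraps
now = (start + 5000) % WRAP    # five seconds later the raw tick value is 4000
print(now)                     # 4000
print(elapsed_ms(start, now))  # 5000: still correct across a single wrap

# The limitation the documentation warns about: after a full extra wrap
# the same subtraction cannot tell 49.7 days plus 5 s from plain 5 s.
later = (start + WRAP + 5000) % WRAP
print(elapsed_ms(start, later))  # 5000
```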
  • Re:What?! (Score:3, Informative)

    by drew ( 2081 ) on Tuesday September 21, 2004 @06:41PM (#10313844) Homepage
    Funniest thing is that was actually the ad I saw when I read one of the linked articles :)
  • by Ayanami Rei ( 621112 ) * <rayanami AT gmail DOT com> on Tuesday September 21, 2004 @06:48PM (#10313928) Journal
    It's probably not a Microsoft problem if the system is running on NT; it uses a 64-bit time.

    It _could_ be that an important part of the system is running Windows 95 interfaced to a 2k domain that implements the rest of the system.
    That really isn't Microsoft's fault that they didn't patch that critical machine to fix the flaw... or that they felt they needed to run Windows 95 (gag) in such a critical portion of the system.

    It _could_ be that a user-land air traffic control related application itself calls a deprecated API to return the time in milliseconds, which overflows/wraps around, causing the software to crash.
    OR
    It _could_ be that the user-land air traffic control software just mis-casts the time from the modern API into a 32-bit data structure, which wraps around, causing the software to crash.
    In the latter two cases the article writer or LAX's press staff may have incorrectly drawn the connection to the famous Windows 95 problem... even when it wasn't Microsoft's fault in that case.

    I really don't see how Microsoft could be to blame here at all...
  • by Teahouse ( 267087 ) on Tuesday September 21, 2004 @07:01PM (#10314038)
    Pilot here, and this has been a well-known peccadillo of the tracking system for SoCal Approach for a few years. It's an application problem that came into being after an upgrade of the application, not the OS. It's a memory allocation error that retains some of the old tracking on the system, thus, the whole box needs to be rebooted every 45 days or the memory overloads and crashes the OS. Look guys, I'm a Linux user and all, but let's not run around blaming M$ for problems with buggy software apps.

  • by Anonymous Coward on Tuesday September 21, 2004 @07:04PM (#10314072)
    I'm pretty sure that the stat is still blown wide open. With an allowable downtime of 30 seconds per year, the recent ~12600 second outage means they are probably not at the promised .9999999 uptime, unless the system was actually brought up four hundred years ago, and this was its first unplanned outage.
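
A quick check of the arithmetic, as a sketch in Python. The helper names are made up, and a 365.25-day year is assumed. Note that a budget of 30 seconds per year is closer to six nines (0.999999); seven nines (.9999999) allows only about 3.2 seconds per year, which makes the point even stronger.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600  # 31,557,600 s, using an average year


def allowed_downtime_s(nines):
    """Yearly downtime budget in seconds for an availability with this many nines."""
    return SECONDS_PER_YEAR / 10**nines


def years_to_absorb(outage_s, budget_s_per_year):
    """Years of otherwise perfect uptime needed to stay within budget."""
    return outage_s / budget_s_per_year


print(allowed_downtime_s(6))       # roughly 31.6 seconds per year
print(allowed_downtime_s(7))       # roughly 3.2 seconds per year
print(years_to_absorb(12600, 30))  # 420.0
print(years_to_absorb(12600, allowed_downtime_s(7)))  # nearly 4000 years
```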
  • Re:Repent, Sinners! (Score:2, Informative)

    by jurv!s ( 688306 ) on Tuesday September 21, 2004 @07:17PM (#10314169) Journal
    In my labs, users logged in on the console can reboot without sudo. Anything less would be uncivilized!

    (P.S. man console.apps and pam_console)
  • by Anonymous Coward on Tuesday September 21, 2004 @07:26PM (#10314240)
    I used to write aviation message handling systems. We migrated from Tru64 (now extinct) to Linux and have had much better performance, maintainability, hardware support, and reliability.

    Of course, the code leap from Tru64 to Linux is quite small, which is the biggest reason why Linux was chosen.

    Aviation expects 99.9999% uptime with absolutely no message loss, and we would achieve that with hot-standbys and MySQL mirroring. All circuits were split and would simultaneously enter both servers. Only the primary server would route the message.

    No, we didn't require the customer to reboot. The system could run for years at a time.

    Putting mission critical applications on Windows 95 is just plain stupid.
  • Re:Heh (Score:5, Informative)

    by Michael Woodhams ( 112247 ) on Tuesday September 21, 2004 @07:42PM (#10314381) Journal
    There is a rather more extreme case of this with the FAA - when first deployed, the cargo doors of the DC-10 were unsafe, with a failure mode that was likely to make the plane uncontrollable in flight.

    This occurred in flight, and through luck (which allowed some degree of control) and extraordinary airmanship, the plane was landed safely. (This is known as "The Windsor Incident.")

    McDonnell-Douglas didn't want to do a proper redesign of the door mechanism, and the FAA head was a 'companies know best' political appointee, so the result was McD added little windows to the door so that the guy closing the door could look to see it had all engaged properly. (This was over vigorous opposition by the NTSB, who recognized the inadequacy of the fix.)

    The situation: A single failure (not looking, or looking but not noticing an unsafe condition) by a non-safety-trained, close-to-minimum-wage employee could cause the deaths of hundreds of people.

    Result: over 300 dead when a Turkish Airlines DC-10 crashed near Paris. The guy who closed the door hadn't even been told he was supposed to check the little windows.

    Safety critical systems must be tolerant of human error. If a single omission by a human leads to a hazardous situation, this is primarily the fault of the system, not the human.
  • Re:Repent, Sinners! (Score:4, Informative)

    by Phillup ( 317168 ) on Tuesday September 21, 2004 @07:42PM (#10314390)
    Only if that is what you have run level 6 configured to do.

    All the init 6 command does is initialize run level 6. You can have run level 6 configured any way you want.

    It isn't hard wired to shut down. (On debian run level 6 does a reboot... run level 0 halts the system.)
  • by agallagh42 ( 301559 ) on Tuesday September 21, 2004 @08:02PM (#10314544) Homepage
    "Since when does Windows 2000 include a "shutdown" command?"

    Uh, since about 2000 I believe.:)
    C:\>shutdown /?
    Usage: shutdown [-i | -l | -s | -r | -a] [-f] [-m \\computername] [-t xx] [-c "c
    omment"] [-d up:xx:yy]

    No args Display this message (same as -?)
    -i Display GUI interface, must be the first option
    -l Log off (cannot be used with -m option)
    -s Shutdown the computer
    -r Shutdown and restart the computer
    -a Abort a system shutdown
    -m \\computername Remote computer to shutdown/restart/abort
    -t xx Set timeout for shutdown to xx seconds
    -c "comment" Shutdown comment (maximum of 127 characters)
    -f Forces running applications to close without warning
    -d [u][p]:xx:yy The reason code for the shutdown
    u is the user code
    p is a planned shutdown code
    xx is the major reason code (positive integer less than 256)
    yy is the minor reason code (positive integer less than 65536)

    C:\>
  • by meme_police ( 645420 ) on Tuesday September 21, 2004 @08:14PM (#10314620)
    Spoken by someone who obviously hasn't adminned any enterprise UNIX servers.
  • Re:I Hate to Say It (Score:5, Informative)

    by AstroDrabb ( 534369 ) on Tuesday September 21, 2004 @08:15PM (#10314632)
    Funny, nowhere in the doc for GetTickCount() [microsoft.com] does it say it is deprecated and not to use it. The only thing it does say is "If you need a higher resolution timer, use a multimedia timer or a high-resolution timer." I don't know what the program needs since I did not write it nor have I seen the code. Maybe they didn't need a high-res timer and wanted a tick count for how long the system has been up? I don't think that is too much to ask from an OS.

    The GetSystemTimeAsFileTime() [microsoft.com] function retrieves the current system date and time. The information is in Coordinated Universal Time (UTC) format. It doesn't tell you how long the system has been up.

    Oh, and if MS did not think this is a problem why did they fix it in a WinNT service pack [microsoft.com]? Also, right in that link MS says

    Microsoft has confirmed that this is a problem in Windows NT 4.0 and Windows NT Server 4.0, Terminal Server Edition. This problem was first corrected in Windows NT 4.0 Service Pack 4.0 and Windows NT Server 4.0, Terminal Server Edition Service Pack 4.

    MS also didn't seem to fix it in Win2000 Server and their own engineers got hurt by it, specifically with Rpcss.exe [microsoft.com] which according to MS

    SYMPTOMS

    The Rpcss.exe process consumes 60 percent or more of CPU time, and system performance and network performance are affected. This symptom typically occurs 49.7 days after the server is started.
    CAUSE
    This problem occurs because a call to the GetTickCount timer function causes the function to overflow 49.7 days after the server is started.
    If GetTickCount is "deprecated" as you state, why in the world are MS's own programmers using it in rpcss.exe? According to this site [liutilities.com]
    rpcss.exe is an executable of Microsoft Windows Operating System. It is responsible for Remote Procedure Call services on the local machine. These are public services available to the local network.
    This program is important for the stable and secure running of your computer and should not be terminated.

    Still not convinced and want to apologize for MS? Well, here is some more of MS's software that is affected by it in Windows 2000 servers (what this FAA project is using).
    Print Spooler Stops Scheduling Print Jobs [microsoft.com]

    The Print Spooler service may stop scheduling print jobs to specific Simple Port Monitor (SPM) ports. Although incoming jobs are queuing into the spooler, print jobs may not start. Note that this symptom occurs 49.7 days after you start the Print Spooler service.

    There are a bunch of MS apps affected by this logic flaw [microsoft.com] that has been passed from version to version of MS OSes. If this flaw affected all these MS developers who have far more access to proprietary docs, I don't see how other developers would not stumble over it as well since they do not have access to the proprietary OS.

  • Patriot bug details (Score:3, Informative)

    by Animats ( 122034 ) on Tuesday September 21, 2004 @08:21PM (#10314674) Homepage
    That was a bad bug. It didn't cause system crashes. It caused missile misses. This bug was responsible for an interception failure which allowed an incoming Scud missile to hit a barracks in Saudi Arabia, killing 28 people.

    The radar and the guidance system had separate clocks, and they'd drift out of sync.

    Here's a detailed analysis by the General Accounting Office [fas.org].

  • by agallagh42 ( 301559 ) on Tuesday September 21, 2004 @08:35PM (#10314785) Homepage
    "Nope. Windows 2000 server:
    C:\>shutdown /?
    'shutdown' is not recognized as an internal or external command, operable program or batch file."


    Well, you have to install the resource kit tools. You wouldn't want everything installed by default would you?
  • Re:Repent, Sinners! (Score:2, Informative)

    by Anonymous Coward on Tuesday September 21, 2004 @08:38PM (#10314811)
    But just don't be an apologist for Linux, it just makes "us" look hypocritical.

    >>I wasn't apoligizing. It makes perfect sense to me....

    >>So... WTF would I even have to apologize for? The fact that the parent associates it in his mind with shutting down?

    Down boy! Heel!

    apologist n. A person who argues in defense or justification of something, such as a doctrine, policy, or institution.

    All words that sound vaguely alike don't necessarily mean the same thing.

  • Exactly (Score:3, Informative)

    by autopr0n ( 534291 ) on Tuesday September 21, 2004 @09:03PM (#10314951) Homepage Journal
    Windows 2000 can stay up for more than 2^32 milliseconds, but software that depends on GetTickCount() being correct can't. That's probably what happened. They could have rewritten the software to use a 64-bit time variable, or they could have worked around the bug.

    They didn't, and that caused the crash. Not "buggy windows".

    The fact that they couldn't even figure out how to run a scheduled task in Windows to reboot the machine is just pathetic, and shows how incompetent they really are.
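
One way to read the "64-bit time variable" suggestion is to accumulate wrap-safe deltas from the 32-bit counter into a wider total. A rough sketch in Python under that assumption: `TickExtender` is a made-up name, the 32-bit readings are simulated, and the approach only works if the counter is sampled at least once per rollover period.

```python
MASK32 = 0xFFFFFFFF


class TickExtender:
    """Hypothetical helper that widens a wrapping 32-bit tick count."""

    def __init__(self, first_tick):
        self._last = first_tick & MASK32
        self.total_ms = 0  # Python ints do not overflow; in C this would be 64-bit

    def update(self, tick):
        """Feed the latest 32-bit reading; returns total elapsed milliseconds."""
        tick &= MASK32
        # Modular difference stays correct across a single wrap of the counter.
        self.total_ms += (tick - self._last) & MASK32
        self._last = tick
        return self.total_ms


ext = TickExtender(2**32 - 1000)  # start one second before rollover
print(ext.update(4000))           # 5000: the wrap is absorbed
print(ext.update(4000))           # 5000: no time passed, nothing added
```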
  • Re:Repent, Sinners! (Score:4, Informative)

    by multipartmixed ( 163409 ) * on Tuesday September 21, 2004 @09:06PM (#10314969) Homepage
    > since non-SQL formats like DBase have always been
    > a little funky when they start having to deal
    > with million-record tables.

    Oh, yes, SQL the magic bullet. I have a database problem! No matter what it is, I can solve it by migrating to a database system which uses SQL!

    > It's amazing how ugly legacy databases can be
    > compared to today's tech.

    Yes, today's tech! SQL, the magic bullet! Why, we should use Oracle! It's SQL and thus must be modern! It's only been around since 1979!

    Wait!

    1979 was a long time ago.

    Oh, dear?

    Could it be that Oracle is not modern tech? But, how could it not be? It uses SQL, the magic bullet!

    Hint: query language and scalability are not related.
    Hint II: RDBMS is no magic bullet, either.
  • Re:Repent, Sinners! (Score:1, Informative)

    by Darby ( 84953 ) on Tuesday September 21, 2004 @09:18PM (#10315045)
    Windows 2000 you could just schedule a reboot every month with the task scheduler. Win98 and ME have a scheduler also, but I've found that to be rather... unreliable.

    The W2K scheduler isn't reliable either as we recently found out.
    In the first place, you can *only* run scheduled tasks as the system user unless the user who has the task scheduled is actually logged in at the console. This means no non-system scheduled tasks can run if the system reboots. This means driving in to type your user name and password in. Pretty stupid to even bother scheduling in this situation.
    Second, you have to explicitly type in the admin password to the scheduler for each and every task you want to actually schedule (see above).
    Third, it has a habit of forgetting the password you typed in causing all of your scheduled tasks to fail.

    I've never been a fan of Windows, but these recent discoveries led me to the conclusion that it seriously is a toy single user operating system.

    Don't even get me started on the fact that .NET doesn't even support simple basic internet protocols like...say...FT freaking P.

  • by Anonymous Coward on Tuesday September 21, 2004 @11:13PM (#10315723)
    No version of Windows has been certified telecom carrier grade reliable (99.999%). All of Microsoft's programmers and billions can't make Windows reliable. Microsoft won't even attempt to pass the certified telecom carrier grade test. There are versions of Linux and embedded Linux that are certified telecom carrier grade reliable.
    There has been a serious security flaw in Windows NT 4.0 for a couple of years that has not been fixed. What is Microsoft's solution? Let support for Windows NT 4.0 expire at the end of the year; then Microsoft won't have to fix the serious security flaw. Linux 2.0 (which is as old as Windows NT 4.0), 2.2, 2.4 and 2.6 are still supported with the latest security patches.
  • Re:Heh (Score:3, Informative)

    by Michael Woodhams ( 112247 ) on Tuesday September 21, 2004 @11:38PM (#10315883) Journal
    Once (Chicago O'Hare, c1980.) Due to faulty maintenance procedures (now discontinued), lack of locking on slats (now fixed) and engine-out-on-takeoff procedures that sacrificed air speed for altitude.

    There are three DC-10 crashes (that I can think of offhand) that could reasonably be blamed at least partially on the design of the plane: we've mentioned two (Paris, Chicago). The third is Sioux City, where an uncontained engine failure in cruise disabled all three hydraulic systems. The plane crash landed with (from memory) about 110 deaths and 180 survivors.

    Other planes of similar size and age (Lockheed L1011 tristar, 747) had four hydraulic systems. Had the DC-10 had four *and* (that is a big 'and') the fourth had not been disabled, it is unlikely there would have been any deaths. (A 747 once had 3 out of 4 hydraulic systems disabled on takeoff, and landed safely.)

    In terms of safety, I'd be more worried about any model of airplane less than a few years old than I'd be about a well maintained DC-10. Let other people find the surprises first.
  • by Anonymous Coward on Wednesday September 22, 2004 @07:35AM (#10317301)
    The only flaw is that the consulting company that wrote the software was incompetent. They used the GetTickCount API, which returns the number of milliseconds since the system was brought up in an unsigned 32-bit value. The documentation clearly states [microsoft.com] that this value will roll over to 0 and continue counting from there after 49.7 days. The documentation also mentions timers with higher resolution as well as better places to get system uptime as a 64-bit value.

    The only reason rebooting Windows was necessary was because this tick value is tracked by the OS and not the application, so restarting the application would not prevent the software bug from causing problems. But the flaw is certainly in the application for using the wrong API for the job.
  • by pdxChris ( 162827 ) on Wednesday September 22, 2004 @01:11PM (#10320039)
    In the mid 1980's, I knew a software engineer at Caltech's Jet Propulsion Laboratory who worked on a multi-year JPL project for the FAA. The project was to replace the obsolete voice communication system for air traffic controllers. The new system had touch screens with onscreen menus and buttons were dynamically reconfigured depending on the controller's workload. It worked correctly, and the engineer enjoyed describing to me how it worked. This was all before there was any version of Windows. If I recall correctly, they developed on MODCOMP minicomputers running VMS but deployed on an embedded system with an in-house design for task switching, not a complete OS. I might be fuzzy about the technical details at this time, but a FOIA request should be able to retrieve them for the intensely curious.

    I do clearly remember that the working system was presented to the FAA in Monterey, and the FAA then terminated the contract and hired IBM to start over from scratch on a new system. Rumor was that this was a political payback. I should emphasize that's just a rumor I heard. Looks like Harris eventually got the contract. I wonder if any of the original code from JPL was ever deployed.
  • by tbogart ( 802762 ) <tjbogart33@gmail.com> on Thursday September 23, 2004 @03:27AM (#10326936)
    Don't get me wrong - I am not questioning that you seem familiar with the effect the problems have on operations. And of course it just shows good sense that as a pilot, you network (!) with the folks you depend on as you describe. But do you network with the programmers or the administrators? It still sounds like you are getting information at least two levels removed from the level where any real dirt is available. Perhaps an analogy would be talking to someone who works in the next office to the folks who supervise Air Traffic Controllers rather than the controllers themselves. Sure, if those folks are interested in aviation and ask the right questions they can gain reliable information, but it is not like going to the, er, appropriate end of the horse.

    FWIW, my father was a machinist/aircraft mechanic and finally technical writer who worked with oil company research labs on improving lubrication, publishing articles in their company publications including doing his own photomicroscopy to analyse corrosion effects.

    My first job out of school with an EE degree was at the Johnson Space Center training astronauts and sitting console. About 70% of the folks I worked with were either military pilots still flying in the reserves or private pilots (and I was fool enough to go do light aerobatics with some of them), plus of course the flight crews. While there, I started dealing with computers as they first started appearing in offices, and eventually went into full-time system administration/systems engineering, primarily for development groups and test labs.

    Now, the reason I blabbed on like that was to try to establish

    1) I am somewhat familiar with the aviation community from both the 'user' and 'support' aspects.

    2) I am somewhat familiar with the computer community, starting as a user, and moving into the support realm.

    3) I would claim that both the classes I wrote and taught, as well as the time spent on console, directly give me a somewhat intimate knowledge of translating information from one community into another. You generally don't explain an onboard system to a pilot the same way you would a PhD in EE, or a medical experiment to a pilot as you would an MD.

    One particular conclusion based on my experience in those worlds (and I know this is a bit of a generalization) is that when a pilot or any member of an air crew tells me something about their aircraft or its surrounding operations, I can probably bet on the information being pretty good.

    If a programmer or administrator tells me something about their program or system, before I put any stock in what they say (beyond my own experience in similar veins), I probe their background and quiz them as much as possible.

    If I wanted to be glib: if programmers/administrators had to go through the kind of training programs that pilots or even support personnel do, about 85% would not cut it. Or if these folks made it into the sky, they would be weeded out by the flaming holes in the ground they made.

    If, as I expect, your information is based on what an ATC heard from a guy down the hall, or maybe even was touching a computer, or even from a distilled briefing from the contractor - I would first have to ask how much that ATC knew about systems and programming and see how critically s/he processed (!) that data.

    If you even got the information directly from an admin/programmer (as you might guess by now), the same set of questions would apply.

    In either case, the point is to wonder aloud if you take that information as if it were coming from folks who are the caliber of the people you are used to relying on.

    Consider your description of the memory issue:
    "It's a memory allocation error that retains some of the old tracking on the system, thus, the whole box needs to be rebooted every 45 days or the memory overloads and crashes the OS."

    The typical memory allocation error doesn't have anything to do with old data still being in the system, but simply that m
