Windows Upgrade, FAA Error Cause LAX Shutdown
fname writes "The recent shutdown of LAX due to an FAA radio outage was apparently caused by a Windows 2000 integration flaw, possibly related to an old Windows 95 bug. An article at the LA Times claims that the outage was caused by human error, as the system will automatically shut down after 49.7 days (related to this Windows 95 flaw?), and a technician didn't reboot the system monthly as he should have. This happened after an upgrade from Unix to Windows. I don't think blame should be assigned to the technician who missed the task; rather, it seems a gross oversight for the FAA to guarantee that such a critical system will crash after only one missed maintenance task. Who's really at fault?"
Why is the FAA using off the shelf software? (Score:4, Informative)
But most off-the-shelf software has disclaimers expressly stating that it is not to be used in mission-critical situations. E.g.:
"technology is not fault tolerant and is not designed, manufactured, or intended for use or resale as on-line control equipment in hazardous environments requiring fail-safe performance, such as in the operation of nuclear facilities, aircraft navigation or communication systems, air traffic control, direct life support machines, or weapons systems, in which the failure of Java technology could lead directly to death, personal injury, or severe physical or environmental damage."
Re:Why not automate it? (Score:2, Informative)
Windows has an odd tendency to corrupt itself.
Why 49.7 days? (Score:5, Informative)
Because there are 4294080000 milliseconds in that time period, just short of the 4294967296 (2^32) milliseconds at which a 32-bit counter rolls over (and yes, 49.7 is an approximate value).
Very few Win95 systems ever made it that long without a reboot... but you would've thought that it would've been fixed by Windows 2000.
Re:Anyone want to clue them in to scheduled jobs? (Score:5, Informative)
Agreed. A well written AT script something like this: Each M T W Th R S Su 12:45 AM shutdown /l /r /y /c
Would do the trick... We have used that exact script for YEARS to nightly reboot a troublesome NT4 BDC at a remote location.
While we knew that this was not a great solution, no one needed to access the server at that time of night. Any right minded IT person should be able to see the flaw in the FAA's logic.
Re:Why is the FAA using off the shelf software? (Score:2, Informative)
Re:Why 49.7 days? (Score:5, Informative)
Re:Repent, Sinners! (Score:3, Informative)
So... it should not initialize (begin) run level 6?
Re:32 bit timer (Score:5, Informative)
Richard Riggs, an advisor to the technicians union, said the FAA - the American aviation regulator - had been planning to fix the program for some time. "They should have done it before they fielded the system," he said.
(emphasis added)
no such thing as a Windows 2000 49.7 day bug (Score:4, Informative)
The problem here is the software made by Harris does not handle a rollover of the GetTickCount() function turning back to 0. This function counts the number of milliseconds since the OS was last booted so it should be obvious to anybody that the returned unsigned 4 byte integer cannot go on forever.
So the badly written Harris software has this bug, and their solution (which was really not that bad a workaround) was to manually reboot the system every 30 days; as a fail-safe, they had a scheduled task to do a reboot on the 49th day just in case. The 49th day came because of procedural error.
There is nothing Microsoft could do to prevent this.
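To make the failure mode concrete, here is a minimal sketch in Python of an application that assumes the tick counter only ever increases. Everything in it is hypothetical (the helper names, the 5-second deadline, the sampling loop); it simulates a 32-bit millisecond counter and is not taken from the Harris software.

```python
MASK32 = 0xFFFFFFFF  # GetTickCount() returns an unsigned 32-bit DWORD


def tick_count(ms_since_boot: int) -> int:
    """Simulated tick counter: milliseconds since boot, truncated to 32 bits."""
    return ms_since_boot & MASK32


def naive_deadline_passed(now: int, deadline: int) -> bool:
    """Hypothetical buggy check that assumes the counter never decreases."""
    return now >= deadline


# Schedule work 5 s ahead, starting 2 s before the counter rolls over.
boot_ms = 2**32 - 2_000
deadline = tick_count(boot_ms) + 5_000  # stored without wrapping

# Sample the counter once per second for the next 10 s.
samples = [tick_count(boot_ms + s * 1_000) for s in range(11)]
fired = [naive_deadline_passed(now, deadline) for now in samples]
```

In this simulation the counter passes through zero at the third sample and the deadline check never fires, which is the kind of stall an application-level rollover bug would produce.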
I don't feel redeemed, I feel cheated... (Score:2, Informative)
Re:Now even the submitters aren't reading the arti (Score:1, Informative)
On a completely different subject, I move that any post containing a phrase along the lines of "This is going to get me moderated as Troll" be automatically moderated Troll. Too many of us seem to use it because it tends to lead to the opposite result.
Try telinit (Score:2, Informative)
Re:Check out this little pile of bullshit (Score:5, Informative)
Lies, damned lies and availability stats...
A few remarks (Score:3, Informative)
1) GetTickCount() will roll over. An _application_ which assumes it is a strictly increasing value will misbehave after the 40-some-odd days expire. That appears to be what is happening here.
Note that nowhere in the article is there a distinction between the "system" and the "OS" or the "application".
2) Regardless of where the fault is (hint: it's not in Windows), it is not unreasonable for a machine to need servicing. Aircraft engines are serviced at hour-based intervals, whether they need it or not. It's better to just tear the thing down and rebuild it than to have it tear itself apart. Software doesn't _have_ to be this way, but it sometimes is.
Making a complete hardware -> app layer stack 100% failsafe is... tricky. For some applications, designing the system with a known restart point, i.e. a reboot of the app or the entire machine, can be more cost effective (see the earlier paper on crash-only software design). A periodic shutdown/restart in complicated systems can be a valid operational practice.
The fault here is two fold - one, the application/system had a known issue that is probably avoidable, but for whatever reasons, it still has the issue.
Knowing that the issue existed, the proper maintenance was not observed, with the expected result: a failure.
Only in America do you get away with blaming Audi for oil sludge problems when you don't change your oil every maintenance interval.
If the system called for a 48th day restart, that's what it requires, and deviation from that has consequences. Luckily no one was hurt.
Re:2K is based on NT kernel (Score:5, Informative)
Just like the "Y2K glitch" was a platform-independent problem based upon the 2-digit-year shorthand causing logical flaws, if you store time in a 32-bit variable by the millisecond... you'll hit the hard limit after about 49.7 days, which is why that number can show up in kernels other than Win9x. If there's no proper handling of that rollover, things go haywire.
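The wrap period depends only on the counter width and the tick resolution, not on the platform. A small Python sketch of that arithmetic follows; the function name is made up for illustration.

```python
SECONDS_PER_DAY = 86_400


def rollover_seconds(bits: int, ticks_per_second: int) -> float:
    """Seconds until an unsigned counter of `bits` width wraps to zero."""
    return 2**bits / ticks_per_second


ms_days = rollover_seconds(32, 1_000) / SECONDS_PER_DAY  # millisecond ticks
us_minutes = rollover_seconds(32, 1_000_000) / 60        # microsecond ticks
```

With millisecond ticks a 32-bit counter lasts roughly 49.7 days; with microsecond ticks the same counter would wrap in a little over 71 minutes, so the 49.7-day figure points at a millisecond counter.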
Re:Retard (Score:5, Informative)
Search Microsoft's Knowledge Base for "49.7 days", and you'll find a few bugs, all of them related to storing uptime in milliseconds in an unsigned 32-bit integer. Two were reported in Windows 2000:
That rpcss.exe issue looks like a prime suspect. The OS doesn't crash, but, given the time-sensitive nature of air traffic control data, it's quite possible that the applications running on that server would degrade to the point of failure.
Both look like they were found, or at least entered into the KB, after the release of Windows 2000 Service Pack 4 (Nov. 2003), and hotfixes are available for both.
Note to Microsoft (or anyone else storing milliseconds, for that matter): unsigned 64-bit int! Instead of having to reboot every 49.7 days, you'll have to reboot every 213,503,982,334 days, give or take a leap-second.
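The 64-bit figure can be checked with integer arithmetic. This is only a sanity check of the number quoted above, written in Python; the variable names are illustrative.

```python
MS_PER_DAY = 24 * 60 * 60 * 1_000

days_32 = 2**32 / MS_PER_DAY    # 32-bit millisecond counter, in days
days_64 = 2**64 // MS_PER_DAY   # 64-bit millisecond counter, whole days
years_64 = days_64 / 365.25     # the same span in Julian years
```

The whole-day count for the 64-bit case matches the 213,503,982,334 days quoted in the post, which works out to several hundred million years.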
Re:Why 49.7 days? (Score:5, Informative)
The exact value is 49 and 59,929/84,375 days, or 49 days, 17 hours, 2 minutes, and 47.296 seconds (exact).
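For anyone who wants to verify the exact figure, a short Python sketch using exact rational arithmetic (assuming a counter that wraps after 2^32 milliseconds):

```python
from fractions import Fraction

MS_PER_DAY = 86_400_000

period = Fraction(2**32, MS_PER_DAY)  # rollover period in days, exact
whole_days, day_fraction = divmod(period, 1)

# Break the leftover part of the last day into hours, minutes and seconds.
remaining_ms = 2**32 - int(whole_days) * MS_PER_DAY
hours, remaining_ms = divmod(remaining_ms, 3_600_000)
minutes, remaining_ms = divmod(remaining_ms, 60_000)
seconds = Fraction(remaining_ms, 1_000)
```

The fractional day reduces to 59,929/84,375, and the remainder splits into 17 hours, 2 minutes and 47.296 seconds, in agreement with the values above.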
Hey, news for nerds, what did you expect...
Re:32 bit timer (Score:3, Informative)
GetTickCount
The GetTickCount function retrieves the number of milliseconds that have elapsed since the system was started. It is limited to the resolution of the system timer. To obtain the system timer resolution, use the GetSystemTimeAdjustment function.
DWORD GetTickCount(void);
Parameters
This function has no parameters.
Return Values
The return value is the number of milliseconds that have elapsed since the system was started.
Remarks
The elapsed time is stored as a DWORD value. Therefore, the time will wrap around to zero if the system is run continuously for 49.7 days.
If you need a higher resolution timer, use a multimedia timer or a high-resolution timer.
To obtain the time elapsed since the computer was started, retrieve the System Up Time counter in the performance data in the registry key HKEY_PERFORMANCE_DATA. The value returned is an 8-byte value. For more information, see Performance Monitoring.
Example Code
The following example demonstrates how to use this function to wait for a time interval to pass. Due to the nature of unsigned arithmetic, this code works correctly if the return value wraps one time. If the difference between the two calls to GetTickCount is more than 49.7 days, the return value could wrap more than one time and this code will not work; use the system time instead.
DWORD dwStart = GetTickCount();
if( GetTickCount() - dwStart >= TIMELIMIT )
    Cancel();
Note that TIMELIMIT is defined as the time interval of interest to the application, in milliseconds.
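The "works if it wraps once" remark can be illustrated without a Windows machine. The following Python sketch imitates DWORD subtraction by masking to 32 bits; the helper name and the sample values are assumptions chosen for illustration.

```python
MASK32 = 0xFFFFFFFF  # a DWORD is an unsigned 32-bit integer


def elapsed_ms(start: int, now: int) -> int:
    """Mimic DWORD subtraction: the difference is taken modulo 2**32."""
    return (now - start) & MASK32


# One wrap: the timer starts 3 s before rollover and is read 10 s later.
start = 2**32 - 3_000
now_after_wrap = (start + 10_000) & MASK32
one_wrap = elapsed_ms(start, now_after_wrap)

# More than 49.7 days between reads: a full 2**32 ms goes missing.
now_much_later = (start + 2**32 + 10_000) & MASK32
multi_wrap = elapsed_ms(start, now_much_later)
```

Both readings come out as 10,000 ms. The first is correct; the second silently drops an entire counter period, which is why the documentation suggests the system time for long intervals.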
Requirements
Client: Requires Windows XP, Windows 2000 Professional, Windows NT Workstation, Windows Me, Windows 98, or Windows 95.
Server: Requires Windows Server 2003, Windows 2000 Server, or Windows NT Server.
Header: Declared in Winbase.h; include Windows.h.
Library: Use Kernel32.lib.
Re:What?! (Score:3, Informative)
The article is light on details... (Score:5, Informative)
It _could_ be that an important part of the system is running Windows 95 interfaced to a 2k domain that implements the rest of the system.
That really isn't Microsoft's fault that they didn't patch that critical machine to fix the flaw... or that they felt they needed to run Windows 95 (gag) in such a critical portion of the system.
It _could_ be that a user-land air traffic control related application itself calls a deprecated API to return the time in milliseconds, which overflows/wraps around, causing the software to crash.
OR
It _could_ be that the user-land air traffic control software just mis-casts the time from the modern API into a 32-bit data structure, which wraps around, causing the software to crash.
In the latter two cases the article writer or LAX's press staff may have incorrectly drawn the connection to the famous Windows 95 problem... even when it wasn't Microsoft's fault in that case.
I really don't see how Microsoft could be to blame here at all...
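The mis-cast scenario is easy to picture. Below is a hypothetical Python sketch of storing a wider millisecond uptime in a 32-bit field; the function name is invented, and nothing here is known about the actual air traffic control software.

```python
MASK32 = 0xFFFFFFFF


def store_in_32_bits(ms: int) -> int:
    """Mimic assigning a 64-bit millisecond count to an unsigned 32-bit field."""
    return ms & MASK32


day_49 = 49 * 86_400_000  # uptime in ms after 49 days
day_50 = 50 * 86_400_000  # uptime in ms after 50 days

stored_49 = store_in_32_bits(day_49)
stored_50 = store_in_32_bits(day_50)
```

After 49 days the stored value is still intact, but after 50 days the truncated value is smaller than it was the day before, so time appears to jump backwards even though the underlying source never wrapped.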
It was the app, not the OS (Score:5, Informative)
Re:Check out this little pile of bullshit (Score:1, Informative)
still blown wide open. With an allowable downtime of 30 seconds per year, the recent ~12600 second outage means they are probably not at the promised
unless the system was actually brought up four hundred years ago, and this was its first unplanned outage.
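The downtime arithmetic is quick to check. A Python sketch using only the two figures from the post (30 seconds of allowed downtime per year and a roughly 12,600 second outage), plus an assumed 365.25-day year:

```python
SECONDS_PER_YEAR = 365.25 * 24 * 60 * 60  # 31,557,600 s, an assumed year length

allowed_per_year = 30  # seconds of downtime per year, from the post
outage = 12_600        # seconds, the approximate outage length

years_of_budget = outage / allowed_per_year
availability = 1 - allowed_per_year / SECONDS_PER_YEAR
```

The outage uses up 420 years of downtime budget, and a 30 second per year allowance corresponds to an availability of roughly 99.9999%.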
Re:Repent, Sinners! (Score:2, Informative)
(ps man console.apps and pam_console)
Re:Anyone want to clue them in to scheduled jobs? (Score:5, Informative)
Of course, the code leap from Tru64 to Linux is quite small, which is the biggest reason why Linux was chosen.
Aviation expects 99.9999% uptime with absolutely no message loss, and we would achieve that with hot-standbys and MySQL mirroring. All circuits were split and would simultaneously enter both servers. Only the primary server would route the message.
No, we didn't require the customer to reboot. The system could run for years at a time.
Putting mission critical applications on Windows 95 is just plain stupid.
Re:Heh (Score:5, Informative)
This occurred in flight, and through luck (which allowed some degree of control) and extraordinary airmanship, the plane was landed safely. (This is known as "The Windsor Incident.")
McDonnell-Douglas didn't want to do a proper redesign of the door mechanism, and the FAA head was a 'companies know best' political appointee, so the result was McD added little windows to the door so that the guy closing the door could look to see it had all engaged properly. (This was over vigorous opposition by the NTSB, who recognized the inadequacy of the fix.)
The situation: A single failure (not looking, or looking but not noticing an unsafe condition) by a non-safety trained close to minimum wage employee could cause the deaths of hundreds of people.
Result: over 300 dead when a Turkish Airlines DC-10 crashed near Paris. The guy who closed the door hadn't even been told he was supposed to check the little windows.
Safety critical systems must be tolerant of human error. If a single omission by a human leads to a hazardous situation, this is primarily the fault of the system, not the human.
Re:Repent, Sinners! (Score:4, Informative)
All the init 6 command does is initialize run level 6. You can have run level 6 configured any way you want.
It isn't hard wired to shut down. (On debian run level 6 does a reboot... run level 0 halts the system.)
Re:Anyone want to clue them in to scheduled jobs? (Score:3, Informative)
Uh, since about 2000 I believe. :)
Re:Not necessarily Windows' fault (Score:3, Informative)
Re:I Hate to Say It (Score:5, Informative)
The GetSystemTimeAsFileTime() [microsoft.com] function retrieves the current system date and time. The information is in Coordinated Universal Time (UTC) format. It doesn't tell you how long the system has been up.
Oh, and if MS did not think this is a problem why did they fix it in a WinNT service pack [microsoft.com]? Also, right in that link MS says
MS also didn't seem to fix it in Win2000 Server and their own engineers got hurt by it, specifically with Rpcss.exe [microsoft.com] which according to MS
If GetTickCount is "deprecated" as you state, why in the world are MS's own programmers using it in rpcss.exe? According to this site [liutilities.com]
Still not convinced and want to apologize for MS? Well, here is some more of MS's software that is affected by it in Windows 2000 servers (what this FAA project is using).
Print Spooler Stops Scheduling Print Jobs [microsoft.com]
There are a bunch of MS apps affected by this logic flaw [microsoft.com] that has been passed from version to version of MS OSes. If this flaw affected all these MS developers who have far more access to proprietary docs, I don't see how other developers would not stumble over it as well since they do not have access to the proprietary OS.
Patriot bug details (Score:3, Informative)
The radar and the guidance system had separate clocks, and they'd drift out of sync.
Here's a detailed analysis by the General Accounting Office [fas.org].
Re:Anyone want to clue them in to scheduled jobs? (Score:2, Informative)
C:\>shutdown
'shutdown' is not recognized as an internal or external command, operable program or batch file."
Well, you have to install the resource kit tools. You wouldn't want everything installed by default would you?
Re:Repent, Sinners! (Score:2, Informative)
>>I wasn't apoligizing. It makes perfect sense to me....
>>So... WTF would I even have to apologize for? The fact that the parent associates it in his mind with shutting down?
Down boy! Heel!
apologist n. A person who argues in defense or justification of something, such as a doctrine, policy, or institution.
All words that sound vaguely alike don't necessarily mean the same thing.
Exactly (Score:3, Informative)
They didn't, and that caused the crash. Not "buggy windows".
The fact that they couldn't even figure out how to run a scheduled task in Windows to reboot the machine is just pathetic, and shows how incompetent they really are.
Re:Repent, Sinners! (Score:4, Informative)
> a little funky when they start having to deal
> with million-record tables.
Oh, yes, SQL the magic bullet. I have a database problem! No matter what it is, I can solve it by migrating to a database system which uses SQL!
> It's amazing how ugly legacy databases can be
> compared to today's tech.
Yes, today's tech! SQL, the magic bullet! Why, we should use Oracle! It's SQL and thus must be modern! It's only been around since 1979!
Wait!
1979 was a long time ago.
Oh, dear?
Could it be that Oracle is not modern tech? But, how could it not be? It uses SQL, the magic bullet!
Hint: query language and scalability are not related.
Hint II: RDBMS is no magic bullet, either.
Re:Repent, Sinners! (Score:1, Informative)
The W2K scheduler isn't reliable either as we recently found out.
In the first place, you can *only* run scheduled tasks as the system user unless the user who has the task scheduled is actually logged in at the console. This means no non-system scheduled tasks can run if the system reboots. This means driving in to type your user name and password in. Pretty stupid to even bother scheduling in this situation.
Second, you have to explicitly type in the admin password to the scheduler for each and every task you want to actually schedule (see above).
Third, it has a habit of forgetting the password you typed in causing all of your scheduled tasks to fail.
I've never been a fan of Windows, but these recent discoveries led me to the conclusion that it seriously is a toy single user operating system.
Don't even get me started on the fact that
Windows is NOT telecom carrier grade reliable (Score:1, Informative)
There has been a serious security flaw in Windows NT 4.0 for a couple of years that has not been fixed. What is Microsoft's solution? Let support for Windows NT 4.0 expire at the end of the year; then Microsoft won't have to fix the serious security flaw. Linux 2.0 (which is older than Windows NT 4.0), 2.2, 2.4 and 2.6 are still supported with the latest security patches.
Re:Heh (Score:3, Informative)
There are three DC-10 crashes (that I can think of offhand) that could reasonably be blamed at least partially on the design of the plane: we've mentioned two (Paris, Chicago). The third is Sioux City, where an uncontained engine failure in cruise disabled all three hydraulic systems. The plane crash landed with (from memory) about 110 deaths and 180 survivors.
Other planes of similar size and age (Lockheed L-1011 TriStar, 747) had four hydraulic systems. Had the DC-10 had four *and* (that is a big 'and') the fourth had not been disabled, it is unlikely there would have been any deaths. (A 747 once had 3 out of 4 hydraulic systems disabled on takeoff, and landed safely.)
In terms of safety, I'd be more worried about any model of airplane less than a few years old than I'd be about a well maintained DC-10. Let other people find the surprises first.
Re:flaw isn't in Windows (Score:1, Informative)
The only reason rebooting Windows was necessary was because this tick value is tracked by the OS and not the application, so restarting the application would not prevent the software bug from causing problems. But the flaw is certainly in the application for using the wrong API for the job.
JPL had a working system for the FAA around 1985 (Score:2, Informative)
I do clearly remember that the working system was presented to the FAA in Monterey, and the FAA then terminated the contract and hired IBM to start over from scratch on a new system. Rumor was that this was a political payback. I should emphasize that's just a rumor I heard. Looks like Harris eventually got the contract. I wonder if any of the original code from JPL was ever deployed.
Re:It was the app, not the OS (Score:2, Informative)
FWIW, my father was a machinist/aircraft mechanic and finally technical writer who worked with oil company research labs on improving lubrication, publishing articles in their company publications including doing his own photomicroscopy to analyse corrosion effects.
My first job out of school with an EE degree was at the Johnson Space Center training astronauts and sitting console. About 70% of the folks I worked with were either military pilots still flying in the reserves or private pilots (and I was fool enough to go do light aerobatics with some of them), plus of course the flight crews. While there, I started dealing with computers as they first started appearing in offices, and eventually went into full time system administration/systems engineering, primarily for development groups and test labs.
Now, the reason I blabbed on like that was to try to establish
1) I am somewhat familiar with the aviation community from both the 'user' and 'support' aspects.
2) I am somewhat familiar with the computer community, starting as a user, and moving into the support realm.
3) I would claim that both the classes I wrote and taught, as well as the time spent on console, directly give me a somewhat intimate knowledge of translating information from one community into another. You generally don't explain an onboard system to a pilot the same way you would to a PhD in EE, or a medical experiment to a pilot as you would to an MD.
One particular conclusion based on my experience in those worlds (and I know this is a bit of a generalization) is that when a pilot or any member of an air crew tells me something about their aircraft or its surrounding operations, I can probably bet on the information being pretty good.
If a programmer or administrator tells me something about their program or system, before I put any stock in what they say (beyond my own experience in similar veins), I probe their background and quiz them as much as possible.
If I wanted to be glib, if programmers/administrators had to go through the kind of training programs that pilots or even support personnel do, about 85% would not cut it. Or if these folks made it into the sky, they would be weeded out by the flaming holes in the ground they made.
If, as I expect, your information is based on what an ATC heard from a guy down the hall, or maybe even was touching a computer, or even from a distilled briefing from the contractor - I would first have to ask how much that ATC knew about systems and programming and see how critically s/he processed (!) that data.
If you even got the information directly from an admin/programmer, (as you might guess by now), the same set of questions would apply.
In either case, the point is to wonder aloud if you take that information as if it were coming from folks who are the caliber of the people you are used to relying on.
Consider your description of the memory issue:
"It's a memory allocation error that retains some of the old tracking on the system, thus, the whole box needs to be rebooted every 45 days or the memory overloads and crashes the OS."
The typical memory allocation error doesn't have anything to do with old data still being in the system, but simply that m