Comair Done In by 16-Bit Counter
Gogo Dodo writes "According to the Cincinnati Post, the Comair system crash was caused by an overflowed 16-bit counter. Perhaps Comair should have paid for the software upgrade to MaestroCrew." You heard it here first...
From Another article... (Score:5, Interesting)
"The computer failure that grounded an airline's entire fleet over the Christmas weekend and stranded thousands of travelers was due to creaky software that couldn't count higher than 32,768."
According to the Post, the software -- which tracks all details of crew scheduling, including how long they have flown (an FAA regulation restricts airtime), and logs every change -- has a 16-bit counter that limits the number of changes to 32,768 in any given month.
To be fair (although it's not an excuse), 32K crew changes in a month? That's like 1,000 a day. That's crazy!...
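For anyone wondering what "couldn't count higher" looks like in practice, here is a minimal sketch. Nobody in this thread has seen the real code, so it simply assumes the change counter was a plain signed 16-bit integer; `bump` and `per_day` are made-up names for illustration. Python's `ctypes.c_int16` wraps the same way a two's-complement 16-bit field does.

```python
import ctypes

INT16_MAX = 2**15 - 1  # 32767, the largest value a signed 16-bit integer can hold


def bump(counter: int) -> int:
    """Record one more change in a counter stored as a signed 16-bit field.

    Assumption: the field is a plain signed 16-bit int. ctypes.c_int16 does
    no overflow checking, so the stored value wraps around silently.
    """
    return ctypes.c_int16(counter + 1).value


last_good = bump(INT16_MAX - 1)   # 32767, still fine
wrapped = bump(INT16_MAX)         # -32768, the counter is now garbage

# The "1,000 a day" estimate above: 2**15 changes spread over a 31-day month.
per_day = 2**15 / 31              # about 1057
```

Strictly speaking the largest signed 16-bit value is 32,767; 32,768 is the number of non-negative values (0 through 32,767) such a counter can represent.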
Let's not be too hard.. (Score:5, Interesting)
"This probably seemed like plenty to the designers, but when the storms hit last week, they caused many, many crew reassignments, and the value of 32,000 was exceeded," he said.
It's true, it was an extreme combination of circumstances... horrid weather (heck, there was snow in some Texas town for the first time in like 80 years or something, read it in some glurge article) coupled with the winter holidays. They should redesign their system and admit that they've grown to a level where their system is unable to handle extreme circumstances, and this should serve as a great wake-up call for them.
In the past I've always chuckled at the thought of 'upgrading for the sake of upgrading', but I suppose this is one case where an earlier upgrade could have saved them millions and made a lot of people's holidays better.
How old was this software? (Score:2, Interesting)
New CIO? (Score:2, Interesting)
"Well... we were holistically mitigating our financial stance outside the box of current processes while trying to forecast our future technological stability within the transport industry."
"Well... you're fired! NEXT?!"
Re:From Another article... (Score:3, Interesting)
There was a high profile example of this problem (Score:5, Interesting)
In any case, this is one of those incidents like the Therac-25 accidents that experienced programmers should always have in mind.
Happened to me too (Score:5, Interesting)
The upshot was that I was able to convince management to totally scrap it and allow me to write a new one. The downside was that the idiot who designed the original system went on to spend 100 million dollars on a new, grandiose system that was also eventually scrapped, but he knew long before that his turkey wasn't going to fly, so he quit and became a lead architect at some other company.
*Sigh*...okay, back to coding.
Re:Bugtraq covered this as well.. (Score:1, Interesting)
Sure, we had serial terminals instead of graphical displays, but still, a 16-bit signed int for memory conservation in 1989? That's a load of BS.
I wouldn't be surprised if they did it because they never gave it a second thought to use an int, or they needed to save on-disk storage space. Maybe their database backup wouldn't fit onto a 1.2MB 5.25" floppy otherwise.
Re:Comair? (Score:2, Interesting)
There probably isn't any reason to. Comair, as a regional jet carrier, has different crew contracts and crew rules from Delta, a mainline carrier. They operate completely different types of jets, with different crew staffing requirements. The FAA crew rules might even be different. Merging the Comair and Delta systems might make sense from a consolidation standpoint, but in reality there would be no interaction and no overlap between the two (an RJ pilot isn't suddenly going to jump over to fly a 757), so the expense isn't worth it.
p.s. I've traveled through CVG, on Delta, during the holidays. Not anymore... One weather-delayed flight and the whole system falls apart.
Then I hope you also avoid United/United Express/Ted at O'Hare/Denver, Continental at Newark/Houston, Northwest in Detroit, USAirways in Cincinnati, American at O'Hare/Dallas... etc. etc. Every airline, not just Delta, uses hubs, and ground stops at any of these airports will cause significant delays. That's just the reality of air travel these days; if you're really worried, book non-stop travel (and pay up to 10x more).
Re:Bugtraq covered this as well.. (Score:2, Interesting)
Hell, just putting the word "unsigned" in front of "int" (do you really need to tally a negative number of crew reassignments?) would have prevented this particular problem and given double the capacity, all else being equal. If you're worried about memory usage, it's certainly not a good idea to waste a bunch of bits on unneeded signs.
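To put numbers on the "unsigned" point, here is a small sketch. It assumes a 16-bit field, as reported in the article; the helper names and the 40,000 figure are hypothetical. `ctypes.c_int16` and `ctypes.c_uint16` mimic what a signed and an unsigned 16-bit field would read back.

```python
import ctypes

SIGNED_MAX = 2**15 - 1      # 32767
UNSIGNED_MAX = 2**16 - 1    # 65535


def as_signed16(n: int) -> int:
    """Value read back after storing n in a signed 16-bit field."""
    return ctypes.c_int16(n).value


def as_unsigned16(n: int) -> int:
    """Value read back after storing n in an unsigned 16-bit field."""
    return ctypes.c_uint16(n).value


busy_month = 40_000                            # hypothetical count of changes
signed_view = as_signed16(busy_month)          # -25536: wrapped into nonsense
unsigned_view = as_unsigned16(busy_month)      # 40000: still correct
next_wall = as_unsigned16(UNSIGNED_MAX + 1)    # 0: same bug, one bit later
```

So dropping the sign bit roughly doubles the headroom, but the counter still wraps silently once a month is bad enough to pass 65,535 changes.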
By 1989, we certainly knew that the world was in the habit of using software for ten years or more. Software was being modified all the time as larger memory spaces and requirements came along. It was practice long before that to be explicit about memory-related design decisions because you knew it would be your problem in five years to update the software. Unless you just ran away and quit before the problem came up, and it was somebody else's worry.
Re:New CIO? (Score:4, Interesting)
Probably never. Our CIO is an idiot when it comes to technology. He has a law degree from a big college. He earns six figures for sitting in his office and trading stocks (his stocks) all day. I'll never forget the day he picked up a WordPerfect Office 11 box, looked at it, and then said, "So Tom... who is it that makes Word?"
These guys are dumber than dirt, but they're well-connected in the "good ole boy club".
Re:Error checking is the real culprit (Score:3, Interesting)
In aviation there is a concept of "legality", which basically means that the FAA, in cooperation with the ICAO (www.icao.int), has set the allowable hours a pilot or F/A can work in a day. Since this is a crew scheduling application, the airline was no longer able to determine the flight status of its crews and by FAA regulations needed to stop flying until it could once again determine that status.
The only plan B the FAA allows is flight cancellation, unless the airline has an approved greaseboard process.
The int overflow was just the nail in Ben Franklin's famous poem:
For want of a nail a shoe was lost
for want of a shoe a horse was lost
for want of a horse a rider was lost
for want of a rider the battle was lost
for want of a battle the kingdom was lost
and all for the want of a horseshoe nail.
(or long counter)
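The "long counter" quip can be made concrete. This is only a sketch under assumptions: a 32-bit limit stands in for the "long", and `CounterFull`, `checked_bump`, and `days_of_headroom` are invented names, not anything from the real scheduling software. The idea is that a wider counter buys headroom, and a guard turns a silent wrap into a loud, recoverable error.

```python
INT32_MAX = 2**31 - 1  # largest value of a signed 32-bit counter


class CounterFull(Exception):
    """Raised instead of letting the counter wrap silently."""


def checked_bump(counter: int, limit: int = INT32_MAX) -> int:
    """Return counter + 1, or raise CounterFull once the limit is reached."""
    if counter >= limit:
        raise CounterFull(f"counter reached its limit of {limit}")
    return counter + 1


# Headroom check: even if every single day produced as many changes as the
# old counter could hold in a month (2**15), how many days would 32 bits last?
days_of_headroom = INT32_MAX // 2**15   # 65535 days
```

Whatever width is chosen, the guard is the part that matters: the caller gets an exception it can handle rather than a corrupted count.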
Re:Bugtraq covered this as well.. (Score:3, Interesting)
What's particularly disturbing is that nothing was done about this during the big Y2K push 5 years ago. Of course, the official goal of Y2K efforts was to make sure your computers didn't crash on 1/1/2000. But it's pretty hard to separate Y2K bugs from other clock bugs, and I think most places didn't even try. Easier to fix or document the bugs than to classify them. I was involved with the Y2K effort at SGI, and we looked at everything from leap year bugs to the Unix 32-bit clock overflow -- which won't occur until 2038!
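The 2038 date falls straight out of the arithmetic for a signed 32-bit count of seconds since the Unix epoch (1970-01-01 UTC). A quick sketch, with variable names chosen just for illustration:

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)
INT32_MAX = 2**31 - 1   # last second a signed 32-bit time value can hold

# The final representable moment before the counter overflows.
last_good_second = EPOCH + timedelta(seconds=INT32_MAX)

# One tick later a wrapped counter reads -2**31, which lands back in 1901.
after_wrap = EPOCH + timedelta(seconds=-(2**31))

print(last_good_second.isoformat())  # 2038-01-19T03:14:07+00:00
print(after_wrap.isoformat())        # 1901-12-13T20:45:52+00:00
```

It is the same failure as the crew counter, just with a much bigger number and a much longer fuse.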
SIGNED 16-bit!! (Score:3, Interesting)
Re:Well... (Score:3, Interesting)
But it would have been much easier to fix.
The problem here is that even though "with enough eyes, all bugs are shallow", this application is specialized enough that there might well not have been enough eyes. Still, if it were open, then the people who work there might have spent some time looking through it. And *MIGHT* have found the problem.
OTOH, with open source, when the problem manifested, it could have been debugged and recompiled starting immediately. This might well have saved them several days, certainly several hours.
OTOH, if it were FOSS, then it would be their job to maintain it. This would increase the chance of the problem being fixed, but decrease their ability to point fingers at someone else and say "It's all their fault!", which is the capability that they seem to most desire.
Blaming the unions for a bad management decision? (Score:3, Interesting)
Unions are also used to seeing promises that pay will go back to normal in good times broken. The reality of capitalism is that if you don't have your act together enough to meet a known wages bill your company is probably going to expire.
Unions are not the problem. Some unreasonable bastard in a particular union may be, but you get that in all kinds of organisations. It sounds like there is an "us or them" attitude going on, where each group hates the other, which can lead to all kinds of problems and the end of the company if it isn't sorted out.
The USA has all kinds of protections to stop better run airlines coming in from overseas to create even more intense competition. The land that gave us ValuJet and the mess that was United in its final years really needs to get its act together, stop blaming the unions, and see if they can do as well as any of a score of airlines that would be happy to come in as soon as deregulation happens and show how airlines work in the rest of the world.
When I go to the USA I'd better catch a bus; I bet the bus company's scheduling software is less than fifteen years old and has been updated if the company has grown. That's what most places do for business-critical applications, and it has nothing at all to do with unions.