Comair Done In by 16-Bit Counter
Gogo Dodo writes "According to the Cincinnati Post, the Comair system crash was caused by an overflowed 16-bit counter. Perhaps Comair should have paid for the software upgrade to MaestroCrew." You heard it here first...
Forget Y2k... (Score:5, Funny)
Re:Forget Y2k... (Score:2)
16 bits unsigned (unsigned short in C++ on x86) gives 2^16 values, from 0 to 65535.
Re:Forget Y2k... (Score:4, Informative)
It was a signed integer. The problem occurred at 2^15 (32768), although the article reported it as 32,000.
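The signed versus unsigned ranges discussed above can be sketched in a few lines of C. The function names `next_signed_16` and `next_unsigned_16` are hypothetical, and the signed wrap to a negative value assumes a two's-complement machine (the narrowing conversion is implementation-defined before C23). Nothing here is taken from the actual SBS code.

```c
#include <stdint.h>

/* Hypothetical helper: the value a signed 16-bit counter holds after one
 * more increment. The addition is done in a wider type; converting back to
 * int16_t wraps on two's-complement machines (implementation-defined
 * before C23). */
int16_t next_signed_16(int16_t counter)
{
    int32_t wide = (int32_t)counter + 1;
    return (int16_t)wide;
}

/* The unsigned equivalent wraps to 0 after 65535; this is well defined. */
uint16_t next_unsigned_16(uint16_t counter)
{
    return (uint16_t)(counter + 1u);
}
```

On common compilers, calling `next_signed_16(32767)` yields a negative number, which is the kind of value that would break any code using the counter as an array index or record id.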
Well... (Score:5, Funny)
Re:Well... (Score:3, Interesting)
But it would have been much easier to fix.
The problem here is that even though "with enough eyes, all bugs are shallow", this application is specialized enough that there might well not have been enough eyes. Still, if it were open, then the people who work there might have spent some time looking through it. And *MIGHT* have found the problem.
OTOH, with open source, when the problem manifested, it could have been debugged and recompiled starting immediately.
actually... (Score:2, Funny)
Re:actually... (Score:2)
Re:actually... (Score:5, Funny)
Common problem (Score:4, Insightful)
Re:Common problem (Score:4, Insightful)
Even if a system is outsourced, that doesn't give a company a 100% stable system. Frequently businesses define the type of system they want (hardware and software) and the amount they're willing to pay for it.
I work in a company that provides outsourced solutions. Monthly we provide info to businesses about their system. Also, we frequently make recommendations to augment the systems to improve performance. Businesses often choose to ignore our reports and recommendations.
Nothing's more frustrating than a meeting where a business tells us we mucked it up and, in return, we drop off the last 6 months of recommendations on upgrades to provide additional hardware for their growing requirements and ask why they chose to ignore them.
Now I'm not saying the provider didn't muck up. But what I am saying is that your statement that it's all the provider's fault may not be the case, as the airline probably chose to stay on that system because it 'met' their needs as they saw them.
Re:Common problem (Score:4, Insightful)
Only if it's done wrong.
The "value" of outsourcing in that particular example is that it forces the company to completely spec out the system requirements. No changes without documentation. With poorly controlled internal development, changes happen in the hallways or the cafeteria: "Hey, Rick, did you add the code to handle the offline situation?" "Oh, right, I'll just put in a return value for you." This leads to code that doesn't match its spec, making it harder to maintain. Outsourcing tends to enforce a good interface between spec and code (which is what your claim seems to be).
Internally developed programs don't necessarily receive the same amount of attention to detail because the programmers typically have an idea about the business domain of the problem, and can work more with less documentation. In some organizations, this leads to "fast and loose" -- great for response time, not so great for maintainability.
I think the "value" of outsourcing in a case like ComAir's is one of liability: ComAir will probably try to play "let's blame the vendor." Or, maybe they'll offer up for sacrifice only the one guy who signed the contract with the vendor, and not an entire division. But, when a failure reaches this magnitude, I don't think they'll get off that easy.
Bugtraq covered this as well.. (Score:5, Informative)
Hi,
On Christmas Day last Saturday, Comair Airlines had to completely stop flying
all of its planes due to computer problems. Comair blamed the computer
problems on their pilot scheduling software being overloaded after bad
weather earlier in the week forced many flights to be rescheduled. Comair
now hopes to have all of its 1,100 daily flights restored by tomorrow.
An article which was published today at the Cincinnati Post Web site
provides some interesting details of a software failure in Comair's pilot
scheduling software:
How it happened
http://www.cincypost.com/2004/12/28/comp12-28-200
According to the article, Comair is running a 15-year old scheduling
software package from SBS International (www.sbsint.com). The software has
a hard limit of 32,000 schedule changes per month. With all of the bad
weather last week, Comair apparently hit this limit and then was unable to
assign pilots to planes.
It sounds like 16-bit integers are being used in the SBS International
scheduling software to identify transactions. Given that the software is 15
years old, this design decision perhaps was made to save on memory usage.
In retrospect, 16-bit integers were probably not a good choice.
An anonymous message posted to Slashdot the day after Christmas first
described the software failure at Comair:
http://slashdot.org/comments.pl?sid=134005&cid=11
Earlier this year, an overflow of a 32-bit counter in Windows shut down air
traffic control over southern California for 3 hours:
Microsoft server crash nearly causes 800-plane pile-up
http://www.techworld.com/opsys/news/index.cfm?New
This problem occurred because of a known design flaw in older versions of
Windows:
http://tinyurl.com/5n9gc
Richard M. Smith
http://www.ComputerBytesMan.com
Re:Bugtraq covered this as well.. (Score:5, Insightful)
Rubbish. Don't judge yesteryear's programs by today's standards. Back then 4MB RAM cost more than $200. That's how important memory conservation was. In 1989 using an int was a perfectly acceptable choice. If you were programming back then you'd know how loath programmers were to use longs when they didn't have to. (Granted an unsigned int would've worked better here, but that 64K limit could've also been reached.)
The software spec probably says something to the effect of "Don't attempt to schedule more than 32,767 crew changes." If you're running software that's more than a decade old you need to know what the limits of your software are.
Re:Bugtraq covered this as well.. (Score:5, Informative)
Re:Bugtraq covered this as well.. (Score:4, Insightful)
The only place where shaving bits made sense for us was on data records: we had a hash file with 2.1 million records, each 29 bytes long, and they all had to fit on a single 80MB hard drive. We squeezed every single bit out of those records, including developing a 3-byte integer to handle amounts that we told them could never exceed $99,999.99 (among other things, larger amounts would not have printed correctly). But they were read-only records to us: we never wrote more than a few thousand rows of data, and we had plenty of space for the day's processing. And when they did have the odd line item that exceeded $100,000.00, they figured out to break it up into multiple smaller items.
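For readers wondering how a 3-byte integer can hold amounts up to $99,999.99, here is a minimal sketch in C. It assumes the amount is stored as a count of cents (9,999,999 at most, which fits in 24 bits since 2^24 = 16,777,216) and uses a least-significant-byte-first layout. The names `pack_amount`, `unpack_amount` and `MAX_AMOUNT_CENTS` are hypothetical, and the poster's real record format is not known.

```c
#include <stdint.h>

/* Largest amount described in the comment: $99,999.99, held as cents.
 * Storing cents is an assumption made for this sketch. */
#define MAX_AMOUNT_CENTS 9999999u

/* Store cents in three bytes, least significant byte first.
 * Returns 0 on success, -1 if the amount is above the stated limit. */
int pack_amount(uint32_t cents, uint8_t out[3])
{
    if (cents > MAX_AMOUNT_CENTS) {
        return -1;  /* refuse rather than silently truncate */
    }
    out[0] = (uint8_t)(cents & 0xFFu);
    out[1] = (uint8_t)((cents >> 8) & 0xFFu);
    out[2] = (uint8_t)((cents >> 16) & 0xFFu);
    return 0;
}

/* Rebuild the 24-bit value from its three bytes. */
uint32_t unpack_amount(const uint8_t in[3])
{
    return (uint32_t)in[0] | ((uint32_t)in[1] << 8) | ((uint32_t)in[2] << 16);
}
```

The explicit range check is the important part: the saving of one byte per record is only safe if an out-of-range amount is rejected at the boundary instead of being truncated.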
And we got bit more than once by overflows. It took like three separate f-ups to get this guy to acknowledge that he needed to stop being stingy with the bytes. Even then, he'd still try to sneak in some memory "savings", but at least he stopped arguing when we called him on them.
Re:Bugtraq covered this as well.. (Score:3, Interesting)
Indeed. I get the impression that Boeing is very unmotivated when it comes to keeping its IS technology up to date. Until recently, they were still using slips of paper to track the process of assembling their airplanes!
What's particularly disturbing is that nothing was done about this during the big Y2K push 5 years ago. Of course, the official goal of Y2K efforts was to make sure your comput
From Another article... (Score:5, Interesting)
"The computer failure that grounded an airline's entire fleet over the Christmas weekend and stranded thousands of travelers was due to creaky software that couldn't count higher than 32,768."
According to the Post, the software -- which tracks all details of crew scheduling, including how long they have flown (an FAA regulation restricts airtime), and logs every change -- has a 16-bit counter that limits the number of changes to 32,768 in any given month.
to be fair (although it's not an excuse), but 32K crew changes in a month? that's like 1,000 a day? that's crazy!...
Re:From Another article... (Score:5, Funny)
You aren't by any chance the original developer of this software?
Re:From Another article... (Score:3, Interesting)
Re:From Another article... (Score:4, Insightful)
I would suspect the attitude of debating a limit without knowing the business context your design choice exists in is probably what created this error to begin with.
Let's not be too hard.. (Score:5, Interesting)
"This probably seemed like plenty to the designers, but when the storms hit last week, they caused many, many crew reassignments, and the value of 32,000 was exceeded," he said.
It's true, it was an extreme combination of circumstances... horrid weather (heck, there was snow in some Texas town for the first time in like 80 years or something, read it in some glurge article) coupled with the winter holidays. They should redesign their system and admit that they've grown to a level where their system is unable to handle extreme circumstances, and this should serve as a great wake-up call for them.
In the past I've always chuckled at the thought of 'upgrading for the sake of upgrading', but I suppose this is one case where an earlier upgrade could have saved them millions and made a lot of people's holidays better.
Re:Let's not be too hard.. (Score:3, Informative)
Re: (Score:3, Informative)
Re:It shows what unions are all about (Score:5, Insightful)
The smaller carriers all have one thing in common - no unions. They do not pay their pilots as much, and their pilots do not get paid if they don't fly. The number one expense for an airline is fuel, but the number two expense is the pilots, stewardesses, mechanics, and baggage handlers. There was no way for older airlines to meet the new market conditions (fly more for less profit per flight) without paying people less. The problem is that no one wants to be paid less, so instead they get rid of the least powerful people (who also happen to be the least paid). This is also specified in the union contract... all layoffs must be FILO.
Essentially, the major carriers are hamstrung by the unions, and they will not survive long term. Unions work by artificially limiting labor supply - but that doesn't work if there is not enough work.
The unions say how evil it is that they are getting pay cuts, but where exactly do they expect the money to come from? The government really should not prop up certain providers when others are eager to take their place. Competition works for the most part. Air travel is becoming a commodity market, like cars. Market transitions cause upheavals, and change the market leaders - especially if the current leaders cannot change their business structure.
I have to say that I totally disagree about management being incompetent - the current management (at least the upper level ones I deal with) are extremely good. They may even get the airline to survive and change to the current market conditions. But what has really destroyed the airlines is the changing markets, and the unions preventing the old airlines from changing with the times. The only thing management could have done would be to have rejected the union contracts earlier. But I doubt if that was possible.
Unions seem to believe that society owes them a living. The problem is that society (except in the form of government) is not a person, and so recognizes no debts. Fighting that is totally ineffective because there is no one to fight.
Blaming the unions for a bad management decision? (Score:3, Interesting)
It's often easier to blame the unions than fix bad management practices, and in a lot of cases competitors that have workers in the same union operate well and get the job done instead of throwing their hands in the air, blaming the unions, and refusing to move into this century.
Unions are also used to seeing promises that pay will go back to normal in good times broken. The reality of capitalism is that if you don't have your act together enough
How old was this software? (Score:2, Interesting)
Re:How old was this software? (Score:2, Insightful)
Re:How old was this software? (Score:2)
So after Y2K is this ... (Score:5, Funny)
Yeahhhhhh! Mmmmmmkay!
Did you get that memo?
Re: (Score:2)
Re:So after Y2K is this ... (Score:2)
Wasn't it Nic Cage? (Score:2, Funny)
Re:Wasn't it Nic Cage? (Score:3, Funny)
Let's try to remember (Score:5, Funny)
Re:Let's try to remember (Score:2, Funny)
What about two of those little baggage carts crashing in the arrivals area? Surely that's even less traumatic?
Damn you 2s Complement! (Score:3, Insightful)
Re:Damn you 2s Complement! (Score:2)
-N
Don't mod me up. Mod up the Zucker brothers. (Score:4, Funny)
When the shuttle on the screen blows up, and is accompanied by a very loud explosion sound outside the building, the kid looks sheepish and sneaks away.
Re:Don't mod me up. Mod up the Zucker brothers. (Score:2)
Mod 'em up anyway. Just because. "Airplane II" had so many great gags that anyone should be proud to have their name on it:
the automatic doors that go SHH when you go SHH at them.
talking to Shatner on the viewscreen which turns out to be a window in a door
keeping with current trends, elevator music in the near future will be deafening.
the aforementioned kid videogame scene.
Error checking is the real culprit (Score:5, Insightful)
But the real problem is a lack of error checking. It sounds like the code had something like:
int num_crew_changes;
crew_change_list[++num_crew_changes] = blah;
And the counter wrapped and the system crashed.
The code should have said:
if (num_crew_changes == MAXINT)
{
ERROR(E1234, "too many crew changes");
}
The system is still degraded after 32767 crew changes. It might be so degraded as to be unusable. But at least the company would know the extent of the degradation and could pull out the appropriate "Plan B". It's much safer and better to work around a known problem of known scope than to work around a system crash when you don't know the exact problem.
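A slightly fuller sketch of that check, written as a self-contained C function. It assumes the counter is a signed 16-bit value; `change_log`, `record_crew_change` and `MAX_CREW_CHANGES` are hypothetical names, and the error is simply written to stderr in place of the `ERROR` macro above. It is not a reconstruction of the vendor's code.

```c
#include <stdint.h>
#include <stdio.h>

#define MAX_CREW_CHANGES INT16_MAX  /* 32767, the assumed hard limit */

typedef struct {
    int16_t count;  /* number of crew changes recorded this month */
} change_log;

/* Record one crew change.
 * Returns 0 on success, -1 when the log is full. The counter is never
 * incremented past the limit, so it cannot wrap to a negative value. */
int record_crew_change(change_log *changes)
{
    if (changes->count >= MAX_CREW_CHANGES) {
        fprintf(stderr, "E1234: too many crew changes (limit %d)\n",
                MAX_CREW_CHANGES);
        return -1;
    }
    changes->count++;
    return 0;
}
```

The caller gets a definite failure code at the boundary, so the operator sees "limit reached" rather than an unexplained crash.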
The "Real" culprit... (Score:2)
Re:Error checking is the real culprit (Score:3, Interesting)
In aviation there is a concept of "Legality" which basically states that the FAA in cooperation with the ICAO www.icao.int has set the allowable hours a pilot or F/A can work in a day. Since this is a crew scheduling application the airline was no longer able to determine the flight status of their crews and by FAA regulations needed to stop flying until they could once again determine their crew(s) flight status.
It's times like this... (Score:5, Funny)
Re:It's times like this... (Score:2)
If my seat literally costs $300 and they sell it to me for $225 that's just stupid. Even if I only fly that airline it still COSTS THEM MONEY to fly me.
So why don't airlines just charge what it actually costs to fly. The others who don't will die anyways because well they're losing money with each "heart, mind and body" they "win" with lower prices.
As for this particular bug
Re:It's times like this... (Score:2)
Re:It's times like this... (Score:2)
The only reason why the airlines are so popular is because they made it relatively cheap. It's just not the reality of the situation.
It's a business that truly fills a niche [e.g. travel by air as a last resort] that it should be at least for now. I never shed a tear for businesses that assume it's their right to make billions in profits. This is the same thing as the movie/music industry bitching when people stop shelling out the dough for their latest "product".
If
Re:It's times like this... (Score:3, Insightful)
My point is the **real** solution is to
get this, this is a doozy
====> **** NOT FLY THE FUCKING PLANE IN THE FIRST PLACE **** <====
If you're over supplying the true demand then you're always going to waste money. Don't make 90 million gizmos when there is only demand for 1 million gizmos.
The demand for air travel only surged when discount rates appeared. Discount rates only appeared to fill seats [re: artificial demand].
Re:It's times like this... (Score:4, Insightful)
A sells tickets for $0 loss.
B sells tickets for $75 loss.
B gains many customers. However, the more customers the more loss they incur. Recall EVERY SEAT costs them $75. Eventually B just runs out of money and ups the costs.
Now A and B sell at the same cost. Customers notice the price hike and get upset [because for some reason people think air travel is a god given right so they get insanely upset at everything].
Sure some won-over customers will stay with B but many will spread out [many are also not particularly loyal they just use whatever cheaptickets.com tells them to].
Tell me I'm wrong. Tell me that most airlines haven't been filing for protection. Come on, tell me
Tom
Re:It's times like this... (Score:2)
The simple math of the situation is if it costs them money for me to sit in the seat then they're not on the road to recovery.
How can you honestly tell your investors "you're doing all you can to not go for broke" when you sell seats at a loss?
Tom
You've missed one dimension (Score:3, Insightful)
You want to fly to Los Angeles in a month and purchase the ticket then. The price you pay reflects the value of being able to make that choice then and assuring the airline of a seat being filled in a month. The value of the seat changes as time goes on such that 1 day before the plan
New CIO? (Score:2, Interesting)
"Well... we were holistically mitigating our financial stance outside the box of current processes while trying to forecast our future technological stability within the transport industry."
"Well... you're fired! NEXT?!"
Re:New CIO? (Score:4, Interesting)
Probably never. Our CIO is an idiot when it comes to technology. He has a law degree from a big college. He earns six figures for sitting in his office and trading stocks (his stocks) all day. I'll never forget the day he picked up a WordPerfect Office11 box looked at it and then said, "So Tom... who is it that makes Word?"
These guys are dumber than dirt, but they're well-connected in the "good ole boy club"
Once did IT support for Comair (Score:3, Informative)
unsigned (Score:2)
why would anybody make an event counter a signed value?
short numberScheduleChanges;
hello?
unsigned short numberScheduleChanges;
fixes the problem.
Re:unsigned (Score:4, Insightful)
fixes the problem.
You do realize that you've just fallen into the same trap, right? That doesn't fix the problem worth a damn. I mean, sure it doubles the number of changes. And yes, 64,000 should be enough. But, hey, 32,000 should have been enough too, right?
Programs have internal limits. That's kosher. What's not appropriate is allowing the user base to exceed them or - for something like this - come close to exceeding them, without giving some kind of warning that notifies people of an impending problem and provides possible solutions (purge data, etc). Now you may point out that adding that kind of security increases the cost and complexity of software. Yup. That's why true enterprise software is expensive. Because that's what you're paying for.
Another alternative would have been for the software to wrap and start purging existing records to make room for new ones. Either way, there should have been some defined strategy for the boundary condition, and there wasn't.
The other thing that the software vendor should have done when pushing their upgrade is point out that the previous version wouldn't allow flights to continue in that situation, but the new version expanded it to (some large number). Instead, they probably said, "We're 32 bit!" or something totally meaningless to the people evaluating the business case for the upgrade.
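The "warn before the limit" idea in this comment can be sketched as a small capacity check in C. The 80% threshold, the 32767 limit and the names `check_capacity`, `CHANGE_LIMIT` and `WARN_PERCENT` are all assumptions made for illustration. A real system would pick its own threshold and decide what the warning triggers (an operator alert, a purge, and so on).

```c
#include <stdint.h>

#define CHANGE_LIMIT 32767  /* assumed hard limit (signed 16-bit counter) */
#define WARN_PERCENT 80     /* assumed warning threshold, in percent */

typedef enum { CAPACITY_OK, CAPACITY_WARN, CAPACITY_FULL } capacity_status;

/* Classify how close the current usage is to the hard limit. */
capacity_status check_capacity(int32_t used)
{
    if (used >= CHANGE_LIMIT) {
        return CAPACITY_FULL;
    }
    /* Multiply in 64 bits so the comparison itself cannot overflow. */
    if ((int64_t)used * 100 >= (int64_t)CHANGE_LIMIT * WARN_PERCENT) {
        return CAPACITY_WARN;
    }
    return CAPACITY_OK;
}
```

With these assumed numbers the warning state begins at 26,214 changes, which would leave several thousand changes of headroom to purge data or switch to a manual process.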
Playing with fire (Score:2, Insightful)
Maestro sucks. (Score:4, Informative)
This stuff could be handled by a team of a dozen web based programmers (Java? C? ASP? LAMP? You pick.) in a few months. It's not difficult.
The other Maestro (Score:2)
Not only that, a lot of Oldfield's game involves piloting glider- or airplane-like avatars.
/. acquired by The Sun (UK) ? (Score:2)
"yeah what 'appened right was this computer-me-thingy went all pear-shaped and these silly buggers Comair went and got themselves 'Done In'..."
ComAir Now Hiring IT People (Score:5, Funny)
Re:ComAir Now Hiring IT People (Score:3, Insightful)
Read to the bottom. They're also hiring a "Staff Scheduler". Only a high school diploma required and 1 year of experience. Maybe they should raise their qualification requirements for this one given recent difficulties
There was a high profile example of this problem (Score:5, Interesting)
In any case, this is one of those incidents like the Therac-25 accidents that experienced programmers should always have in mind.
Coding practice (Score:3, Insightful)
Back in the 80s when I was a C programmer (K&R, thank you, the one true C), C integer types were not standardized. "Integers" were defined to be the most natural size for a machine (typically a data word), "shorts" were defined to be no larger than ints, but possibly smaller (and thus possibly more space efficient). This reflected the philosophy of C-as-portable-assembler: if you were indexing an array of character representations of digits, for example, t
Hmm.. Delta owns Comair... (Score:2)
I'm wondering when Delta & Comair identified the troublesome software package as something that they should upgrade? The article says that it's going to be upgraded in upcoming weeks, so when did the upgrade get approved?
Is Comair going to try to go after SBS International for damages?
Has this issue ever come up before or di
martian rovers two week computer crash (Score:2)
So the point is, t
Happened to me too (Score:5, Interesting)
The upshot was that I was able to convince management to totally scrap it and allow me to write a new one. The downside was that the idiot who designed the original system went on to spend 100 million dollars on this new, grandiose system that too was eventually scrapped, but he knew long before that his turkey wasn't going to fly, so he quit and became a lead architect at some other company.
*Sigh*...okay, back to coding.
SIGNED 16-bit!! (Score:3, Interesting)
Re:SIGNED 16-bit!! (Score:3, Informative)
This might be viewed as laziness depending on the circumstances. Obviously, it seems weird to waste half of the integer space just so you can return -1 on error, but if you need to report many various error condi
Re:Signed or unsigned (Score:5, Informative)
Tom Carter, a computer consultant with Clover Link Systems of Los Angeles, said the application has a hard limit of 32,000 changes in a single month.
"This probably seemed like plenty to the designers, but when the storms hit last week, they caused many, many crew reassignments, and the value of 32,000 was exceeded," he said.
So it sounds like a signed int.
Re:Signed or unsigned (Score:2)
Re:Signed or unsigned (Score:2, Insightful)
Since 2^16 = 65536, I'm guessing signed.
Re:Comair? (Score:2, Insightful)
Time is valuable. (Score:2)
I can't stand it when someone posts a URL to some mailing list, telling everyone to go and look at it, without telling us why we should care about it.
When taken without neighboring information, the only clues that Slashdot gave about the article was that it was in the 'IT' section, and had a 'bug' picture next to it, so we know it was a technology problem, which most computer geeks would have known fr
Re:Time is valuable. (Score:2)
Thank you -- that's all I'm trying to say. I'm frustrated at how amateur most of the Slashdot editors are, but I honestly wasn't trying to be a troll about it.
A simple amendment to the statement they used would have been enough to clarify everythi
Re:Comair? (Score:2)
Re:Comair? (Score:2, Funny)
Re:Comair? (Score:5, Insightful)
Now my question would be, since they're owned by Delta, why wouldn't Comair flights be handled within Delta's own reservation/flight tracking system?
p.s. I've traveled through CVG, on Delta, during the holidays. Not anymore... One weather-delayed flight and the whole system falls apart.
Re:Comair? (Score:2, Insightful)
Thank you. So if the article just said something like...
Re:Comair? (Score:2)
Re:Maybe it had "worked just fine" for them? (Score:5, Insightful)
Re:Maybe it had "worked just fine" for them? (Score:3, Insightful)
Very rarely does
Re:Maybe it had "worked just fine" for them? (Score:5, Insightful)
Re:Maybe it had "worked just fine" for them? (Score:2)
Re:Maybe it had "worked just fine" for them? (Score:2)
Re:Maybe it had "worked just fine" for them? (Score:4, Insightful)
Business decides to buy a software package. After a while, upgrades come out, and the old version keeps getting pushed to the limits. IT advises business of this, and says that an upgrade/replacement will resolve the problem, but business refuses to authorize said upgrade/replacement.
How do you propose IT "make it work" when their hands are tied? Even worse, IT will take the blame when it wasn't even their decision to make.
Re:Maybe it had "worked just fine" for them? (Score:3, Insightful)
THIS INCLUDES COMPUTING.
Part of serving the business is being built to capacity. This is no different than a correctly tooled factory or having enough warm bodies.
Your mentality simply falls from the illusion that IT isn't an integral part of the business. You can always choose to be "penny wise". However, that always comes with inherent risk.
The question you need to ask your CIO is: Do you feel lucky?
Re:Maybe it had "worked just fine" for them? (Score:3, Insightful)
That said, it's very true that many businesses get by "just fine" with existing, antiquated systems. Justifying system upgrades can be difficult from a conventional cost-benefit standpoint, when a large part of the benefit is based on preventing theoretical problems like this one.
Re:Maybe it had "worked just fine" for them? (Score:2)
It's also true that many organizations don't upgrade things because they continue to run worry-free for ages, therefore upgrades aren't seen as necessary.
Upgrades & security patches often get overlooked because of the old saying - "If it isn't broken, don't fix it!"
Re:Maybe it had "worked just fine" for them? (Score:5, Informative)
RTFA RTFA RTFA - The new system goes live in January. Good god, it's like herding cats around here.
Gotta love
I did RTFA (Score:3, Funny)
First paragraph. I had just forgotten about it by the time I got to the *end* of the article. 6am + ADD - caffeine = me missing that bit. My bad.
Re:Maybe it had "worked just fine" for them? (Score:4, Informative)
Re:Maybe it had "worked just fine" for them? (Score:2)
Thank you Mr. Monday-Morning-Quarterback.
It's all different once you are on the field.
Re:Maybe it had "worked just fine" for them? (Score:3, Funny)
Re:Maybe it had "worked just fine" for them? (Score:3, Funny)
Re:Maybe it had "worked just fine" for them? (Score:3, Funny)
Re: (Score:3, Funny)
Re:Maybe it had "worked just fine" for them? (Score:2, Insightful)
Re:65535+2 post (Score:3, Insightful)
Re:65535+2 post (Score:3, Insightful)
The article says they had 1100 flights on one day. That's 34100 per month. So basically they seem to have had on average about one crew schedule change per flight.
Now, that is a very bad failure rate, but given that any one crew change probably causes a mass of knock-on changes (Fred misses this flight, so you have to substitute John, and then someone has to take over what John should be doing for the rest of the da
Re:Hmmmm.... Maybe I should edit my software (Score:4, Insightful)
It always fits perfectly in the context as well, as this example proves.