Bug

Comair Done In by 16-Bit Counter

Gogo Dodo writes "According to the Cincinnati Post, the Comair system crash was caused by an overflowed 16-bit counter. Perhaps Comair should have paid for the software upgrade to MaestroCrew." You heard it here first...
  • by Cytlid ( 95255 ) on Thursday December 30, 2004 @08:58AM (#11218201)
    This was Y32k!
  • Well... (Score:5, Funny)

    by Tuxedo Jack ( 648130 ) on Thursday December 30, 2004 @08:58AM (#11218202) Homepage
    It seems that 16 bits and 640K wasn't enough for them after all.
  • actually... (Score:2, Funny)

    by erroneus ( 253617 )
    ...I heard it on BugTraq first...
  • Common problem (Score:4, Insightful)

    by confusion ( 14388 ) on Thursday December 30, 2004 @09:01AM (#11218227) Homepage
    Well, not this specific problem, but businesses have a common problem of outgrowing the systems that run their business. OTOH, this was an outsourced solution, so this case is pretty hard to explain away, other than sheer incompetence.
    • Re:Common problem (Score:4, Insightful)

      by Anonymous Coward on Thursday December 30, 2004 @09:12AM (#11218298)
      That's not true.

      Even if a system is outsourced, that doesn't guarantee the company a 100% stable system. Businesses frequently define the type of system they want (hardware and software) and the amount they're willing to pay for it.

      I work in a company that provides outsourced solutions. Monthly we provide info to businesses about their system. Also, we frequently make recommendations to augment the systems to improve performance. Businesses often choose to ignore our reports and recommendations.

      Nothing's more frustrating than a meeting where a business tells us we mucked it up, and in return we drop off the last 6 months of upgrade recommendations for additional hardware to meet their growing requirements and ask why they chose to ignore them.

      Now I'm not saying the provider didn't muck up. But what I am saying is that your claim that it's all the provider's fault may not hold, as the airline probably chose to stay on that system because it 'met' their needs as they saw them.
  • by EvilStein ( 414640 ) <spam@BALDWINpbp.net minus author> on Thursday December 30, 2004 @09:02AM (#11218238)
    Here's the original post [neohapsis.com]:

    Hi,

    On Christmas Day last Saturday, Comair Airlines had to completely stop
    flying all of its planes due to computer problems. Comair blamed the computer
    problems on their pilot scheduling software being overloaded after bad
    weather earlier in the week forced many flights to be rescheduled. Comair
    now hopes to have all of its 1,100 daily flights restored by tomorrow.

    An article which was published today at the Cincinnati Post Web site
    provides some interesting details of a software failure in Comair's pilot
    scheduling software:

    How it happened
    http://www.cincypost.com/2004/12/28/comp12-28-2004.html

    According to the article, Comair is running a 15-year old scheduling
    software package from SBS International (www.sbsint.com). The software has
    a hard limit of 32,000 schedule changes per month. With all of the bad
    weather last week, Comair apparently hit this limit and then was unable to
    assign pilots to planes.

    It sounds like 16-bit integers are being used in the SBS International
    scheduling software to identify transactions. Given that the software is 15
    years old, this design decision perhaps was made to save on memory usage.
    In retrospect, 16-bit integers were probably not a good choice.

    An anonymous message posted to Slashdot the day after Christmas first
    described the software failure at Comair:

    http://slashdot.org/comments.pl?sid=134005&cid=11185556

    Earlier this year, an overflow of a 32-bit counter in Windows shut down air
    traffic control over southern California for 3 hours:

    Microsoft server crash nearly causes 800-plane pile-up
    http://www.techworld.com/opsys/news/index.cfm?NewsID=2275

    This problem occurred because of a known design flaw in older versions of
    Windows:

    http://tinyurl.com/5n9gc

    Richard M. Smith
    http://www.ComputerBytesMan.com

    • by dmccarty ( 152630 ) on Thursday December 30, 2004 @09:22AM (#11218370)
      It sounds like 16-bit integers are being used in the SBS International scheduling software to identify transactions. Given that the software is 15 years old, this design decision perhaps was made to save on memory usage. In retrospect, 16-bit integers were probably not a good choice.

      Rubbish. Don't judge yesteryear's programs by today's standards. Back then 4MB RAM cost more than $200. That's how important memory conservation was. In 1989 using an int was a perfectly acceptable choice. If you were programming back then you'd know how loath programmers were to use longs when they didn't have to. (Granted an unsigned int would've worked better here, but that 64K limit could've also been reached.)

      The software spec probably says something to the effect of "Don't attempt to schedule more than 32,767 crew changes." If you're running software that's more than a decade old you need to know what the limits of your software are.

      • by imsabbel ( 611519 ) on Thursday December 30, 2004 @09:55AM (#11218596)
        $200 for 4MB? That's more 1994 than 1989...
      • by plover ( 150551 ) * on Thursday December 30, 2004 @10:40AM (#11218980) Homepage Journal
        In 1988 I was constantly having this argument with one of our other developers. He insisted on using a char when enumerating 80 or 90 status codes, or a short when conditions were "unlikely" that we'd need a long. We both grew up programming in the '70s (at which point I'd have agreed with you -- back then we only had 16Kwords to play in.) Yes, our 2MB boxes were pretty tight on memory, but even in the 1980s it was obvious that saving a single byte in the executable was a false economy, if it risked stability.

        The only place where shaving bits made sense for us was on data records: we had a hash file with 2.1 million records, each 29 bytes long, and they all had to fit on a single 80MB hard drive. We squeezed every single bit out of those records, including developing a 3-byte integer to handle amounts that we told them could never exceed $99,999.99 (among other things, larger amounts would not have printed correctly). But they were read-only records to us: we never wrote more than a few thousand rows of data, and we had plenty of space for the day's processing. And when they did have the odd line item that exceeded $100,000.00, they figured out to break it up into multiple smaller items.

        And we got bit more than once by overflows. It took like three separate f-ups to get this guy to acknowledge that he needed to stop being stingy with the bytes. Even then, he'd still try to sneak in some memory "savings", but at least he stopped arguing when we called him on them.
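
The 3-byte amount field described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the comment does not say how the field was laid out, so storing the amount in cents, the big-endian byte order, and the names `pack_amount`, `unpack_amount` and `MAX_AMOUNT_CENTS` are all hypothetical. The arithmetic does work out: $99,999.99 is 9,999,999 cents, which fits in 24 bits (maximum 16,777,215).

```c
#include <stdint.h>

/* Hypothetical limit: $99,999.99 expressed in cents. */
#define MAX_AMOUNT_CENTS 9999999L

/* Pack an amount in cents into 3 bytes, big-endian (assumed layout).
 * Returns 0 on success, -1 if the amount is outside the agreed limit. */
int pack_amount(long cents, uint8_t out[3])
{
    if (cents < 0 || cents > MAX_AMOUNT_CENTS)
        return -1;                      /* refuse instead of silently truncating */
    out[0] = (uint8_t)((cents >> 16) & 0xFF);
    out[1] = (uint8_t)((cents >> 8) & 0xFF);
    out[2] = (uint8_t)(cents & 0xFF);
    return 0;
}

/* Rebuild the amount in cents from its 3-byte form. */
long unpack_amount(const uint8_t in[3])
{
    return ((long)in[0] << 16) | ((long)in[1] << 8) | (long)in[2];
}
```

The range check in `pack_amount` is the part that matters for this discussion: a record format can be squeezed hard as long as out-of-range values are rejected loudly rather than wrapped.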

      • If you're running software that's more than a decade old you need to know what the limits of your software are.

        Indeed. I get the impression that Boeing is very unmotivated when it comes to keeping its IS technology up to date. Until recently, they were still using slips of paper to track the process of assembling their airplanes!

        What's particularly disturbing is that nothing was done about this during the big Y2K push 5 years ago. Of course, the official goal of Y2K efforts was to make sure your comput

  • by bje2 ( 533276 ) * on Thursday December 30, 2004 @09:03AM (#11218245)
    from information week [informationweek.com]

    "The computer failure that grounded an airline's entire fleet over the Christmas weekend and stranded thousands of travelers was due to creaky software that couldn't count higher than 32,768." ...

    According to the Post, the software -- which tracks all details of crew scheduling, including how long they have flown (an FAA regulation restricts airtime), and logs every change -- has a 16-bit counter that limits the number of changes to 32,768 in any given month. ...

    to be fair (although it's not an excuse), but 32K crew changes in a month? that's like 1,000 a day? that's crazy!...
    • by Anonymous Coward on Thursday December 30, 2004 @09:15AM (#11218310)
      >... 32K crew changes in a month? that's like 1,000 a day? that's crazy!

      You aren't by any chance the original developer of this software?
    • Legacy systems will often contain such hard limits. Usually, they are buried deep in the code and sometimes no one knows that they exist. Any point where such hard limits exist must be discovered. A solution needs to then be designed for each situation. If you are a manager or a maintainer of such a system, it is your responsibility to do this. When you are questioned, just point out the Comair computer disaster.
    • by tsangc ( 177574 ) on Thursday December 30, 2004 @10:42AM (#11218995)
      to be fair (although it's not an excuse), but 32K crew changes in a month? that's like 1,000 a day? that's crazy!...


      I would suspect the attitude of debating a limit without knowing the business context your design choice exists in is probably what created this error to begin with.

  • by Staplerh ( 806722 ) on Thursday December 30, 2004 @09:04AM (#11218250) Homepage
    This was a horrible chain of events that severely inconvenienced a lot of people for Christmas, and I would be hoppin' mad if I was in any of their places. However, let's not jump on ComAir too hard, IMHO. From TFA:

    "This probably seemed like plenty to the designers, but when the storms hit last week, they caused many, many crew reassignments, and the value of 32,000 was exceeded," he said.

    It's true, it was an extreme confluence of circumstances... horrid weather (heck, there was snow in some Texas town for the first time in like 80 years or something, read it in some glurge article) coupled with the winter holidays. They should redesign their system and admit that they've grown to a level where their system is unable to handle extreme circumstances, and this should serve as a great wake-up call for them.

    In the past I've always chuckled at the thought of 'upgrading for the sake of upgrading', but I suppose this is one case where an earlier upgrade could have saved them millions and made a lot of people's holidays better.
    • They HAD outgrown their current system, and they knew it. That's why the new system was scheduled to go online in the next couple months. Unfortunately they met with a perfect storm of problems just at the wrong time. If you've ever worked with retail you know that NOTHING gets changed from mid November to early January unless god and the CEO both say it has to be so. I imagine airlines are pretty much the same. Heck, airlines probably have an even larger freeze window since few people book flights at the las
  • I stopped using 16 bit ints for anything 10 (or more) years ago when I had the joy of migrating systems from a 16 bit OS to a 32 bit OS.
  • by adzoox ( 615327 ) * on Thursday December 30, 2004 @09:04AM (#11218256) Journal
    what Initech handles?

    Yeahhhhhh! Mmmmmmkay!

    Did you get that memo?

  • I thought the Comair crashed when Nicholas Cage steered it into the window of a Las Vegas casino.
  • by CodeWanker ( 534624 ) on Thursday December 30, 2004 @09:10AM (#11218281) Journal
    That when you are talking about an airline, a COMPUTER crash is by far the least traumatic kind you can have.
  • by jellomizer ( 103300 ) * on Thursday December 30, 2004 @09:13AM (#11218303)
    It could have worked if it weren't for two's complement: an unsigned counter would have been good for twice what they had. I think programming languages should make numbers unsigned unless asked otherwise; that way we can take advantage of that extra bit. For things like counters, where negative numbers just won't happen, a signed type is like having a 15-bit number taking 16 bits of space.
    • You could also suggest that changing the default behaviour would be a confusing matter. And programmers worth their salt will probably specify signed or unsigned if it's going to matter or if there's an obviously better choice.
      -N
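
To make the signed versus unsigned trade-off concrete, here is a small sketch of what happens at the top of each 16-bit range. The function names are hypothetical, and nothing here is taken from the actual scheduling software. The signed increment is done in unsigned arithmetic to stay clear of undefined behavior; converting the result back to `int16_t` is implementation-defined in older C standards, and on the usual two's-complement machines it yields -32768.

```c
#include <stdint.h>

/* Increment a signed 16-bit counter with wraparound.
 * On two's-complement machines, 32767 + 1 comes back as -32768. */
int16_t bump_signed(int16_t counter)
{
    return (int16_t)(uint16_t)((uint16_t)counter + 1u);
}

/* The same increment on an unsigned 16-bit counter: the usable range
 * doubles to 0..65535, but 65535 + 1 still wraps, this time to 0. */
uint16_t bump_unsigned(uint16_t counter)
{
    return (uint16_t)(counter + 1u);
}
```

The extra bit buys headroom, not safety: the unsigned counter fails the same way, just later.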
  • by AtariAmarok ( 451306 ) on Thursday December 30, 2004 @09:13AM (#11218305)
    Can't help but remember the scene in one of the "Airplane" movies where a kid sneaks into the hi-tech air traffic control room. He sees one of the airline's shuttle-like planes on a screen, and grabs the nearest joystick and begins to (he thinks) play a videogame.

    When the shuttle on the screen blows up, and is accompanied by a very loud explosion sound outside the building, the kid looks sheepish and sneaks away.

  • by Anonymous Coward on Thursday December 30, 2004 @09:15AM (#11218312)
    So it turned out to be problematic to use a signed 16-bit integer.

    But the real problem is a lack of error checking. It sounds like the code had something like:

    int num_crew_changes; ...
    crew_change_list[++num_crew_changes] = blah;

    And the counter wrapped and the system crashed.

    The code should have said:

    if (num_crew_changes == MAXINT)
    {
    ERROR(E1234, "too many crew changes");
    }

    The system is still degraded after 32767 crew changes. It might be so degraded as to be unusable. But at least the company would know the extent of the degradation and could pull out the appropriate "Plan B". It's much safer and better to work around a known problem of known scope than to work around a system crash when you don't know the exact problem.
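
A compilable variant of the check suggested above might look like the following. It is a sketch only: the `crew_log` struct, the `record_crew_change` function and the return-code convention are assumptions for illustration, not anything known about the real code. The point is that the limit is tested before the increment, so the counter never wraps and the caller gets a definite error to act on.

```c
#include <stdint.h>

#define MAX_CREW_CHANGES INT16_MAX      /* 32767, largest signed 16-bit value */

typedef struct {
    int16_t num_crew_changes;           /* hypothetical monthly counter */
} crew_log;

/* Record one crew change.
 * Returns 0 on success, or -1 when the counter is already at its limit;
 * in that case nothing is modified and the caller must report the error. */
int record_crew_change(crew_log *cl)
{
    if (cl->num_crew_changes >= MAX_CREW_CHANGES)
        return -1;                      /* known, reportable failure; no wrap */
    cl->num_crew_changes++;
    return 0;
}
```

A caller would turn the -1 into whatever "too many crew changes" message or fallback procedure the operators have agreed on.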
    • ...is the management of software companies who ceased using real computer scientists to design and write their apps because disposable code monkeys work for so much cheaper. And outsourcing is even cheaper... in the short term (which seems to be all that matters in this industry anymore)
    • Although the application crashed that is not why Comair needed to cancel all their flights.

      In aviation there is a concept of "Legality" which basically states that the FAA in cooperation with the ICAO www.icao.int has set the allowable hours a pilot or F/A can work in a day. Since this is a crew scheduling application the airline was no longer able to determine the flight status of their crews and by FAA regulations needed to stop flying until they could once again determine their crew(s) flight status.
  • by AtariAmarok ( 451306 ) on Thursday December 30, 2004 @09:16AM (#11218315)
    It's times like this when you begin to realize that the Vic-20 (duct-taped to the bulletin board and surrounded by haywires) might not be the best choice anymore as mission-critical hub of your operations.
    • Oh but didn't you hear? The airline industry is bankrupt. ... Bankrupt to the CEO's million dollar tropical mansion that is.

      If my seat literally costs $300 and they sell it to me for $225 that's just stupid. Even if I only fly that airline it still COSTS THEM MONEY to fly me. ...

      So why don't airlines just charge what it actually costs to fly. The others who don't will die anyways because well they're losing money with each "heart, mind and body" they "win" with lower prices.

      As for this particular bug
      • Because if they DON'T offer some discounted fares to fill the seats then they have a HUGE fixed cost being spread over even fewer full fare customers. Basically your discount butt pays for the interest on the money it costs to lease the plane, and the only chance of making money they have is business class customers paying full fare. Unfortunately for the airlines there aren't as many businesses willing to pay full fare, they would rather allow flexibility in their employees schedule than pay the full fare l
        • I agree with your darwinism issue.

          The only reason why the airlines are so popular is because they made it relatively cheap. It's just not the reality of the situation.

          It's a business that truly fills a niche [e.g. travel by air as a last resort] that it should be at least for now. I never shed a tear for businesses that assume it's their right to make billions in profits. This is the same thing as the movie/music industry bitching when people stop shilling over the dough for their latest "product".

          If
  • New CIO? (Score:2, Interesting)

    by Mr. BS ( 788514 )
    I wonder how fast this CIO is going to be on his butt.

    "Well... we were holistically mitigating our financial stance outside the box of current processes while trying to forecast our future technological stability within the transport industry."

    "Well... you're fired! NEXT?!"
    • Re:New CIO? (Score:4, Interesting)

      by mslinux ( 570958 ) on Thursday December 30, 2004 @10:44AM (#11219015)
      I wonder how fast this CIO is going to be on his butt.

      Probably never. Our CIO is an idiot when it comes to technology. He has a law degree from a big college. He earns six figures for sitting in his office and trading stocks (his stocks) all day. I'll never forget the day he picked up a WordPerfect Office11 box looked at it and then said, "So Tom... who is it that makes Word?"

      These guys are dumber than dirt, but they're well-connected in the "good ole boy club"
  • by Anonymous Coward on Thursday December 30, 2004 @09:18AM (#11218335)
    Having once done tech support for the Maestro program used by Comair (and other scheduling software for other airlines as well), I think the software is junk. The employees undoubtedly said "I told you so!" when it broke, because they hated it as much as the support team did. IMO the airline didn't bother upgrading because they didn't think the old version was broken enough or outdated enough to warrant it.
  • Hmmm.

    why would anybody make an event counter a signed value?

    short numberScheduleChanges;

    hello?

    unsigned short numberScheduleChanges;

    fixes the problem.

    • Re:unsigned (Score:4, Insightful)

      by rjstanford ( 69735 ) on Thursday December 30, 2004 @09:57AM (#11218615) Homepage Journal
      unsigned short numberScheduleChanges;

      fixes the problem.


      You do realize that you've just fallen into the same trap, right? That doesn't fix the problem worth a damn. I mean, sure it doubles the amount of changes. And yes, 64,000 should be enough. But, hey, 32,000 should have been enough too, right?

      Programs have internal limits. That's kosher. What's not appropriate is allowing the user base to exceed them or - for something like this - come close to exceeding them, without giving some kind of warning that notifies people of an impending problem and provides possible solutions (purge data, etc). Now you may point out that adding that kind of security increases the cost and complexity of software. Yup. That's why true enterprise software is expensive. Because that's what you're paying for.

      Another alternative would have been for the software to wrap and start purging existing records to make room for new ones. Either way, there should have been some defined strategy for the boundary condition, and there wasn't.

      The other thing that the software vendor should have done when pushing their upgrade is point out that the previous version wouldn't allow flights to continue in that situation, but the new version expanded it to (some large number). Instead, they probably said, "We're 32 bit!" or something totally meaningless to the people evaluating the business case for the upgrade.
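
The early-warning idea in this comment can be sketched as a simple capacity check. Everything here is hypothetical: the `cap_status` enum, the `capacity_status` function and the percentage threshold are illustrative choices, not a description of any real product.

```c
#include <stdint.h>

typedef enum { CAP_OK, CAP_WARNING, CAP_FULL } cap_status;

/* Classify how close `used` is to `limit`.
 * `warn_percent` is a hypothetical tunable, e.g. 80 for "warn at 80% full". */
cap_status capacity_status(int32_t used, int32_t limit, int warn_percent)
{
    if (used >= limit)
        return CAP_FULL;
    /* 64-bit intermediates so the multiplications cannot overflow. */
    if ((int64_t)used * 100 >= (int64_t)limit * warn_percent)
        return CAP_WARNING;
    return CAP_OK;
}
```

With a limit of 32,767 and an 80% threshold, the warning would start at 26,214 changes, which leaves operators several thousand changes of headroom to purge data or switch to a fallback before the hard stop.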
  • Playing with fire (Score:2, Insightful)

    by ravingidiot ( 798346 )
    Why was Comair using signed shorts to track their scheduling changes anyway? It seems to me that a company of that magnitude should expect to run into more than 32000 schedule changes within one month more than once. I mean, I can understand that the counter was probably designed with space constraints in mind, but for christs sake, it would've only been two extra bytes to fix this. That brings the total up to some 4 billion unsigned if I'm not mistaken. Technically, they could've used just three byt
  • Maestro sucks. (Score:4, Informative)

    by Anonymous Coward on Thursday December 30, 2004 @09:23AM (#11218373)
    Maybe Maestro should just die. My friend is a flight attendant for Southwest and has to use Maestro to plan her schedule. To use it she has to citrix into their main server and wait for an open client (I assume they have either a license or horrible programming restriction on concurrent users). On the very day that the new schedules are posted, it can take hours to log in. It's a joke.

    This stuff could be handled by a team of a dozen web based programmers (Java? C? ASP? LAMP? You pick.) in a few months. It's not difficult.
  • When Mike Oldfield recently came out with his Maestro game [sean.co.uk], I did wonder that the "Maestro" name was not used for software before. Now I know better.

    Not only that, a lot of Oldfield's game involves piloting glider- or airplane-like avatars.

  • seems possible with the ever-decreasing quality of journalism seen here...

    "yeah what 'appened right was this computer-me-thingy went all pear-shaped and these silly buggers Comair went and got themselves 'Done In'..."
  • by JavaDev04 ( 844747 ) on Thursday December 30, 2004 @09:30AM (#11218428)
    Hey everybody! Comair is hiring Unix System Administrators and IT Software Engineers! http://www.comair.com/hr/other/ [comair.com]
    • Hey everybody! Comair is hiring Unix System Administrators and IT Software Engineers! http://www.comair.com/hr/other/

      Read to the bottom. They're also hiring a "Staff Scheduler". Only a high school diploma required and 1 year of experience. Maybe they should raise their qualification requirements for this one given recent difficulties ....
  • by hey! ( 33014 ) on Thursday December 30, 2004 @09:40AM (#11218488) Homepage Journal
    back in the early 80's. There was a big financial company that had an automated system that watched the prices of certain commodities and issued automated trade orders. The transactions were stored in arrays addressed by 16 bit signed integers, with the (now) highly predictable result on the first day that trading volume exceeded 16384 transactions. Since in C arrays are just syntactic sugar for pointer arithmetic, the system started executing trades based on "data" from random bits of heap memory. This apparently went on for some time before a human being figured out something had gone wrong, and (reportedly) the company lost billions in a single day. This might be somewhat exaggerated, since the event now has passed into folklore.

    In any case, this is one of those incidents like the Therac-25 accidents that experienced programmers should always have in mind.
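
Because C array indexing is plain pointer arithmetic, a wrapped (negative) 16-bit index reads whatever memory sits before the array. A defensive accessor is one way to contain that. The sketch below is hypothetical throughout: the table name, its size and the `trade_at` function are made up for illustration, and nothing is known about the code in the story.

```c
#include <stddef.h>
#include <stdint.h>

#define MAX_TRADES 20000

/* Hypothetical trade table. */
long trade_prices[MAX_TRADES];

/* Look up a trade by a signed 16-bit index.
 * Returns a pointer to the entry, or NULL when the index is negative
 * (for instance after wrapping past 32767) or beyond the table. */
long *trade_at(int16_t idx)
{
    if (idx < 0 || idx >= MAX_TRADES)
        return NULL;
    return &trade_prices[idx];
}
```

A caller that gets NULL can stop trading and raise an alarm, instead of acting on whatever bytes happen to sit outside the array.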

  • Delta Airlines has also been gutting its IT (and other) staff in recent years, while offering expensive protections to executives and many other employees making over $200,000/yr.

    I'm wondering when Delta & Comair identified the troublesome software package as something that they should upgrade? The article says that it's going to be upgraded in upcoming weeks, so when did the upgrade get approved?

    Is Comair going to try to go after SBS International for damages?

    Has this issue ever come up before or di
  • Remember last February the Martian rovers were sidelined by a computer crash. The reputed problem was that the huge flash memory appeared to fill up because the free inode list in the OS was not built properly. This was the real-time VxWorks OS from Wind River, used by NASA on space probes for 20 years. The Spirit rover went into a perpetual reboot cycle for a day (rebooting is its default response to a severe crash). Then it took two weeks to diagnose, repair and test the software patch.

    So the point is, t
  • Happened to me too (Score:5, Interesting)

    by wandazulu ( 265281 ) on Thursday December 30, 2004 @10:07AM (#11218698)
    I worked at a bank in the early 90s that had a trading system based on SQL Server and the client was written in Visual Basic 3. Apart from every other bad design choice in this system (I inherited it when the designers got promoted and started working on another, even bigger system), the all-important record counter was an integer, so when trade 32768 was posted, the application crashed, and simply could not be started again, because the first thing it did was try to show the current total (it was written for operators to use, not traders). Worse, the counter variable wasn't a global: it was often a stack variable, and always with a different name (sometimes iCounter, sometimes iCount, sometimes x).

    The upshot was that I was able to convince management to totally scrap it and allow me to write a new one. The downside was that the idiot who designed the original system went on to spend 100 million dollars on this new, grandiose system that too was eventually scrapped, but he knew long before that his turkey wasn't going to fly, so he quit and became a lead architect at some other company.

    *Sigh*...okay, back to coding.
  • SIGNED 16-bit!! (Score:3, Interesting)

    by adam31 ( 817930 ) <adam31 @ g m a i l .com> on Thursday December 30, 2004 @12:57PM (#11220333)
    what-- were they expecting negatives also?
    • Re:SIGNED 16-bit!! (Score:3, Informative)

      by pclminion ( 145572 )
      Using a signed integer allows you to distinguish between error and non-error conditions. In UNIX for example, system calls return a negative value on error, so these calls often are declared to return a signed int even though the number they return in a non-error condition will always be positive.

      This might be viewed as laziness depending on the circumstances. Obviously, it seems weird to waste half of the integer space just so you can return -1 on error, but if you need to report many various error condi
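
The in-band error convention described in this comment looks roughly like the sketch below. POSIX `read()` follows the same pattern, returning a signed `ssize_t` that is -1 on error. The `find_index` function itself is a made-up example, not an API from any library.

```c
#include <stddef.h>

/* Find the first occurrence of `wanted` in values[0..count-1].
 * Returns its index (always >= 0) on success, or -1 if it is absent.
 * Assumes count is small enough to fit in a long. */
long find_index(const int *values, size_t count, int wanted)
{
    for (size_t i = 0; i < count; i++) {
        if (values[i] == wanted)
            return (long)i;
    }
    return -1;                          /* in-band error signal */
}
```

The cost is exactly the one being debated in this thread: half of the return type's range is reserved for a single error value. That is a reasonable trade for a return code, and a much worse one for a counter that is expected to grow.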