Bug

Comair Done In by 16-Bit Counter 441

Gogo Dodo writes "According to the Cincinnati Post, the Comair system crash was caused by an overflowed 16-bit counter. Perhaps Comair should have paid for the software upgrade to MaestroCrew." You heard it here first...
  • by bje2 ( 533276 ) * on Thursday December 30, 2004 @10:03AM (#11218245)
    from information week [informationweek.com]

    "The computer failure that grounded an airline's entire fleet over the Christmas weekend and stranded thousands of travelers was due to creaky software that couldn't count higher than 32,768." ...

    According to the Post, the software -- which tracks all details of crew scheduling, including how long they have flown (an FAA regulation restricts airtime), and logs every change -- has a 16-bit counter that limits the number of changes to 32,768 in any given month. ...

    to be fair (although it's not an excuse), but 32K crew changes in a month? that's like 1,000 a day? that's crazy!...
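
For anyone who wants to see the limit concretely, here is a minimal C sketch of a signed 16-bit counter running out of room. The function names are made up for illustration, and nothing here is taken from the actual crew scheduling software, whose source is not public. The wraparound is simulated with unsigned arithmetic so the example stays well-defined; what a real system does at the limit depends on its language and compiler.

```c
#include <stdint.h>

/* Hypothetical helper: simulate a naive increment of a signed 16-bit
   counter on a two's-complement machine. The addition is done in
   unsigned 16 bits (wraparound is well-defined there), then the bit
   pattern is mapped back to the signed range. */
int16_t naive_increment(int16_t count) {
    uint16_t bits = (uint16_t)count;
    bits = (uint16_t)(bits + 1u);
    if (bits <= INT16_MAX) {
        return (int16_t)bits;
    }
    /* Top bit set: the value reads as negative once treated as signed. */
    return (int16_t)((int32_t)bits - 65536);
}

/* Rough number of changes per day needed to use up all 32,768
   non-negative counter values in a 31-day month. */
int changes_per_day_to_overflow(void) {
    return (INT16_MAX + 1) / 31;
}
```

Incrementing 32767 this way gives -32768, and using up the counter in a 31-day month takes about 1,057 changes a day, so the "1,000 a day" estimate above is about right.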
  • by Staplerh ( 806722 ) on Thursday December 30, 2004 @10:04AM (#11218250) Homepage
    This was a horrible chain of events that severely inconvenienced a lot of people for Christmas, and I would be hoppin' mad if I was in any of their places. However, let's not jump on ComAir too hard, IMHO. From TFA:

    "This probably seemed like plenty to the designers, but when the storms hit last week, they caused many, many crew reassignments, and the value of 32,000 was exceeded," he said.

    It's true, it was an extreme connection of circumstances... horrid weather (heck, there was snow in some Texas town for the first time in like 80 years or something, read it in some glurge article) coupled with the winter holidays. They should redesign their system and admit that they've grown to a level where their system is unable to handle extreme circumstances, and this should serve as a great wake-up call for them.

    In the past I've always chuckled at the thought of 'upgrading for the sake of upgrading', but I suppose this is one case where an earlier upgrade could have saved them millions and made a lot of people's holidays better.
  • by wiredog ( 43288 ) on Thursday December 30, 2004 @10:04AM (#11218251) Journal
    I stopped using 16 bit ints for anything 10 (or more) years ago when I had the joy of migrating systems from a 16 bit OS to a 32 bit OS.
  • New CIO? (Score:2, Interesting)

    by Mr. BS ( 788514 ) on Thursday December 30, 2004 @10:16AM (#11218319)
    I wonder how fast this CIO is going to be on his butt.

    "Well... we were holistically mitigating our financial stance outside the box of current processes while trying to forecast our future technological stability within the transport industry."

    "Well... you're fired! NEXT?!"
  • by mikesmind ( 689651 ) on Thursday December 30, 2004 @10:19AM (#11218343) Homepage
    Legacy systems will often contain such hard limits. Usually, they are buried deep in the code and sometimes no one knows that they exist. Any point where such hard limits exist must be discovered. A solution needs to then be designed for each situation. If you are a manager or a maintainer of such a system, it is your responsibility to do this. When you are questioned, just point out the Comair computer disaster.
  • by hey! ( 33014 ) on Thursday December 30, 2004 @10:40AM (#11218488) Homepage Journal
    back in the early 80's. There was a big financial company that had an automated system that watched the prices of certain commodities and issued automated trade orders. The transactions were stored in arrays addressed by 16 bit signed integers, with the (now) highly predictable result on the first day that trading volume exceeded 16384 transactions. Since in C arrays are just syntactic sugar for pointer arithmetic, the system started executing trades based on "data" from random bits of heap memory. This apparently went on for some time before a human being figured out something had gone wrong, and (reportedly) the company lost billions in a single day. This might be somewhat exaggerated, since the event now has passed into folklore.

    In any case, this is one of those incidents like the Therac-25 accidents that experienced programmers should always have in mind.
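
The "syntactic sugar for pointer arithmetic" point can be sketched in a few lines of C. This is only an illustration under assumed names (`byte_offset`, `store_trade`, and the array layout are hypothetical, not a reconstruction of the trading system in the story): once a signed 16-bit index has gone negative, `trades[i]` addresses memory before the start of the array, and a bounds check is what stops the write.

```c
#include <stddef.h>
#include <stdint.h>

/* Byte offset, relative to the start of the array, that trades[i]
   would touch. In C, trades[i] means *(trades + i), so a negative
   index lands in front of the array. */
long byte_offset(int16_t i, size_t elem_size) {
    return (long)i * (long)elem_size;
}

/* Bounds-checked store: refuses the write instead of scribbling
   outside the array. Returns 0 on success, -1 if the index is
   rejected. */
int store_trade(double *trades, size_t capacity, int16_t i, double price) {
    if (i < 0 || (size_t)i >= capacity) {
        return -1;
    }
    trades[i] = price;
    return 0;
}
```

With 8-byte elements, an index that has wrapped to -32768 points 262,144 bytes before the array. The unchecked version would read or write whatever happens to live there.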

  • Happened to me too (Score:5, Interesting)

    by wandazulu ( 265281 ) on Thursday December 30, 2004 @11:07AM (#11218698)
    I worked at a bank in the early 90s that had a trading system based on SQL Server and the client was written in Visual Basic 3. Apart from every other bad design choice in this system (I inherited it when the designers got promoted and started working on another, even bigger system), the all important record counter was an integer, so when trade 32768 was posted, the application crashed, and simply could not be started again, because the first thing it did was try to show the current total (it was written for operators to use, not traders). Worse was that the counter variable wasn't a global, and it was often times a stack variable, and always with a different name (sometimes iCounter, sometimes iCount, sometimes x).

    The upshot was that I was able to convince management to totally scrap it and allow me to write a new one. The downside was that the idiot who designed the original system went on to spend 100 million dollars on this new, grandiose system that too was eventually scrapped, but he knew long before that his turkey wasn't going to fly, so he quit and became a lead architect at some other company.

    *Sigh*...okay, back to coding.
  • by Anonymous Coward on Thursday December 30, 2004 @11:24AM (#11218858)
    In 1989 I was running GNU Emacs on a 386/20 with 16MB of memory running Unix, which I shared with one other person. This was, I think, a few years after Richard Stallman said he'd be surprised if Emacs would ever run on an x86 platform.

    Sure, we had serial terminals instead of graphical displays, but still, a 16-bit signed int for memory conservation in 1989? That's a load of BS.

    I wouldn't be surprised if they did it because they never gave it a second thought to use an int, or they needed to save on-disk storage space. Maybe their database backup wouldn't fit onto a 1.2MB 5.25" floppy otherwise.
  • Re:Comair? (Score:2, Interesting)

    by amabbi ( 570009 ) on Thursday December 30, 2004 @11:34AM (#11218935)
    Now my question would be, since they're owned by Delta, why wouldn't Comair flights be handled within Delta's own reservation/flight tracking system?

    There probably isn't any reason to. Comair, as a regional jet carrier, has separate crew contracts and crew rules than Delta, a mainline carrier. Thus they operate completely different types of jets, with different crew staffing requirements. The FAA crew rules might even be different. While it might make sense from a consolidation standpoint to merge the two systems of Comair and Delta, since in reality there would be no interaction and no overlap between the two systems (an RJ pilot isn't suddenly going to jump over to fly a 757) the expense isn't worth it.

    p.s. I've traveled through CVG, on Delta, during the holidays. Not anymore... One weather-delayed flight and the whole system falls apart.

    Then I hope you also avoid United/United Express/Ted at O'Hare/Denver, Continental at Newark/Houston, Northwest in Detroit, USAirways in Cincinnati, American at O'Hare/Dallas... etc. etc. Every airline, not just Delta, uses hubs, and ground stops at any of these airports will cause significant delays. That's just the reality of air travel these days; if you're really worried, book non-stop travel (and pay up to 10x more).

  • by GarrettZilla ( 103173 ) on Thursday December 30, 2004 @11:38AM (#11218966)
    I disagree. Which takes more memory - adding a couple more bytes to this counter, or putting in the code to check for maximum value exceeded and emit a message saying "Cannot perform more than 32767 crew reassignments in a single month."? Or did they just press on into unspecified behavior after an integer overflow?

    Hell, just putting the word "unsigned" in front of "int" (do you really need to tally a negative number of crew reassignments?) would have prevented this particular problem and given double the capacity, all else being equal. If you're worried about memory usage, it's certainly not a good idea to waste a bunch of bits on unneeded signs.

    By 1989, we certainly knew that the world was in the habit of using software for ten years or more. Software was being modified all the time as larger memory spaces and requirements came along. It was practice long before that to be explicit about memory-related design decisions because you knew it would be your problem in five years to update the software. Unless you just ran away and quit before the problem came up, and it was somebody else's worry.
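
The two options weighed above (check for the maximum and emit a message, or switch to an unsigned counter) look roughly like this in C. It is a hedged sketch: the function names are invented here, and the message text is simply the one quoted in the comment, not anything the real software is known to print.

```c
#include <stdint.h>
#include <stdio.h>

/* Increment *count unless it is already at the signed 16-bit maximum.
   Returns 0 on success; returns -1 and reports the limit otherwise. */
int checked_increment(int16_t *count) {
    if (*count == INT16_MAX) {
        fprintf(stderr,
                "Cannot perform more than %d crew reassignments "
                "in a single month.\n", INT16_MAX);
        return -1;
    }
    *count = (int16_t)(*count + 1);
    return 0;
}

/* Same idea with an unsigned 16-bit counter: the storage is still
   two bytes, but the limit moves from 32767 to 65535. */
int checked_increment_unsigned(uint16_t *count) {
    if (*count == UINT16_MAX) {
        fprintf(stderr, "Counter limit of %u reached.\n",
                (unsigned)UINT16_MAX);
        return -1;
    }
    *count = (uint16_t)(*count + 1u);
    return 0;
}
```

Either way the check costs one comparison and a string. The unsigned version only postpones the limit, so it still needs the check.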
  • Re:New CIO? (Score:4, Interesting)

    by mslinux ( 570958 ) on Thursday December 30, 2004 @11:44AM (#11219015)
    I wonder how fast this CIO is going to be on his butt.

    Probably never. Our CIO is an idiot when it comes to technology. He has a law degree from a big college. He earns six figures for sitting in his office and trading stocks (his stocks) all day. I'll never forget the day he picked up a WordPerfect Office11 box looked at it and then said, "So Tom... who is it that makes Word?"

    These guys are dumber than dirt, but they're well-connected in the "good ole boy club"
  • by MadHungarian1917 ( 661496 ) on Thursday December 30, 2004 @12:41PM (#11219565)
    Although the application crashed that is not why Comair needed to cancel all their flights.

    In aviation there is a concept of "Legality" which basically states that the FAA in cooperation with the ICAO www.icao.int has set the allowable hours a pilot or F/A can work in a day. Since this is a crew scheduling application the airline was no longer able to determine the flight status of their crews and by FAA regulations needed to stop flying until they could once again determine their crew(s) flight status.

    The only plan B the FAA allows is flight cancellation unless the airline has an approved greaseboard process.

    The int overflow was just the nail in Ben Franklin's famous poem

    For want of a nail a shoe was lost
    for want of a shoe a horse was lost
    for want of a horse a rider was lost
    for want of a rider the battle was lost
    for want of a battle the kingdom was lost
    and all for the want of a horseshoe nail.
    (or long counter)
  • by fm6 ( 162816 ) on Thursday December 30, 2004 @01:00PM (#11219757) Homepage Journal
    If you're running software that's more than a decade old you need to know what the limits of your software are.
    Indeed. I get the impression that Boeing is very unmotivated when it comes to keeping its IS technology up to date. Until recently, they were still using slips of paper to track the process of assembling their airplanes!

    What's particularly disturbing is that nothing was done about this during the big Y2K push 5 years ago. Of course, the official goal of Y2K efforts was to make sure your computers didn't crash on 1/1/2000. But it's pretty hard to separate Y2K bugs from other clock bugs, and I think most places didn't even try. Easier to fix or document the bugs than to classify them. I was involved with the Y2K effort at SGI, and we looked at everything from leap year bugs to the Unix 32-bit clock overflow -- which won't occur until 2038!
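
The 2038 date can be checked with a few lines of C. This sketch assumes a POSIX-style `time_t` that counts seconds since 1970-01-01 00:00:00 UTC; the function name `last_32bit_second` is made up for the example. It converts the largest value a signed 32-bit seconds counter can hold into a calendar date.

```c
#include <stdint.h>
#include <time.h>

/* Fill *out with the UTC calendar time of the last second a signed
   32-bit seconds counter can represent (2^31 - 1 seconds after the
   epoch). Returns 0 on success, -1 if gmtime() fails. */
int last_32bit_second(struct tm *out) {
    time_t t = (time_t)INT32_MAX;
    struct tm *utc = gmtime(&t);
    if (utc == NULL) {
        return -1;
    }
    *out = *utc;
    return 0;
}
```

On such a system the result is 2038-01-19 03:14:07 UTC. One second later, a signed 32-bit counter has no value left to move to.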

  • SIGNED 16-bit!! (Score:3, Interesting)

    by adam31 ( 817930 ) <adam31.gmail@com> on Thursday December 30, 2004 @01:57PM (#11220333)
    what-- were they expecting negatives also?
  • Re:Well... (Score:3, Interesting)

    by HiThere ( 15173 ) * <charleshixsn@@@earthlink...net> on Thursday December 30, 2004 @06:43PM (#11222939)
    Actually, it *MIGHT* have. No guarantees.

    But it would have been much easier to fix.

    The problem here is that even though "with enough eyes, all bugs are shallow", this application is specialized enough that there might well not have been enough eyes. Still, if it were open, then the people who work there might have spent some time looking through it. And *MIGHT* have found the problem.

    OTOH, with open source, when the problem manifested, it could have been debugged and recompiled starting immediately. This might well have saved them several days, certainly several hours.

    OTOH, if it were FOSS, then it would be their job to maintain it. This would increase the chance of the problem being fixed, but decrease their ability to point fingers at someone else and say "It's all their fault!", which is the capability that they seem to most desire.
  • by dbIII ( 701233 ) on Thursday December 30, 2004 @07:31PM (#11223359)
    Essentially, the major carriers are hamstrung by the unions,
    It's often easier to blame the unions than fix bad management practices, and in a lot of cases competitors that have workers in the same union operate well and get the job done instead of throwing their hands in the air, blaming the unions, and refusing to move into this century.

    Unions are also used to seeing promises that pay will go back to normal in good times broken. The reality of capitalism is that if you don't have your act together enough to meet a known wages bill your company is probably going to expire.

    Unions are not the problem, though some unreasonable bastard in a particular union may be, and you get that in all kinds of organisations. It sounds like there is an "us or them" attitude going on, where each group hates the other, which can lead to all kinds of problems and the end of the company if it isn't sorted out.

    The USA has all kinds of protections to stop better run airlines coming in from overseas to create even more intense competition. The land that gave us ValuJet and the mess that was United in its final years really needs to get its act together, stop blaming the unions, and see if they can do as well as any of a score of airlines that would be happy to come in as soon as deregulation happens and show how airlines work in the rest of the world.

    When I go to the USA I'd better catch a bus, I bet the bus company's scheduling software is less than fifteen years old and has been updated if the company has grown - that's what most places do for business critical applications, and it has nothing at all to do with unions.
