IT Technology

Risk Management - A Cautionary Tale 203

Mr. Ghost writes "By now many people have heard about the fiasco and financial blunder Comair had over the 2004 Christmas holiday. An article on CIO provides a timeline of the decisions that led up to the system failure costing the division of Delta Air Lines $20 million. The article points out the need for proper risk management and what can occur when a risk analysis is not performed or is ignored. It goes on to mention that although this was a very public failure, this type of system failure can occur in other companies." From the article: "The prospect of replacing the ever-maturing crew management system was floated again the following year, with plans laid out to select a vendor in 2000. But that didn't happen. Over the next several years, Comair's corporate leadership was distracted by a sequence of tumultuous events..."
This discussion has been archived. No new comments can be posted.

  • by winkydink ( 650484 ) * <sv.dude@gmail.com> on Tuesday May 03, 2005 @01:29PM (#12422025) Homepage Journal
    Yes, senior management was distracted, but it's the CIO's job to warn senior management and the board about risks to the business as well as their likelihood of happening.
    • by Anonymous Coward
      I can't help but commiserate with the folks at Comair. Technology projects can be hard enough without having to deal with labor unions - which is really key to understanding Comair's problem. I was the project manager in the late 90's at TWA, hired to implement just a portion of what Comair is trying to replace. Scheduling systems hit at the heart of the pilots' work rules, and they won't give up a single work rule without a fight. That was true even when the union was the instigator of the change. Even aft
    • but it's the CIO's job to warn senior management and the board about risks to the business as well as their likelihood of happening

      The CIO's job is defined by the investors and management, not by a Slashdot post or even a standard definition. If the CIO is given many other things and told they are his priorities by the proper people, those things are his priorities. However, a good CIO would make risk management a priority with his company and if he could not...he would seek employment while he still had a
    • by swb ( 14022 ) on Tuesday May 03, 2005 @02:14PM (#12422576)
      I work in a business that isn't defined by technology (at least not historically), and I don't think that management actually listens or comprehends when it comes to a lot of IT issues.

      When they do listen, they tend to reduce it to profit/loss and destroy the subtlety of the information and its meaning. CIOs who "push" issues, especially when they're expensive, tend to get canned as gadflies, big spenders or for not being "team players".

      When it comes to technology, managers often don't care and don't want to know, except when it costs money.
      • When it comes to technology, managers often don't care and don't want to know, except when it costs money.

        That's their job. Companies exist to make money - end of story. Technology for technology's sake is foolish and wasteful unless you're in an R&D department.

        That being said, all technology spends (e.g. upgrades, redesigns, rewrites, replacements, etc.) can and should be boiled down to dollars that either fall into a profit/loss or risk/benefit category (hopefully over 3-5 years). If a CIO

    • I've worked in three large IT shops now, and the CIO of each company would typically know *very* little about a given application besides its name (if even that), much less info about specific features or flaws therein.

      When one works in an environment with several hundred in-house applications, it's easy for something to get lost in the shuffle, particularly if the application in question isn't normally a source of issues or is using a technology which isn't "mainstream" for the company...

    • by Angostura ( 703910 ) on Tuesday May 03, 2005 @03:40PM (#12423966)
      I'm sorry, but this story and your comment annoy me greatly.

      Here's the situation. The company had an old green screen application that was working just fine. It was old, but it did what the company needed. There was no hint that there was any fault.

      Now, one day the company had to cancel 90% of its flights - and whammo some double byte counter overflowed.

      What's all this crap in the article about old software "getting brittle"? This wasn't brittle aging software, this was software that was hit by an event that took it outside of its design parameters.

      How would *you* have judged the risk of this software failing? How would that risk compare with the risk of installing a new untested package?

      • by Knara ( 9377 ) on Tuesday May 03, 2005 @04:56PM (#12424900)
        I'd agree, but the fact that it was written in FORTRAN and they didn't have a single maintenance developer (even if it wasn't that developer's primary role) assigned to it who *knew* FORTRAN suggests that there was a whole lot of "buhhhhhh??" going on in that particular IT department.
        • At an old job, we supported a Fortune 500 company that had an application written in assembly. Nobody at the client company knew any assembly. They hadn't the faintest idea how it worked - they just knew that it did.

          I suspect this situation repeats itself in many companies.
    • Because the system was in fact working, and there was no way for him to know about 16-bit values being used in the software. It was a latent problem that would not degrade over time; it just completely broke one day.

      Perhaps the docs for the software would indicate this problem. Did anyone RTFM at Comair?

  • by Megaweapon ( 25185 ) on Tuesday May 03, 2005 @01:30PM (#12422028) Homepage
    but all it takes for a good number of companies to get egg on their face is one careless mid-level employee who is too casual with passwords (and/or takes their work home on laptops with info unencrypted)...
  • by LegendOfLink ( 574790 ) on Tuesday May 03, 2005 @01:39PM (#12422122) Homepage
    Um...like making sure you run your Windows Updates. Because if you don't, you're gonna regret it.

    Then again, even if you do, you're still going to regret it.

    So, I guess the moral of the analogy is that it's better to patch your system and risk your hardware not working properly than having spyware or a virus on your system.
  • by rewinn ( 647614 ) on Tuesday May 03, 2005 @01:40PM (#12422133) Homepage

    From the article:

    As it turned out, the crew management application, unbeknownst to anyone at Comair, could process only a set number of changes--32,000 per month--before shutting down.

    Sounds like some sort of overflow problem. Hmmm....

    The big issue is, of course, the business units and IT playing "After you, Alfonse..." but it's fun to seek out the pebble that set off the avalanche.

    • And yet the idiot from EDS has this to say:

      "These systems are just like physical assets," says Mike Childress, former Delta CTO and now vice president of applications and industry frameworks for EDS. "They become brittle with age, and you have to take great care in maintaining them."

      You can easily run software for 20 years and it will not fail so long as you don't exceed its operating parameters. That's also assuming you can source replacement kit for hardware failures.

      Software does not age.
      • by josecanuc ( 91 ) on Tuesday May 03, 2005 @02:30PM (#12422819) Homepage Journal
        Exactly... The article author seems to point to the fact that the software was old and just waiting to die...

        Because NO ONE knew of the particular limit that was exceeded, those who were supposed to calculate risk never knew what the tipping point was.

        All they could say was "our software is old, someday it may not work any more, but I cannot say for what reason, because I do not know FORTRAN."

        How the hell can you calculate risk if your only input is the chronological age of a software system?
        • by BattleTroll ( 561035 ) <battletroll2002@yahoo.com> on Tuesday May 03, 2005 @03:24PM (#12423750)
          "How the hell can you calculate risk if your only input is the chronological age of a software system?"

          That wasn't the only input in this case. In fact, you don't have to know the gory details of the implementation to determine risk, just the business impact of a problem with the system.
          • Since no one at the company understood the language used, it stands to reason no one understood what the system was doing. Risk: Medium
          • The system was mission critical to the performance of almost every other function of the airline. If the system was lost, the airline was hosed. Risk: Critical
          • They had no failover plan in place in case the system went down. Risk: High
          • No load tests were possible since they only had the one system in place. Without load testing the only way to find out the system fails under load is to wait until it fails in production. Risk: High
          It stands to reason there were other risks involved that weren't identified in the article.
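          For illustration only, here's a minimal sketch of how that kind of qualitative, implementation-agnostic assessment could be written down (Python; the entries and scores are hypothetical, loosely paraphrasing the list above, not anything taken from the article):

            # Minimal, illustrative risk register: rate each risk by business impact
            # and likelihood, then sort so the worst exposures surface first.
            # All entries and scores are hypothetical, not Comair's actual data.
            RISKS = [
                # (description, impact 1-5, likelihood 1-5)
                ("No in-house knowledge of the implementation language", 3, 4),
                ("System is mission-critical to nearly all operations",  5, 3),
                ("No failover or manual fallback plan",                   5, 3),
                ("No way to load-test outside production",                4, 3),
            ]

            def exposure(impact: int, likelihood: int) -> int:
                """Simple impact x likelihood score; higher means act sooner."""
                return impact * likelihood

            for desc, impact, likelihood in sorted(RISKS, key=lambda r: -exposure(r[1], r[2])):
                print(f"{exposure(impact, likelihood):>2}  {desc}")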
          • by ScuzzMonkey ( 208981 ) on Tuesday May 03, 2005 @03:51PM (#12424087) Homepage
            "They had no failover plan in place in case the system went down."

            With that, you've hit the heart of the matter, and what the article should have focused on rather than the "old software breaks down" BS. This was a bug which could have hit at ANY time since the software was installed; it was an overflow, not a rusting subroutine that fell off. I can't personally see any way that they could have foreseen this particular problem but when you have a system that is so critical to your operation, you don't look for problems it might have--you look for alternatives to fall back to when it DOES have problems.

            You never see them coming. But you'd better plan for them anyway.

            • First, the longer a piece of software is in use, the greater the chance of finding an obscure or unlikely error condition. The older a piece of software, the more of its bugs will become apparent, and the more likely it is that a crippling bug will be found. Old software breaks down.

              Second, operating constraints change over time. If a piece of software meets its initial demands, greater and greater demands are placed on it over time. If a piece of software is kept in use for many years, it will likely f
      • >>"These systems are just like physical assets," .... They become brittle with age,

        >That's also assuming you can source replacement kit for hardware failures.

        And how the hell is that different from what he said?

        (Systems = hardware + software)
      • by Peter La Casse ( 3992 ) on Tuesday May 03, 2005 @03:32PM (#12423859)
        Software does not age.

        Software does age. As a program grows older, people change it, its inputs and how it is used, and the older a program gets, the less the people making the changes are likely to understand it.

        In addition, some bugs don't manifest themselves under usage patterns from 20 years ago, or when the software is run on hardware from 20 years ago, but they do manifest themselves under usage patterns or on hardware that's in use now. The more you change, especially without understanding all of the ramifications of that change, the greater the risk for error.

        That's what software aging is.

        • Who exactly was changing the code, since no one there supposedly knew FORTRAN?

          And usage patterns are not conveniently tied to time, but rather, well, usage. The airline could have hit it big two months after this package was deployed and run into the exact same bug.

          I think it is an easy-to-sell analogy for people who work on airplanes, but in fact software does not "age", and treating it as if it does is a fundamental risk factor in and of itself because doing so invites a complete misunderstanding of why
          • Who exactly was changing the code, since no one there supposedly knew FORTRAN?

            It is the system that changes when software ages; the system is composed of software, hardware, data, documentation, users and business practices. It's not necessary for every one of those to change in order for change to occur.

            in fact software does not "age"

            This is a commonly held myth, and it leads people to think that maintenance of software-based systems is not necessary. That's a big mistake.

            Software doesn't we

            • Then it would be "system aging" not "software aging"; the mistake is in your choice of terminology. "Aging" in fact is still a terrible term to use to describe this process, since it has little or nothing to do with time and everything to do with modification and utilization. It's hardly a foregone conclusion that software-based systems will deteriorate without maintenance. You're making a poor generalization based on inaccurate assumptions of utilization across the board.

              It escapes me why people feel
    • Sounds like some sort of overflow problem. Hmmm....

      That depends. I suppose you could call the software involved here mission-critical. In that case one might expect limits like the ~32,000/month to be documented (not in this case, if I read it right). If that limit had been documented, then the failure would not have been an overflow bug but a failure to RTFM/use the system within spec, which is management/operator error.

      Also it matters how exceeding a limit is handled (graceful degradation). Did this system say: "I'

      • >it matters how exceeding a limit is handled (graceful degradation)

        Your point on correct software design is exceedingly well taken ...

        ... but I just love the term "Graceful Degradation". Is it from Faulkner, or a New Wave band?

  • Article text (Score:5, Informative)

    by daVinci1980 ( 73174 ) on Tuesday May 03, 2005 @01:40PM (#12422135) Homepage
    Site is already sluggish.

    Bound To Fail
    The crash of a critical legacy system at Comair is a classic risk management mistake that cost the airline $20 million and badly damaged its reputation.
    BY STEPHANIE OVERBY

    When Eric Bardes joined the Comair IT department in 1997, one of the very first meetings he attended was called to address the replacement of an aging legacy system the regional airline utilized to manage flight crews. The application, from SBS International, was one of the oldest in the company (11 years old at the time), was written in Fortran (which no one at Comair was fluent in) and was the only system left that ran on the airline's old IBM AIX platform (all other applications ran on HP Unix).

    SBS came in to make a pitch for its new Maestro crew management software. One of the flight crew supervisors at the meeting had used Maestro, a first-generation Windows application, at a previous job. He found it clumsy, to put it kindly. "He said he wouldn't wish the application on his worst enemy," Bardes recalls. The existing crew management system wasn't exactly elegant, but all the business users had grown adept at operating it, and a great number of Comair's existing business processes had sprung from it. The consensus at the meeting was that if Comair was going to shoulder the expense of replacing the old crew management system, it should wait for a more satisfactory substitute to come along.

    And wait they did. The prospect of replacing the ever-maturing crew management system was floated again the following year, with plans laid out to select a vendor in 2000. But that didn't happen. Over the next several years, Comair's corporate leadership was distracted by a sequence of tumultuous events: managing the approach of Y2K, the purchase of the independent carrier by Delta in 2000, a pilot strike that grounded the airline in 2001, and finally, 9/11 and the ensuing downturn that ravaged the airline industry.

    A replacement system from Sabre Airline Solutions was finally approved last year, but the switch didn't happen soon enough. Over the holidays, the legacy system failed, bringing down the entire airline, canceling or delaying 3,900 flights, and stranding nearly 200,000 passengers. The network crash cost Comair and its parent company, Delta Air Lines, $20 million, damaged the airline's reputation and prompted an investigation by the Department of Transportation.

    Chances are, the whole mess could have been avoided if Comair or Delta had done a comprehensive analysis of the risk that this critical system posed to the airline's daily operations and had taken steps to mitigate that risk. But a look inside Comair reveals that senior executives there did not consider a replacement system an urgent priority, and IT did little to disrupt that sense of complacency. Though everyone seemed to know that there was a need to deal with the aging applications and architecture that supported the growing regional carrier--and the company even created a five-year strategic plan for just that purpose--a lack of urgency prevailed.

    After the acquisition by Delta, former employees say Comair IT executives didn't do the kind of thorough management analysis that might have persuaded the parent airline to invest in a replacement system before it was too late. Instead, Delta kept a lid on capital expenditures at Comair, with unfortunate consequences. The failure of the almost 20-year-old scheduling system not only saddled Delta with a plethora of customer service and financial headaches that the airline could ill afford but it also provides a cautionary tale for any company that thinks it can operate on its legacy systems for just...one...more...day.

    The five-year plan that wasn't
    Today, Cincinnati-based Comair is a regional airline that operates in 117 cities and carries about 30,000 passengers on 1,130 flights a day, with three or four crew members on each. But back in 1984, when Jim Dublikar joined the company as director of finance and risk management, Comair had
  • by Anonymous Coward on Tuesday May 03, 2005 @01:42PM (#12422160)
    --------------------- Cut Here ---------------------
    Posts above this line have not RTFA.
  • by hellfire ( 86129 ) <deviladvNO@SPAMgmail.com> on Tuesday May 03, 2005 @01:43PM (#12422177) Homepage
    Okay, like many slashdotters, I have a short attention span and I don't remember this "public" story about Comair committing this blunder.

    I have a real question. Why did Comair's system fail in the first place? Was it due to a design flaw requiring its replacement in 2004? Was it an irreplaceable piece of hardware which died?

    The article smacks of FUD, only because systems fail for a reason. The article conveniently leaves out the reason for the failure. I think this is critical to any risk analysis. For example, if I have a 20 year old system that I can't get parts for, that's a high risk system. However, if I can get parts for a 20 year old system, then the risk is lower.

    I don't like the idea of making assumptions that just because a system is 20 years old, that it absolutely must be replaced. I also don't like the assumption in the article that I already know the facts, so here's the analysis for you. I want the facts to back it up so I can come to my own conclusion.
    • Why did Comair's system fail in the first place?

      If I understand the article correctly, the database could only handle 32,000-odd transactions in a month. In December 2004, rescheduling caused by bad weather pushed the database to its limit exactly on Christmas Day, and everything shut down. It wasn't until December 29th that everything was back up again.

      Oh, and they're still using the old system: they've divided the database up, with each half having its own 32,000-transaction limit, but that's about
      • So it does sound like, essentially, they likely didn't know specifically how it was going to break, but simply had the mindset of 'it's old, something's going to break, we need to refresh this'. That would be dangerous thinking if true; it just happened to be right this time.

        The reaction most people are having is to say 'code is 20 years old, throw it out and redo it right!' which is a really bad philosophy for proven systems. In this case, for example, the prudent response is to examine the code and
    • by Jayfar ( 630313 ) on Tuesday May 03, 2005 @01:52PM (#12422279)
      The article conveniently leaves out the reason for the failure.

      No, the article conveniently explained that the software had a limit of 32,000 schedule changes per month. A severe winter storm necessitated enough changes to make the system fall over.

    • With a signed 16-bit integer, you have 1 bit for the sign, and 15-bits for the rest of the number. Depending upon any error handling by the compiler, you could get NaN (not a number), maybe zero, maybe -32767, or maybe just a core dump. In any event, the result is not what you are expecting.
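      For illustration, a minimal sketch of the silent wraparound (Python, using ctypes to mimic a 16-bit signed counter; the variable name is hypothetical, since Comair's actual Fortran code isn't public):

        import ctypes

        # Simulate a 16-bit signed counter of schedule changes per month.
        # This only illustrates how a signed 16-bit value wraps past 32,767.
        changes_this_month = ctypes.c_int16(32767)
        print(changes_this_month.value)    # 32767, at the ceiling

        changes_this_month = ctypes.c_int16(changes_this_month.value + 1)
        print(changes_this_month.value)    # -32768: silent wraparound, no error raised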
    • The system had an internal 'flaw' that limited it to a maximum of 32,000 changes per month before it would crash. Due to increases in size over the years and the large number of scheduling changes made during the Christmas holiday season this maximum was reached and the system crashed.

      So there were really two design/coding flaws that caused the crash. First, the limit on the number of changes. Second, the lack of proper error handling when the maximum number of changes was reached. So it took both of these
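      A minimal sketch of what the second point might look like in practice: check the known limit up front and fail loudly, rather than letting the counter wrap (Python; the limit constant and function are hypothetical):

        MONTHLY_CHANGE_LIMIT = 32_000  # known capacity of the (hypothetical) counter

        class CapacityError(RuntimeError):
            """Raised before the counter can silently wrap."""

        def record_change(changes_this_month: int) -> int:
            """Increment the monthly change count, refusing to exceed the known limit."""
            if changes_this_month >= MONTHLY_CHANGE_LIMIT:
                # Fail with a clear, actionable error instead of corrupting state.
                raise CapacityError(
                    f"monthly change limit of {MONTHLY_CHANGE_LIMIT} reached; "
                    "escalate to operations before accepting more changes"
                )
            return changes_this_month + 1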
    • that your comment has failed: Lack of attention
    • > I don't like the idea of making assumptions that just because a system is 20 years old, that it absolutely must be replaced.

      >...if I have a 20 year old system that I can't get parts for, that's a high risk system.
      > However, if I can get parts for a 20 year old system, then the risk is lower.

      Good points. The article does contain some facts, though. The system was Fortran based, ran only on one aging hardware platform, and no one at Comair knew Fortran. Those are risk factors with older software.
    • One of the major drivers to replacing older systems is in-house programming knowledge. It's not enough that you may not have Cobol/Fortran/Business Basic developers on hand who intimately know the legacy code. You may not have *any* competent developers on staff at all for those languages, because the market for them might be the size of an ant's navel. Heck, you might not even own the code itself.

      Even if you do have a couple, they'll be older and likely not replaceable at retirement. Documentation is help
    • I don't like the idea of making assumptions that just because a system is 20 years old, that it absolutely must be replaced. I also don't like the assumption in the article that I already know the facts, so here's the analysis for you. I want the facts to back it up so I can come to my own conclusion.

      How about this: every few years, reexamine the limitations and requirements of the system. Upgrade or replace the system when it gets too close to those limits.
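      A minimal sketch of that kind of periodic review, as a capacity check against known limits (Python; the limits and usage figures are made-up placeholders):

        # Compare current usage against each known hard limit and flag anything
        # within a safety margin. Numbers below are placeholders, not real figures.
        KNOWN_LIMITS = {
            "schedule_changes_per_month": 32_000,
            "crew_records": 100_000,
        }
        CURRENT_USAGE = {
            "schedule_changes_per_month": 27_500,
            "crew_records": 41_000,
        }
        MARGIN = 0.80  # review anything using more than 80% of its limit

        for name, limit in KNOWN_LIMITS.items():
            used = CURRENT_USAGE[name]
            if used >= MARGIN * limit:
                print(f"REVIEW: {name} at {used}/{limit} ({used / limit:.0%})")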

  • And even though she was screaming from the highest mountain, to anyone and everyone who would listen, that doom was rushing toward them, that bad, bad things were going to happen, she was still made the sacrificial goat when the fecal material hit the rotating blades.

    And this was for a federal agency.

    Scary no?

  • by justanyone ( 308934 ) on Tuesday May 03, 2005 @01:48PM (#12422235) Homepage Journal

    I used to work in the Risk Management department of the capital markets division of a large international bank [jpmorganchase.com] as a programmer.

    When I started, 4 years ago, the reports generated were basically compilations by a cut-and-paste-monkey staff (despite being highly trained, very conscientious individuals) of reports generated by other departments. I was part of a team that reformed the IT basis for creating risk reporting, and found that while there was a lot of expertise and complex methods available, what was actually implemented was much much smaller for the simple reason that it was tough to get the right reports generated given the inputs the department was given.

    The project I worked on parsed the input data from the Excel spreadsheet inputs and loaded it to a database, where it could then be queried intelligently and nice reports generated. These reports were growing very fast in complexity, building towards the best toolsets available for determining the actual risk the bank was taking.

    Several points about this job were fascinating:
    1. How many departments are so caught up in the minutiae of "getting the report out" that they don't have time to examine the contents of it;
    2. How much money can be made by knowing what the actual risk is. If you don't know the risk, you estimate high, and put lots of dollars in a reserve account. If you do know the risk accurately, you usually can greatly lower reserves to accurately meet even very bad case estimated losses, and use the rest of the money to fund interest-generating ventures.
    3. How much the banking consolidation trend is increasing, due to the repeal of Glass-Steagall allowing multi-state banks to gobble and grow. This makes a consumer's life better because of more resources being available (auto-bill-pay, check images, etc.).

    It was a fun job. Then I found another one where I get to play with Python!

    -- Kevin
    • > I used to work in the Risk Management department of the capital markets division of a large international bank [jpm[*cough*].com] as a programmer
      >
      >[...]
      >
      >It was a fun job. Then I found another one where I get to play with Python!

      Huh? The story's supposed to end with the line "VAXen, my children, just don't belong some places." [syr.edu] :-)

    • just out of curiosity, where do you work now that you get to work with Python?

      -Jay
    • 2. How much money can be made by knowing what the actual risk is. If you don't know the risk, you estimate high, and put lots of dollars in a reserve account. If you do know the risk accurately, you usually can greatly lower reserves to accurately meet even very bad case estimated losses, and use the rest of the money to fund interest-generating ventures.

      This approach usually defines risk independently (typically as variance around a mean) for each individual item. The items are then observed (or just a
      • usually defines risk independently

        The reality of modern banking risk management (in my experience at Bank One, which became JPMC) was that there were many different measures of risk attached to each exposure. The popular ones are the standard short term 'delta', or DV01, which measures a specific 1-day interest rate risk, gamma, vega, etc.
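        For the curious, a minimal sketch of a DV01-style number: reprice a plain fixed-coupon bond after a one-basis-point shift in yield and take the difference (Python; the bond terms are made up for illustration and have nothing to do with any real book):

          def bond_price(face: float, coupon_rate: float, years: int, y: float) -> float:
              """Price a plain annual-coupon bond by discounting its cash flows at yield y."""
              coupon = face * coupon_rate
              pv_coupons = sum(coupon / (1 + y) ** t for t in range(1, years + 1))
              pv_face = face / (1 + y) ** years
              return pv_coupons + pv_face

          # Hypothetical bond: $1m face, 5% coupon, 10 years, currently yielding 5.00%.
          base = bond_price(1_000_000, 0.05, 10, 0.0500)
          bumped = bond_price(1_000_000, 0.05, 10, 0.0501)  # yield up one basis point

          dv01 = base - bumped  # dollars lost per 1bp rise in rates
          print(f"DV01 ~ ${dv01:,.0f}")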

        There's also something called stress testing, and it usually involves lots of cycle time to run (we ran it over weekends). This would take several scenarios, includin
  • software decays (Score:5, Interesting)

    by ecklesweb ( 713901 ) on Tuesday May 03, 2005 @01:51PM (#12422258)
    One of the interesting quotes from the article:

    Unfortunately, you can't see a crew management system age the way you can see an airplane rust. But they do.

    I find that an interesting if not slightly obvious insight. The interesting part is that you can know that software is decaying, but I don't know of any effective way to measure that decay. I don't even know of any particularly good ways to characterize the decay. It's not as if new defects are being introduced into code that's not changing. But the environment in which the software operates changes, and that change is analogous to weather corroding a piece of physical equipment. Every time the OS gets a patch, the filesystem changes, a shared library is upgraded, the underlying hardware changes, there's a chance of triggering a failure in the software.

    Can it be proven, or should we otherwise reasonably believe, that the probability of catastrophic system failure approaches 1 as the age of the system increases? Maybe a good topic for a research paper...
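    One hedged way to frame it: if every month of operation carries some small, independent chance of hitting an as-yet-unknown failure condition, the cumulative probability of at least one failure does tend toward 1 as the system keeps running. A minimal sketch (Python; the 0.2%-per-month figure is arbitrary, purely for illustration):

      # If every month carries the same small, independent chance of tripping an
      # unknown limit, cumulative failure probability tends toward 1 over time.
      p_month = 0.002  # arbitrary illustrative per-month failure probability

      for years in (1, 5, 10, 20):
          months = years * 12
          p_by_then = 1 - (1 - p_month) ** months
          print(f"{years:>2} years: P(at least one failure) = {p_by_then:.1%}")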
    • Maybe now's a good time to put in a plug for the RISKS "Forum On Risks To The Public In Computers And Related Systems. [ncl.ac.uk]"
      It sounds academic, but it's full of level-headed dissection of all kinds of software-related disasters, ranging from the hilarious, like the USS Yorktown dead in the water [ncl.ac.uk] after a divide by zero, to the horrifying. The contributors are skeptical but polite, and I learn new stuff with every issue.
    • Re:software decays (Score:4, Insightful)

      by adjuster ( 61096 ) on Tuesday May 03, 2005 @02:55PM (#12423264) Homepage Journal

      But the environment in which the software operates changes, and that change is analogous to weather corroding a piece of physical equipment. Every time the OS gets a patch, the filesystem changes, a shared library is upgraded, the underlying hardware changes, there's a chance of triggering a failure in the software.

      It's rather sad, to me, that we design these wonderful machines that can perform logical operations in great quantities with a high degree of repeatability and low occurrence of failure, then create a culture around them that encourages sloppiness, and ultimately introduces a large measure of uncertainty into the operation of these machines. I am baffled at the perverse desire-- nay need-- that people seem to have to make software suffer from entropy.

      The only "decay" in software should happen as a result of changing business requirements. There's no reason that, provided the business requirements don't change, that a well designed and properly implemented piece of software should not be usable in perpetuity. There may be changes in the underlying hardware and operating system software, but provided that the application is sufficiently abstracted from the underlying platform (or, provided that an emulation-layer for the original platform can be constructed) there's no reason other than changing business requirements for software to be "thrown away".

      Let's put this a different way: How does a patch to the underlying operating system cause an application to fail? If the patch changes the behaviour of the underlying operating system in such a manner as to return unexpected values to the application, the patch is the cause of the failure. A flawed patch doesn't make an application "age" or "decay"-- it's simply a flawed patch. An application has to make assumptions about the underlying operating system. These assumptions are based on the API documentation-- the contract between the operating system and the application. When the OS violates the terms of the contract, that doesn't mean the application "decayed"-- it means some moron who coded the operating system patch messed up, and the operating system manufacturer/maintainer didn't perform good regression testing.

      We should be designing software systems with 10 to 20 year usability goals. It would do a lot for the frustration level that the "suits" have with IT if we stopped being proponents of hugely expensive but "throwaway" systems, and started designing systems with an eye for longevity.

      • Re:software decays (Score:5, Insightful)

        by hawaiian717 ( 559933 ) on Tuesday May 03, 2005 @03:02PM (#12423399) Homepage
        The only "decay" in software should happen as a result of changing business requirements.

        Exactly. This software would have failed the month after it was installed if Comair had needed to do 32,001 changes in that month. But when it was installed, Comair wasn't that big, so having to do that many changes was not something that was considered. Now that Comair has grown considerably, the business requirement has changed but the application has not kept up.

      • Re:software decays (Score:2, Informative)

        by qwijibo ( 101731 )
        The problem can also occur because the original application is tested against the real system, not the documented API. So a bug fix to the underlying system can be both correcting a bug and creating an application error.

        Throwaway systems are cost effective in the short term. That makes them popular with people who look at this quarter's stock price as both a goal and the duration of their attention.
    • But the environment in which the software operates changes, and that change is analogous to weather corroding a piece of physical equipment. ... The interesting part is that you can know that software is decaying, but I don't know of any effective way to measure that decay.

      There's a relatively easy way to measure such "decay". When first designing the software, do proper requirements gathering and write a full formal requirements specification (there are specification languages specifically for this purpo
    • I find it a rather humorous 'insight'. If the airplanes are 'rusting' then they don't need to worry about software.

      Modern airplanes don't rust. They die of metal fatigue, which aluminum is much more prone to than 4130 steel.

  • /.ed (Score:5, Funny)

    by christoofar ( 451967 ) on Tuesday May 03, 2005 @01:52PM (#12422278)
    Wow. Looks like even the mag for CIOs can't keep up with a /. DDoS attack. Maybe the CIO for CIO should be fired?
  • by NetNinja ( 469346 ) on Tuesday May 03, 2005 @01:54PM (#12422306)
    If the crew scheduling system was as old as the hills, how old is the system used to track aircraft maintenance? Oh wait, that issue will be addressed when we crash an aircraft.

    Maintenance manuals and procedures are written in blood. The next tragedy will be no different.
  • by morryveer ( 870752 ) on Tuesday May 03, 2005 @01:59PM (#12422377) Homepage
    Legacy == Bad, gonna die, just like dear Grandad. Should've rewritten it in Java, that'd fix it!
  • by argoff ( 142580 ) on Tuesday May 03, 2005 @02:05PM (#12422449)
    It is always easy to say "I told you so" after the fact, but the reality is that this failure has far more to do with the company's attitude about technology than with a failure of somebody to say "look out!". In fact, by the sounds of it, the entire application could probably be run on 2 souped-up PCs running in parallel in different co-locations over the internet - the hardware and infrastructure would not cost a lot.

    Even worse is that when these types of failures happen, the ole "policy and procedure" routine kicks in.

    To tell a story: one time I went to a boarding school, and at the beginning of the year they had almost no rules; then whenever something went wrong they added a new rule. Well, needless to say, at the end of the year there were so many rules that people could get reprimanded for flushing the toilet twice instead of once, for not having their shoes tied left over right, etc.....

    Well, I grew up and found the same is true in companies. How much you wanna bet they are gonna lose more than $20 million from too many piled-up policies and procedures that keep anyone from getting anything done?
  • Risk management (Score:3, Insightful)

    by uweg ( 638726 ) on Tuesday May 03, 2005 @02:05PM (#12422456) Homepage
    Well, the problem starts with being born or getting up in the morning. And a system that has been running for 20 years normally doesn't start to stink by itself.

    OTOH, what does "risk management" in IT really mean, besides drawing nice PowerPoints and putting a "Risk analysis" chapter into change request forms that is normally filled in with "No risk, no fun!" or "If I make a very big mistake, it will extinguish mankind"?

  • A game of Jenga (Score:5, Insightful)

    by lake2112 ( 748837 ) on Tuesday May 03, 2005 @02:06PM (#12422472)
    Unfortunately, it is commonly seen that upper management abides by the if-it-ain't-broke-don't-fix-it mentality. With many systems there is a huge amount of pressure to fix bugs/outstanding issues; once that is done, they work on money-making initiatives. I see it as a game of Jenga. Pieces are removed from the bottom to create a taller structure. Instead of reinforcing the base, there is a constant push to make the tower taller until it comes crashing down.
  • by fm6 ( 162816 ) on Tuesday May 03, 2005 @02:34PM (#12422876) Homepage Journal
    The Slashdot headline is misleading (as usual). This is only incidentally about risk management. The real subject is, Legacy Applications -- and how you get rid of them before they bite you in the ass.

    As the article says, a lot of resistance to upgrades comes from employees who know how to do things a certain way, and won't retool without much screaming and kicking. I suspect that this is often the problem, and other problems -- distractions like strikes and the Y2K bug, management that doesn't pay sufficient attention to the problem -- are just secondary.

    Here's some personal experience that isn't nearly the same scale, but neatly illustrates what I mean. I once worked for a pubs department that delivered copy to printshops as raw Postscript. There was a push from management to upgrade to Acrobat-generated PDF. This should have been a no-brainer -- print shops hate dealing with raw Postscript, and the existing process relied on an ancient, unsupported printer driver that ran only on Windows 98. But the people who managed the process just totally balked, claiming that tight schedules left them no extra time to learn Acrobat. A lame excuse? Sure. But it took a new pubs manager, and escalation to the do-it-or-you're-fired level, to get the change made.

    I think this kind of issue had a lot to do with the failure of IBM's famous plan to use Unix or Linux for all their internal bureaucratic needs. Too many people dug in their heels, claiming that they couldn't possibly retool their Windows-based workflow.

    When you talk about this stuff, somebody always says, "If people can't get with the program, they should be fired!" Well, it often comes to that, as it almost did with the PDF issue. But you can't just arbitrarily fire everybody who resists policy and process changes. It's expensive, there are legal ramifications -- and you risk destroying the very corporate infrastructure you're trying to save.

    • The trick is to determine whether or not the cost to convert a given system is actually worth it.

      If a rewrite effort requires 50,000 or 100,000 man years to complete, you're talking serious money...
      • You're hiding behind a silly quibble over the definition of the word "legacy". Nobody calls an application that rolled out last year a "legacy application". There may be a gray area, but there is clearly a lot of crap out there that's "legacy" in the worst sense of the word. Software that doesn't generate or use the kind of data that fits modern workflows. Software that only runs on ancient platforms that you keep around for the sole purpose of running them. Software that requires constant attention by agin
        • A former employer of mine hired a contractor to write a small system for them, and when it was done his contract was over so he left.

          The software was written in a modern language on a modern platform, but the employer did not have any of its own expertise in that language. Some of the folks there took shots at making small changes, but for the most part the thing was a black box.

          Was it a legacy application or not?

          My point: there's a HUGE grey area.

          Even the data supposedly "locked" on so-called legacy s
    • by Anonymous Coward
      "As the article says, a lot of resistance to upgrades comes from employees who know how to do things a certain way, and won't retool without much screaming and kicking."

      And why is that a bad thing? If the software is a good tool for the task at hand, they should keep using it. In fact, the article clearly says that this program was in many ways superior to newer programs on the market - which is why they didn't upgrade earlier. They say they were able to create good workflows based around the software -
  • Hmm... (Score:3, Interesting)

    by Greyfox ( 87712 ) on Tuesday May 03, 2005 @02:39PM (#12422975) Homepage Journal
    From what I can gather of the airline industry in general, it's a bunch of assorted systems that are sort of held together by duct tape and spit. If ever an industry needed open standards, mandated interoperability and thorough design and code auditing, I'd say that'd be the one. It seems to me that there really needs to be one central IT shop which rolls out all the software for airline and FAA IT needs and all airlines should go through that single central clearinghouse.
    • They've also historically had fairly large IT shops. That has given them a lot of time and manpower over the past four decades to write custom software for themselves, and that has resulted in many unique airline-specific systems, sometimes running on interesting combinations of hardware.

      One of the main problems with a "central IT shop" for the airlines is the fact that, operationally, each airline is somewhat unique in terms of the internal operational procedures they use, and many of the software applic
  • Old? (Score:3, Insightful)

    by Nemi ( 627009 ) on Tuesday May 03, 2005 @02:52PM (#12423200)
    Age of the software should make no difference. The problem in this particular case was that the system could only handle 32,000 transactions a month (the programmer obviously used the wrong data type). That could be a problem with software of any age. Age had nothing to do with it failing.

    This article rings more as a sales article than anything else - only it isn't selling anything. Which puts it squarely in the "wtf" category for me.

    • Age in this case is misleading. As you say, it could have happened with anything.

      What it should be emphasising is the importance of risk evaluation in the context of "disaster recovery". Had the business sat down to write a proper disaster recovery plan on the basis of "OK, what happens if this system goes completely kaput and all we have left are the offsite backups?" then it would have become clear that here was a business critical system which had no coherent DR plan.
  • by CatsupBoy ( 825578 ) on Tuesday May 03, 2005 @02:53PM (#12423229)
    Ok, the bottom line, they should have upgraded. Fine, we can all agree on that.

    Now, first the article states:
    [The application] was the only system left that ran on the airline's old IBM AIX platform (all other applications ran on HP Unix).
    First off, the IBM AIX platform can be very new. Just because the application is old and possibly has bugs in it doesn't mean the OS and hardware aren't updated, or that HP Unix is any better.

    Secondly, the following scenario makes perfect business sense:
    SBS came in to make a pitch for its new Maestro crew management software [...] The existing crew management system wasn't exactly elegant, but all the business users had grown adept at operating it, and a great number of Comair's existing business processes had sprung from it.
    The article sets this up as the root of all their problems. Good grief!!! Don't waste resources on an inferior product, for goodness' sake! If the product doesn't perform any better, and there are no known issues with the current product, forget it; it's a waste of money.

    Then a series of unfortunate events led to 4 more years of no funding for a replacement product. So what? The business is under a financial crunch; why go back and fix something that isn't broken (that they know of)? The business still needs to survive, doesn't it? I'm guessing they maintained the hardware and OS, otherwise we'd be here talking about how stupid they were for not updating maintenance contracts.
  • by ehiris ( 214677 ) on Tuesday May 03, 2005 @02:58PM (#12423316) Homepage
    IT people know that technology will be obsolete in a short time, but most business people always see technology as flashy cost reducers and they never plan on retiring the systems from the get-go. It's an annoyance, but it is not surprising in an industry where duct taping old systems is preferred over structural improvements through architecture.
  • by EricTheGreen ( 223110 ) on Tuesday May 03, 2005 @03:53PM (#12424113) Homepage
    ...IMHO, can be found in the following single line from The Fine Article:


    But after nearly 15 years in use, the business had grown accustomed to the SBS system, and much of Comair's crew management business processes had grown directly out of it.

    (emphasis added)

    Talk about putting the cart in front of the horse. This system would never have been replaced before its crash--the cost of readjusting process and any other attached technology would have dwarfed simply updating the software. There was no business case you could make that would appear to justify the expense. Other than the little matter of "your company won't function if something goes wrong", of course...

    Also, you'd never find a decent replacement product--since its functionality would have to mirror those same system-driven business processes.

    The truly major oversight was in letting the package drive how Comair did this part of its business in the first place. Done otherwise, the meltdown might still have happened, for plenty of reasons outlined in the article. But left this way, this result was pre-ordained. No amount of planning or "risk assessment" was going to counter the inertia created by this process/technology inversion.

  • Now if we can only tell the military that Ada is dead we'll be in business!

  • What SDLC model were they using for that application?
