How Would You Handle a $1,000,000 Coding Error? 878

Posted by timothy
from the wasn't-me dept.
theodp writes "The Chicago Tribune's efforts to upgrade its computer system over the weekend turned into a fiasco when the system crashed, halting all printing operations and leaving about half of the Trib's subscribers without papers. The software contained 'a coding error,' according to a spokesman who estimated the cost to resolve the problem at 'under $1 million.' Any advice for the poor schmuck who's going to get the blame?"

  • Just one (Score:5, Funny)

    by kalidasa (577403) * on Monday July 19, 2004 @11:39PM (#9744849) Journal
    Check out this link [state.il.us]. Sorry, dude. Any of us could have done it.
    • Re:Just one (Score:4, Funny)

      by RTPMatt (468649) on Tuesday July 20, 2004 @12:03AM (#9745103) Homepage
      Any advice for the poor schmuck who's going to get the blame?

      Dod\/ge
      |_______________________________________>schmuck

    • McDonald's (Score:5, Funny)

      by drwtsn32 (674346) on Tuesday July 20, 2004 @12:04AM (#9745112)
      They're always hiring. And if you screw up a burger, it only costs the company about $0.17.
    • by 0x0d0a (568518) on Tuesday July 20, 2004 @12:11AM (#9745178) Journal
      You insufferable ass -- you just slashdotted Illinois.
    • Re:Just one (Score:5, Funny)

      by kegwell (789687) on Tuesday July 20, 2004 @01:18AM (#9745600)
      Ehh..hopefully he lives on the side of town that the paper will get delivered on because he will definitely need the classifieds to look for a new job.
    • Tribune's version (Score:5, Informative)

      by Anonymous Coward on Tuesday July 20, 2004 @01:46AM (#9745729)
      Here is the full text of the article in the Tribune:

      A story we never thought we'd print

      By James Coates
      Tribune computer columnist
      Published July 19, 2004, 6:40 PM CDT

      Nothing built by humans can go wrong in as many ways or with as nasty an outcome as a computer system.

      The people who create the Chicago Tribune started relearning that fact about 4 p.m. Sunday when they noticed that nothing was getting through as they attempted to beam the stories, artwork and ads from Tribune Tower to the Freedom Center printing plant.

      About 13 hours later, they finally started printing a 24-page version of Monday's Tribune that should have already been landing on their readers' porches.

      It was a misfortune that most people in the news business don't ever expect to experience. Newspapers do not miss days -- and Monday was close.

      The only time the Tribune failed to print was during the Great Chicago Fire of 1871. That time, the lesson was that nature can be fickle and dangerous.

      Now, the paper has learned that the same goes for the computer technology that has graced the industry with unparalleled productivity since the 1990s.

      Business computer systems are cobbled together as row upon row of workstations, each running an operating system based on an estimated 50 million lines of instructions. In turn, the worker bee desktop computers connect to the queen machines with their own millions of lines of code in a different language.

      An endless nest of wires, cables and even radio signals move instructions at light speed between the central computer and the workstations. The main computer also talks to all the peripheral devices needed to accomplish the mission.

      The peripherals can be banks of hard drives, storage bays, printers, scanners, cameras and specialty devices as diverse as a pager or a printing press several stories tall.

      The certainty that each and every one of these massively complex systems will crash haunts the people charged with keeping this thoroughly digital world up and running.

      Those people are engineers, and so they often reduce it to numbers.

      An often quoted study by Carnegie Mellon University computer scientists studied 30,000 software programs and found five to six defects per 1,000 lines of code.

      And this is for finished software sent to customers.

      When writing new programs, there is typically a defect in every 10 lines of code. About a half dozen defects per 1,000 lines remain after a process of checking, rechecking, cross checking, testing, retesting and finger crossing.

      The hubris of computing becomes clear as one realizes that each of these errors in code branch out with instructions to millions of other lines of code. Quite often, they find pathways never before taken by that particular program.

      Collisions occur on these pathways and trouble is spotted. Maybe it can be fixed or maybe technicians can only perform a "workaround" that can't be guaranteed.

      Dick Malone, the Tribune's senior vice president and general manager, said that around 9:30 a.m. on Sunday technology crews started a planned upgrade to increase the newspaper's Sun Microsystems servers from so-called 10K models to 15K machines.

      To do this, experts from the company that makes the newspaper's core Windows-based publishing software, Denmark-based CCI Europe A/S, needed to install upgrades of its Newsdesk brand software that the Tribune and other clients use.

      Malone noted that they checked and rechecked, tested and retested all day. Everything seemed to be working without a hitch. Then, they punched the button that was supposed to send all of the content for the newspaper to the printing plant.

      Nothing arrived.

      Frantic hours went by as deadline after deadline slipped while crews struggled to find a fix. Malone said he went so far as to start setting up the newspaper's pages on the art department's Macintosh desktops, hoping to get at least something printed.
  • by mfh (56) on Monday July 19, 2004 @11:39PM (#9744850) Journal
    > How Would You Handle a $1,000,000 Coding Error?

    I would have to follow Dogbert's Top Secret Management Handbook [amazon.com], and take full responsibility for the bungle. That way when the next job comes up two or three rungs above me, I'll be at the top of the list of people with actual experience with massive projects, and it won't matter that it was a colossal screw-up because I will have jumped two or three pay-grades. Corporate fall-guys, if they take it right, always end up better off than quiet behind the scenes types.

    So my advice is that you should take full responsibility and sharpen that resume, but be sure to make it known that you have learned from your mistakes and you worked hard to correct them. Nobody gets anywhere without making big blunders along the way. Be a good sport and you'll jump at least two pay grades for this blunder.
    • by Pulse_Instance (698417) on Monday July 19, 2004 @11:46PM (#9744940)
      In my experience being honest about your mistakes and having the willingness to learn from them always pays off.
      • by Black Parrot (19622) on Tuesday July 20, 2004 @02:24AM (#9745900)


        > In my experience being honest about your mistakes and having the willingness to learn from them always pays off.

        Yes, they'll just pull the lever that instantly drops your seat into the pool of piranhas, skipping those inconvenient steps where they would have to torture a confession out of you first.

    • by MrDelSarto (95771) on Tuesday July 20, 2004 @12:39AM (#9745391) Homepage
      Reminds me of that often quoted story about Thomas Watson, head of IBM, when some executive made a bad decision that ended up costing $10 million. The guy comes in and says "I suppose you'll want my resignation now" and Watson replies something like "Are you crazy! I just spent $10 million educating you!"
    • by raehl (609729) * <raehl311@[ ]oo.com ['yah' in gap]> on Tuesday July 20, 2004 @12:44AM (#9745427) Homepage
      Bad news: We missed printing half of our papers.

      Good news: Rainforest saved.
      • by mc6809e (214243) on Tuesday July 20, 2004 @01:06AM (#9745540)
        Bad news: We missed printing half of our papers.

        Good news: Rainforest saved.


        Actually, most of the wood pulp comes from trees grown in managed forests where trees are replanted to replace the old ones.

        So it's a bit like growing corn or wheat to eat.

        Strangely, we don't see many people shouting "save the corn!".

        • by killjoe (766577) on Tuesday July 20, 2004 @02:10AM (#9745812)
          Actually that's not quite true. The big paper companies do have large forests that they try to manage, but they cut trees much faster than they are being replenished. This is why there is relentless pressure to log the national forests. If the harvest from private acreage were sustainable they would never need to log the national forests.

          These days companies like Champion and Plum Creek are finding that it's more profitable to sell the logged areas than to replant them. For example in Maine [maineenvironment.org] and Montana [seeleyswanpathfinder.com].

          It's more profitable to sell land (especially waterfront land) and then log the federally subsidized national forests.

          Your tax dollars at work!
          • by cheekyboy (598084) on Tuesday July 20, 2004 @02:14AM (#9745838) Homepage Journal
            If they had a clue, they would grow 10000 acres of cannabis, which:

            A) grows 10000x faster than trees
            B) makes 10x more pulp per acre
            C) uses 100x less water.
            D) stick it to the govt.

            But would they ever do that? NOOOO, coz there are no patents in the process to exploit, and oh, the trouble of the govt wackos like Bush n old guys being so anti-cannabis (to protect their buddies' profits).

            I guess they wouldn't want 100s of potheads heading up to the 100000s of acres of weed to take a few home, but what is so wrong with that, OTOH?

            • by veg_all (22581) on Tuesday July 20, 2004 @02:42AM (#9745994)
              If they had a clue, they would grow 10000 acres of cannabis, which:

              A) grows 10000x faster than trees
              B) makes 10x more pulp per acre
              C) uses 100x less water.
              D) stick it to the govt.


              I think you forgot your "...profit" clause, except here it would say
              D) Use a bunch of arguments of dubious value to misdirect attention from the fact that what you really want is to get stoned
              ...Profit!
              • by aastanna (689180) on Tuesday July 20, 2004 @08:06AM (#9747191)
                Not to rain on your parade, but you won't get stoned off that. [gov.ab.ca]

                "Fibre hemp is an annual herbaceous plant which flourishes in temperate regions. All cultivars tested in Alberta have been low-THC (delta-9 tetrahydrocannabinol) cultivars. Canada has adopted the 0.3% THC standard established by the European Union as the concentration which separates non-psychoactive strains suitable for legal fibre production from those which are illegally grown for their properties of intoxication. The 0.3% THC designation is very conservative. Most narcotic strains range from 3-5% THC, with cleaned, high potency material reaching as high as 15% THC."
            • You forgot... (Score:5, Informative)

              by Kevin Burtch (13372) on Tuesday July 20, 2004 @03:28AM (#9746211)

              E) Pulp requires hardly any bleaching, and only a tiny fraction of the toxic chemicals that wood pulp needs for processing.

              E.1) No toxic chemicals to expensively dispose of (less pollution).

              F) Pulp requires a fraction of the processing compared to wood-pulp.

              G) Same (non-THC-producing) hemp grown for rope and clothing can be used... existing/established farming methods.

              H) Requires _much_ less fertile ground (no fertilizer) for growing... technically it _is_ a weed (not just a nickname).

              H.1) ...as a side-effect to H, will grow in much less expensive land. Heck, add water and it'll grow in a desert.

              I) Requires much less expensive processing equipment to farm (ground requires drastically less/no tilling, collection can be done with hay-baling equipment instead of heavy trucks and tree-cutting machinery, etc.).

              I'm sure I'm forgetting some.

              Note the reference to a non-THC-producing strain... I'm not into pot, but I certainly can see a phenomenal idea when I see one (seen this one many years ago).

              • Re:You forgot... (Score:5, Informative)

                by Tony Hoyle (11698) <tmh@nodomain.org> on Tuesday July 20, 2004 @05:46AM (#9746634) Homepage
                AFAIK it's mostly down to the paper industry that hemp is illegal anyway... they wanted to produce their more lucrative wood based paper (it's difficult to make a profit when your raw materials are a weed that grows anywhere very quickly.. better to standardise on a limited resource that takes 30 years to grow). Lobbyists were very powerful in the US even 50 years ago.

                The US actually managed to eradicate a weed that grew on the roadside from their shores by aggressive burning along with a demonisation campaign to try to turn people off the (then popular) drug... a bit like the 'war on terror' but with even fewer facts behind it :)

                There are many strains of non-THC-containing hemp (given that the social effects of introducing wide availability of another drug are undesirable - alcohol is bad enough). In Europe at least there are fields full of the stuff, as hemp rope and linen are still very popular. Even hemp paper is available, given it's cheap/easy to produce...

                Medical hemp (the THC kind) is grown under license and given to selected patients to treat certain conditions, although that's mostly still under trial (and is the motivation for the reclassification of cannabis possession in the UK, so that the drug companies could legally do their trials).
                • Re:You forgot... (Score:4, Insightful)

                  by jovetoo (629494) on Tuesday July 20, 2004 @07:49AM (#9747117) Journal

                  My experiences tell me cannabis is a much more desirable drug than alcohol, both from the users and society's point of view.

                  Use both drugs with some sense and nothing bad will happen. Overdo alcohol and it will make you loud and often aggressive. Overdo cannabis and you will fall asleep (which can be loud but seldom aggressive). Neither is very suitable for driving. (Although I prefer people who smoked over people who drank: they drive more relaxed.)

                  It is when talking addiction that the large difference arises. Alcohol is a hard drug that you get physically addicted to; cannabis is not. Alcohol demolishes you while it degrades you. Cannabis use over long periods is claimed to deteriorate memory. (So, don't drink to forget, smoke! ;) If you smoke the cannabis (instead of eating it) you get the same risks as with tobacco use.

                  Here in Belgium, cannabis is more or less legal now (we are allowed to carry up to 3.3 grams on the street and use it in private places and such). It is a good thing, because we did that anyway (I live about 40 kilometers from the closest cannabis shop in the Netherlands, where I can buy as much as I like legally).

                  There were no sudden changes in behaviour. No millions extra addicts, no stepping stones, nothing. The people who are inclined to (ab)use drugs usually do not care about legality.

              • by fataugie (89032) on Tuesday July 20, 2004 @07:38AM (#9747059) Homepage
                I'm sure I'm forgetting some.

                I bet I know why....

        • by Himring (646324) on Tuesday July 20, 2004 @08:37AM (#9747410) Homepage Journal
          Actually, most of the wood pulp comes from trees grown in managed forests where trees are replanted to replace the old ones. So it's a bit like growing corn or wheat to eat.

          You couldn't be more wrong. I live near a large paper mill that produces products for newspaper companies. I've lived here all my life. I've seen first hand how they rape the forests, the mountains, etc. Sure, they plant yellow pine because yellow pine grows fast and fits their purposes, but where they plant the yellow pine was once a lush hardwood forest of oaks, maples, etc. They take out the large hardwoods that provide acorns for deer and other small animals and replace them with pine, so now the pines grow unabated. The animal populations suffer. Also, any smaller hardwoods they cannot use they slash or poison so they will die. Next, since there are so many pines, we recently had a plague of pine beetles. Huge tracts of pine forest (man-made pine forests) lie in waste in the mountains, hills and along the highways here. This is partly the fault of the paper company. Also, the chemicals they use create an artificial/chemical fog that wreaks havoc. I kid you not. We had one of the largest traffic accidents in US history here some years back, where 100s of cars piled up on I-75. It made national news. I think the paper company paid off the victims' families nicely enough though. Finally, the workers in this mill are exposed to harmful chemicals such as chlorine that take a toll over time. Usually, late in life there are massive respiratory problems.

          It's easy to armchair-quarterback where your newspaper comes from, but I for one don't subscribe to anything but online sources. You should too....
    • by mdrejhon (203654) * on Tuesday July 20, 2004 @01:55AM (#9745759) Homepage
      History....one line coding error [soft.com] cost $60 million!

      AT&T Failure of January 15, 1990


      Link 1 [google.ca], Link 2 [berkeley.edu], Link 3 [soft.com]

      On January 15, 1990, 114 switching nodes of the AT&T long distance system went down. The published cause of the crash was a bug in the failure recovery code of the switches. When a node crashed, it sent an "out of service" message to the neighboring nodes, which were supposed to re-route traffic around it. However, the bug (a misplaced "break" statement in C code) caused the neighboring nodes to crash themselves upon receiving the "out of service" message, and further propagate the fault by sending an "out of service" message to nodes further out in the network.

      The crash lasted 9 hours, while programmers searched for the cause of the bug. An estimated 60 thousand people were left without telephone service, and 70 million phone calls went uncompleted. AT&T estimates at least $60 million in lost revenue and damage to its reputation; reliability was a central point in AT&T's marketing campaign against other long distance providers at the time. The incidental damage to businesses that were unable to operate due to lack of telephone service is hard to estimate, but is presumably much larger. The public safety and national security implications of such a large telephone system outage are distressing as well.

      This fault happened despite fault-tolerant design principles which were present in the phone system's design. The nodes failed fast, reporting their outage to neighboring nodes, and there was enough redundancy in the system to route around the failures. The crashed nodes recovered quickly, rebooting themselves and coming back up; however, they would immediately crash because of the messages received from neighboring nodes. The failure happened on an error-recovery path, which is poorly tested. The presence of decentralized distributed control, necessary for scaling, allowed this failure to propagate. The outage demonstrates that a bug in the software can cause a widely correlated failure.

      The possibility of a malicious attack on the system was seriously investigated as a cause for the crash. The investigation came up dry, but most sources acknowledge that this accidental fault could have just as easily been activated on purpose by a knowledgeable attacker. The social implications are investigated in detail in Bruce Sterling's The Hacker Crackdown.
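
      The "misplaced break" failure mode can be sketched in a few lines of C. This is a hypothetical illustration, not AT&T's switch code: the message types, the node struct, and both handler names are made up for the example. It only shows how a `break` written inside an `if` leaves the enclosing `switch` and skips the bookkeeping that follows.

```c
#include <stdbool.h>

/* Hypothetical message types and node state; not AT&T's actual code. */
enum msg_type { MSG_OUT_OF_SERVICE, MSG_IN_SERVICE };

struct node {
    bool peer_marked_down;  /* routing updated to avoid the peer */
    bool message_queued;    /* message deferred because the buffer was busy */
    bool state_consistent;  /* bookkeeping that must run on every path */
};

/* Buggy handler: the break inside the if exits the whole switch,
 * so the bookkeeping line after the if/else is skipped. */
void handle_buggy(struct node *n, enum msg_type t, bool buffer_busy)
{
    n->state_consistent = false;
    switch (t) {
    case MSG_OUT_OF_SERVICE:
        if (buffer_busy) {
            n->message_queued = true;
            break;                      /* leaves the switch, not just the if */
        } else {
            n->peer_marked_down = true;
        }
        n->state_consistent = true;     /* never reached when buffer_busy */
        break;
    case MSG_IN_SERVICE:
        n->peer_marked_down = false;
        n->state_consistent = true;
        break;
    }
}

/* Fixed handler: without the stray break, both branches fall through
 * to the bookkeeping line. */
void handle_fixed(struct node *n, enum msg_type t, bool buffer_busy)
{
    n->state_consistent = false;
    switch (t) {
    case MSG_OUT_OF_SERVICE:
        if (buffer_busy) {
            n->message_queued = true;
        } else {
            n->peer_marked_down = true;
        }
        n->state_consistent = true;
        break;
    case MSG_IN_SERVICE:
        n->peer_marked_down = false;
        n->state_consistent = true;
        break;
    }
}
```

      Calling handle_buggy with buffer_busy set leaves state_consistent false, while handle_fixed sets it on every path. Note that the bad path is only taken under load, which fits the point above about poorly tested error-recovery code.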
  • The scoop (Score:3, Funny)

    by SIGALRM (784769) * on Monday July 19, 2004 @11:39PM (#9744851) Journal
    Any advice for the poor schmuck who's going to get the blame?
    Yeah... you shouldn't have written:
    char buf[8];
    printf ( "Hey, what's the scoop, newsboy? " );
    gets ( buf );
    printf ( "Good one my boy, now off to the presses to publish %s!!\n", buf );

    (It pays to use Splint [splint.org])
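
    For anyone who missed the joke: gets() has no idea how big buf is, so any scoop longer than seven characters writes past the end of the array. A minimal sketch of a bounded read, assuming a helper I'm calling read_scoop (the name and signature are made up here), could use fgets() instead:

```c
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: read one line into buf without ever writing past
 * buf[size - 1]. Returns false on EOF or read error. */
bool read_scoop(char *buf, size_t size, FILE *in)
{
    if (size == 0 || fgets(buf, (int)size, in) == NULL) {
        return false;
    }
    buf[strcspn(buf, "\n")] = '\0';  /* drop the trailing newline, if any */
    return true;
}
```

    With char buf[8], read_scoop(buf, sizeof buf, stdin) keeps at most seven characters plus the terminator; the rest of an over-long line simply stays in the input stream for the next read.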
  • by Jad LaFields (607990) on Monday July 19, 2004 @11:40PM (#9744857)
    ... and blame it on Microsoft.
    • Funny you should mention that. According to the Chicago Tribune [chicagotribune.com] (subscription required),

      ...technology crews started a planned upgrade to increase the newspaper's Sun Microsystems servers from so-called 10K models to 15K machines. To do this, experts from the company that makes the newspaper's core Windows-based publishing software, Denmark-based CCI Europe A/S, needed to install upgrades of its Newsdesk brand software that the Tribune and other clients use.

      So was it Sun or Microsoft?? Or maybe Apple?

      Frantic hours went by as deadline after deadline slipped while crews struggled to find a fix. Malone said he went so far as to start setting up the newspaper's pages on the art department's Macintosh desktops, hoping to get at least something printed.
      • by TheRaven64 (641858) on Tuesday July 20, 2004 @09:17AM (#9747746) Journal
        Sounds simple to me. We don't like Sun this week, and we never like MS (well, we liked them briefly when they released the X-Box. And maybe some other times. But mostly we don't like them.*), so we can blame both MS and Sun (although MS more, because we like them less). We still like Apple, so we don't blame them at all. Except last week when we were mad at them for the whole Dashboard thing. But we like them again now (I think). Anyway, from your second quote, it sounds like the Macs were the only thing still working, so we can probably justify not blaming Apple.

        This post has been approved by the Slashdot Ministry of Truth.

  • My advice. (Score:3, Funny)

    by Anonymous Coward on Monday July 19, 2004 @11:40PM (#9744858)
    Time for plan B [sdmcdonalds.com]
  • by Gldm (600518) on Monday July 19, 2004 @11:40PM (#9744866)
    Just have each of their coders chip in a dollar, problem solved.

    *ducks*
  • by Anonymous Coward on Monday July 19, 2004 @11:40PM (#9744872)
    Anyone else think it was poor 'theodp' ??!
  • by Fubar420 (701126) on Monday July 19, 2004 @11:41PM (#9744878)
    Well, ok so that might not fly, but hey, it works when it's true if you work for a modestly forgiving employer...

    Now if the cause was insufficient testing, well then QA has to answer for it.

    And if there's no QA, well that's management's fault...

    Now if it all comes down to dumb circumstances, it's poor planning on the paper's part for not testing themselves ;-)

    That said, fess up. If worse comes to worst, you now have national infamy, and any fame is good fame, right??
    • by Soko (17987) on Tuesday July 20, 2004 @03:07AM (#9746115) Homepage
      I'm giving up moderation on this story to post this, so listen the fuck up.

      I work in newspapers, and have for the past 7 years. The blame for this fiasco should be pinned directly on the project manager. Not the coders, not the people trying to get the thing running, but the project manager. Right in the middle of his fucking forehead.

      I've torn the guts out of many newspaper networks upgrading or improving them, but never have I ever put anyone in the position of "If the new system doesn't work, we're fucked." I've always made ab-so-fucking-lutely certain there was a fallback position where the paper would hit the press. I actually had this conversation before:

      <Management weenie> What happens if this new server fails?
      <me> I haven't touched the old server. If the new one hiccups one whit, we fire up the old box and produce product.
      <Management weenie> I don't like that - we've spent a million bucks on the new gear. Delays make me look bad.
      <me> Well, if you're willing to man the phones when the advertisers call demanding re-prints of their ads because of human error somewhere, I have no problem with it.
      <Management weenie> You're an asshole. I could have you fired.
      <me> In this instance, I'm paid to be an asshole. You can't fire me for doing my job.
      <Management weenie> Heh. OK, we'll go with your plan.

      Not planning some way to get the paper on the press is dereliction of duty, and deserves your professional head to be lopped off.

      Is there _no_ professionalism anymore? Fuck, I should be paid more. Morons like that burn me - when you blow up a critical system with no backup, it's not just your livelihood on the line, but that of everyone who depends on that system functioning as needed - it's their livelihood as well. Fucking morons.

      Soko
      • by Rich0 (548339) on Tuesday July 20, 2004 @07:38AM (#9747061) Homepage
        This is simply the result of blind cost-cutting.

        "We need to reduce spending in non-core areas!" IT usually ends up being defined as non-core (unless you're an IT company).

        Suddenly management questions you if you want to buy so much as a network hub (el-cheapo consumer grade at that - not for infrastructure). You have to justify any expenditure, and so the guys on the bottom just stop asking since it is such a pain.

        I'm sure anybody on that failed project could have identified steps that would have yielded a fallback. They could have built a new server, and then switched it out with the old server and kept the old one ready to go in an emergency for a couple of weeks. But that would require a $2000 server requisition - or maybe $3000 since the corporate standard was picked by some idiot on the vendor's kickback list.

        For the guy on the bottom, they look bad for asking for money, and chances are that the fix would have worked fine with no failsafes at all - the last 15 upgrades probably did. He has to ask for money each time, and will have nothing to show for it.

        On the other hand, every person on that project was probably thinking the same thing. Sure, spending $2k is a good business decision, but upper management wouldn't recognize that, so let's just not ask. We won't point out how much we're saving on server hardware by not having backups - we'll just let our overall expenses speak for themselves and not call attention to our negligence. And then we'll get promoted year after year and if something goes wrong we just all look dumb and nobody understands computers anyway so management will just figure that these costs come up any time you use one.

        And you know what? This approach usually works in the end.

        The real responsible party is the one which made cost-cutting-at-any-cost the corporate line. Oh, sure, the corporate policies usually have exception clauses, but what bottom-rung employee is going to bother running a request up 12 links of the chain of command just to spend an extra $1000 on hardware? The opportunity to use it would pass before it ever got approved.

        The problem is the question-everything approach of corporate fiduciary management. Sure, there is waste out there, but it doesn't take many botched migrations to dwarf what you save by pinching pennies...
  • by Mmm coffee (679570) on Monday July 19, 2004 @11:41PM (#9744883) Journal
    I would go out and get so absofreakinlutely drunk that I wouldn't be able to remember my middle name, let alone that I made a $1M error. And then when the lawsuits were about to go to court and I started showing signs of severe alcoholism, I would put my head in between my legs and kiss my ass goodbye. 'Cause man, that would really suck.

    Well, you asked.
  • Advice? (Score:3, Funny)

    by quantaman (517394) on Monday July 19, 2004 @11:42PM (#9744885)
    Any advice for the poor schmuck who's going to get the blame?

    Well my first advice is to come clean, yes I mean you theodp, I think we all know who this poor schmuck is ;)
  • Testing? (Score:5, Insightful)

    by buff_pilot (221119) on Monday July 19, 2004 @11:42PM (#9744891) Homepage
    Where was the pre-install testing?

    A good test should have identified some errors, especially if it blew up IMMEDIATELY.
    • Deployment? (Score:5, Insightful)

      by BiggerIsBetter (682164) on Monday July 19, 2004 @11:52PM (#9744999)
      Where was the phased or parallel deployment?

      You don't just change a system like that in a weekend. There WILL be problems, so you have to have ways of dealing with them. Maybe that means flicking the switch back to the old system if it fails, or maybe it means running with degraded capacity a while, but whatever it is, dead-in-the-water is not your Plan B.
      • Re:Deployment? (Score:5, Insightful)

        by mec (14700) <mec@shout.net> on Tuesday July 20, 2004 @01:48AM (#9745737) Journal
        Where was the phased or parallel deployment?

        Probably in the hands of someone who decided:

        (1) Cost of catastrophe: $1,000,000.
        (2) Chance of catastrophe: 5%
        (3) Cost of setting up parallel system, including hardware, software licenses, system administration: $250,000.

        If (1) times (2) is less than (3), then it's actually better not to spend the money on (3).

        Of course you can argue with the actual numbers in (1) (2) (3). (1) is the Tribune's own estimate. (2) is estimable by looking at the history of past projects, I'm just guessing 5%. And I just pulled (3) out of the air.

        That said, I bet they do have degraded capacity, and that they used it to print half their papers on Monday and all their papers on Tuesday.
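
        To put numbers on that rule: with the guesses above, the expected loss is $1,000,000 x 5% = $50,000, which is well under the $250,000 parallel system. A minimal sketch of the comparison, with function names I made up and the same guessed inputs (nothing here is the Tribune's actual accounting):

```c
#include <stdbool.h>

/* Expected loss = cost of the catastrophe times its probability. */
double expected_loss(double cost_of_failure, double probability)
{
    return cost_of_failure * probability;
}

/* Hypothetical decision rule from the comment above: the safeguard is
 * worth buying only if the expected loss exceeds what it costs. */
bool mitigation_pays_off(double cost_of_failure, double probability,
                         double mitigation_cost)
{
    return expected_loss(cost_of_failure, probability) > mitigation_cost;
}
```

        mitigation_pays_off(1000000.0, 0.05, 250000.0) comes out false, so on these numbers skipping the parallel system looks rational. Push the probability above 25%, or the cost of a missed edition above $5,000,000, and the answer flips, which is why the argument is really about the inputs.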
    • planning? (Score:5, Insightful)

      by twitter (104583) on Monday July 19, 2004 @11:56PM (#9745027) Homepage Journal
      A good test should have identified some errors, especially if it blew up IMMEDIATELY.

      Good planning would have had an abort procedure, so the show would go on. Everything changed should be undone if it did not work. They could figure it out after the paper was printed.

      Errors are inevitable. Good planning and implementation keep you from falling on your face even when you publish seven days a week. It's not the coder's fault.
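
      The "undo everything if it did not work" idea can be sketched in C as a list of steps that each carry their own undo. This is only an illustration under my own assumptions: the step struct, run_with_rollback, and the toy step functions are invented names, and a real upgrade would have far messier steps than a counter.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical upgrade step: apply() returns false on failure,
 * undo() reverses a step that was applied successfully. */
struct step {
    bool (*apply)(void *ctx);
    void (*undo)(void *ctx);
};

/* Apply steps in order. If one fails, undo the steps that already
 * succeeded, in reverse order, and report failure. */
bool run_with_rollback(const struct step *steps, size_t count, void *ctx)
{
    for (size_t i = 0; i < count; i++) {
        if (!steps[i].apply(ctx)) {
            while (i > 0) {
                i--;
                steps[i].undo(ctx);
            }
            return false;
        }
    }
    return true;
}

/* Toy steps for illustration: ctx points to an int counting applied changes. */
bool step_ok(void *ctx)   { (*(int *)ctx)++; return true; }
bool step_fail(void *ctx) { (void)ctx; return false; }
void step_undo(void *ctx) { (*(int *)ctx)--; }
```

      If the third of three steps fails, the first two are undone and the counter is back at zero, so the old system is what goes to press that night and the debugging can wait until morning.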

      • Re:planning? (Score:5, Interesting)

        by geekoid (135745) <dadinportlandNO@SPAMyahoo.com> on Tuesday July 20, 2004 @03:18AM (#9746172) Homepage Journal
        exactly.
        I cannot count the number of battles I have fought just to get some time to design an emergency rollback plan.
        I wish I had more balls to jump up in an emergency meeting and scream "I TOLD YOU TO GIVE ME A FEW DAYS SO I COULD DESIGN A ROLLBACK PLAN, ASSHOLE. BUT NOW ALL THE DATA IS CORRUPTED, AND WE CAN'T DO ANYTHING ABOUT IT BECAUSE OF YOU!!"

        Instead, I just keep a copy of the emails where I made the request and was denied, and then forward them to the CTO.
    • Re:Testing? (Score:5, Insightful)

      by ryen (684684) on Monday July 19, 2004 @11:58PM (#9745058)
      I agree.
      Blame the project manager (hopefully there was one) who should have led thorough testing of the services before deployment. Individual coders shouldn't be held to any legal liability.
      Any legal action should be directed towards the 'outside provider' (as noted in the article).
    • by Bill, Shooter of Bul (629286) on Tuesday July 20, 2004 @12:59AM (#9745512) Journal
      I'm willing to bet there will be an opening for IT manager soon.
  • Re-engineer (Score:3, Funny)

    by OmegaGeek (586893) <robwall@gmail.RABBITcom minus herbivore> on Monday July 19, 2004 @11:43PM (#9744893) Homepage
    That isn't a bug - it's a feature!
  • Uptime (Score:4, Funny)

    by FiberOpPraise (607416) on Monday July 19, 2004 @11:43PM (#9744904) Homepage
    23:44:03 up 48545 days, 6:15, 1 user, load average: 0.00, 0.00, 0.00
    Blink.
    up 0 days, 1:00, 1 user, load average: 0.00, 0.00, 0.00
    I hope they got a SS of that massive uptime.
  • by justanyone (308934) on Monday July 19, 2004 @11:45PM (#9744919) Homepage Journal
    I didn't get my paper this morning and was angry until I read this.

    I'm not angry anymore; I'm sympathetic toward the poor schmuck as well as all the customer service people who probably got yelled at this morning.

    -- Kevin J. Rice
  • by Anonymous Coward on Monday July 19, 2004 @11:45PM (#9744929)
    Management frequently makes mistakes which cost much more. The difference is that their mistakes are not as easily identified or attributed to a single person.

    The culprit should just admit it. Shit happens, it's unavoidable even if you take all precautions. Don't make the same mistake again, though.
  • by ejaw5 (570071) on Monday July 19, 2004 @11:47PM (#9744944)
    LIMITED LIABILITY
    Software provided as-is. Software developer/company is not liable for any physical, financial, or any other loss or damage arising from use of software.

    Doesn't all software come with things like this? (nevertheless, thank goodness I'm not a software developer)
  • My advice (Score:5, Funny)

    by baywulf (214371) on Monday July 19, 2004 @11:48PM (#9744954)
    "Any advice for the poor schmuck who's going to get the blame?"

    My advice: Prepare three envelopes
    • Re:My advice (Score:5, Informative)

      by harikiri (211017) on Tuesday July 20, 2004 @12:49AM (#9745457)
      If you're referring to the quote from Traffic [imdb.com] - the quote in full refers to two letters (not three):

      GENERAL LANDRY
      When Khrushchev was forced out, he sat
      down and wrote two letters and handed
      them to his successor. He said "When
      you get into a situation you can't
      get out of, open the first letter
      and you'll be saved. And when you
      get into another situation you can't
      get out of, open the second." Soon
      enough this guy found himself in a
      tight place. So he opened the first
      letter. It said, "Blame everything
      on me." So he blamed the old guy
      and it worked like a charm.
      (beat)
      He got into another situation he
      couldn't get out of, so he opened
      the second letter, which read, "Sit
      down and write two letters."

      They stare at each other a beat. Then Landry smiles.
  • by C60 (546704) * <salad@carbon3.1415960.net minus pi> on Monday July 19, 2004 @11:48PM (#9744958) Homepage
    Change your name, and switch to a "skills" based resume rather than an experience based one...
  • by WarMonkey (721558) on Monday July 19, 2004 @11:48PM (#9744961)
    And this is why you don't use an Access database for a job like this.
  • by Jayfar (630313) on Monday July 19, 2004 @11:49PM (#9744964)
    Any advice for the poor schmuck who's going to get the blame?

    Down, not across. (motto of alt.sysadmin.recovery referring to best method of slashing one's wrists).

  • by herrvinny (698679) on Monday July 19, 2004 @11:50PM (#9744971)
  • by David Frankenstein (21337) on Monday July 19, 2004 @11:50PM (#9744977)
    With any large rollout, if only one person is at fault for a fiasco like this, then the project was mismanaged. They should have had a plan in place to back out the change.
  • by multipartmixed (163409) * on Monday July 19, 2004 @11:52PM (#9744994) Homepage
    Well, if I was in management.. I would find the programmer responsible, and have him snipped!
  • Fix it. (Score:5, Interesting)

    by wideBlueSkies (618979) * on Monday July 19, 2004 @11:52PM (#9744996) Journal
    Simple enough.

    Take responsibility and ownership of the problem. Don't make excuses, but give real reasons.

    Fix it..do whatever it takes, even if it means working over a weekend.

    Write a good post mortem, explaining how the fix is different from the original problem.

    And hope to god that your management is understanding enough to keep you on.

    This is coming from a guy who, in 1997, blew a $100,000 test weekend by kicking off the systems tests after loading the wrong generation of tapes.

    I took the blame, and expected to lose my job. But I knew that the right thing to do was to try to recover from the problem. I stayed in the office from 1:00AM Sunday to 10:00AM Monday morning rerunning every job and report and proving out the results.

    Not only did I keep my job, but I got promoted a year later. I made a name for myself that weekend....sure I could f*k up, but I work hard to keep things right for the company.

    wbs.
  • I've had coworkers who made major bugs that crashed servers and workstations and caused a lot of downtime. This is because they wrote sloppy code in a hurry and never bothered to check it. Management usually wants faster turnaround time on projects.

    So your choices:

    Plan A: Blame managers for forcing you to work under stressful conditions that lead to a workplace hazard (stress) that caused you to make the error. Cite that you had to work a lot of overtime and the lack of breaks and sleep caused you to miss a major bug.

    Plan B: find someone like me who takes their time coding and have them look over the code and fix the problem for you. Sometimes another pair of eyes helps to find things you've missed.

    Plan C:
    Go to work in flip-flops, a Hawaiian shirt, sunglasses and tell everyone you are on vacation. Make Pacman noises, and talk to your invisible friends. Claim insanity and see if that works.

    Plan D:
    Start looking for another job ASAP.
  • UAT/QA anyone? (Score:5, Insightful)

    by bwy (726112) on Monday July 19, 2004 @11:55PM (#9745026)
    One of the benefits of working for a big company is a QA/UAT department. You have an entire department of people lined up just to test your shit. And, usually this type of job makes a person very anal. They log defects for just about everything.

    The person writing the code can unit test to his or her best ability, but it is really the job of someone else to put it through the wringer testing thousands of simulated real-world scenarios. Sure, a coder could do this testing. But a QA guy or gal is doing really well if he makes 3/4 the salary of the guy who wrote the code, so a division of labor only makes sense.

    Not to mention the person writing the code makes the worst tester in the world. You only test it the way you THOUGHT people would use it. So, while a coder is perhaps the one who created the original problem, the real fault is in whoever let this slip through to production. Assuming, of course, that it wasn't some kind of time-bomb easter egg that would have been impossible to test. Although, good QA testers should alter their system date/time when testing date sensitive routines.
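
    One way to test date-sensitive routines without touching the system clock is to pass the clock in. This is only a sketch: `edition_label` is a made-up routine, not anything from the Tribune's software, and the dates are just awkward cases a tester might pick.

```python
from datetime import date
from typing import Callable


def edition_label(today: Callable[[], date] = date.today) -> str:
    """Hypothetical date-sensitive routine: build the label for today's edition."""
    d = today()
    kind = "Sunday" if d.weekday() == 6 else "daily"  # weekday() == 6 means Sunday
    return f"{d.isoformat()} {kind} edition"


# A tester can feed in awkward dates instead of changing the machine's clock.
leap_day = edition_label(lambda: date(2004, 2, 29))
year_end = edition_label(lambda: date(2004, 12, 31))
print(leap_day)
print(year_end)
```

    Production code calls `edition_label()` with no argument and gets the real date; tests supply whatever date they want to probe.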
  • My advice (Score:5, Funny)

    by GISGEOLOGYGEEK (708023) on Monday July 19, 2004 @11:56PM (#9745029)
    Send the coder to the Open Source world because no one is going to pay him to code anymore.

    And send his supervisor too for not testing the system properly before trying to roll it out.
  • "angry or confused" (Score:3, Informative)

    by Kris_J (10111) * on Monday July 19, 2004 @11:57PM (#9745039) Journal
    By mid-day, the paper had received more than 40,000 phone calls from angry or confused subscribers.
    My, some people get worked up easily. I bet there was a message on the automated phone system explaining that there had been a technical error and some papers hadn't been delivered. I can't imagine I would have needed to lodge a complaint, speak to a human or get angry.

    Mind you, here in Perth we only have one daily newspaper and it sucks, so I can't imagine getting worked up about a failed delivery.

  • by YouHaveSnail (202852) on Monday July 19, 2004 @11:57PM (#9745040)
    How Would You Handle a $1,000,000 Coding Error?

    Frankly, I can't believe anyone would pay $1M for a coding error. Hell, the guys I work with make coding errors all the time, and practically for free!

    (That's free, as in beer.)
  • by tftp (111690) on Monday July 19, 2004 @11:57PM (#9745041) Homepage
    I don't think any specific programmer will be blamed for that, and I don't think the phrase "coding error" really reflects what happened. It's more likely just a popular explanation like "the computer crashed".

    No one [in their right mind] orders a brand new paper publishing system from a single consultant. The software probably was priced at several million dollars. Somewhere between the components something broke. For example, the file format that the publisher produced was rev. 2.1, but the software at the presses side was only aware of rev. 1.7 and below... If the coder only tested his code with the "other" piece of latest revision, he would never see any problem; and it is not his fault that in real life the real customer uses some obsolete stuff that isn't compatible...

    This kind of problem is clearly of administrative nature, of a system design and of checking which pieces work with which other pieces. Clearly, blame should be assigned to non-existent QA procedures, insufficient unit testing and [obviously] inadequate integration of components. The coder is nowhere here, it's all system design and QA stuff, realm of managers.
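
    The rev. 2.1 versus rev. 1.7 example is the kind of mismatch an integration check can catch before anything reaches the presses. A minimal sketch, assuming revisions are plain dotted numbers; `parse_rev` and `can_read` are hypothetical helpers, not part of any real publishing system.

```python
from typing import Tuple


def parse_rev(rev: str) -> Tuple[int, ...]:
    """Turn a dotted revision such as '2.1' into (2, 1) so it compares numerically."""
    return tuple(int(part) for part in rev.split("."))


def can_read(file_rev: str, max_supported_rev: str) -> bool:
    """True if a consumer that understands up to max_supported_rev can read file_rev."""
    return parse_rev(file_rev) <= parse_rev(max_supported_rev)


print(can_read("2.1", "1.7"))   # False: the presses cannot read the new format
print(can_read("1.7", "1.7"))   # True
print(can_read("1.10", "1.7"))  # False: compared as numbers, 1.10 is newer than 1.7
```

    Comparing tuples of integers rather than raw strings matters for the last case, where a text comparison would wrongly rank "1.10" below "1.7".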

  • by John Whorfin (19968) on Monday July 19, 2004 @11:57PM (#9745042) Homepage
    I'm a programmer for a large, (US) national newspaper chain and screwing up the publication cycle is somewhat more common than you might think.

    Most daily newspapers produce various editions, between two and four, and I've seen a couple of times where only one edition is printed due to "coding errors" (like the 1 billion seconds from the epoch thing - my personal favorite).

    Of course the vendor had to be called at the $500/hour emergency rate to fix their own error.

    Once I saw a print pre-processor go off line because /dev/null was deleted and the backup system had been down for 6 mos. and take out $50,000 - $100,000 in advertising.

    They call daily newspapers "the daily miracle" and when you look at some of the computer band-aids they have producing them, you can see why.
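
    For anyone who missed the billion-seconds rollover: Unix time went from nine digits to ten in September 2001. The comment does not say how the vendor's code broke, so the text-sorting failure below is only an assumed example of that class of bug.

```python
from datetime import datetime, timezone

# The moment Unix time reached 1,000,000,000 seconds.
rollover = datetime.fromtimestamp(10**9, tz=timezone.utc)
print(rollover.isoformat())  # 2001-09-09T01:46:40+00:00

# Timestamps stored as text stop sorting correctly once they gain a digit.
before, after = str(10**9 - 1), str(10**9)
as_text = sorted([before, after])
as_numbers = sorted([before, after], key=int)
print(as_text)     # ['1000000000', '999999999']
print(as_numbers)  # ['999999999', '1000000000']
```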
    • by prockcore (543967) on Tuesday July 20, 2004 @01:25AM (#9745639)
      I'm a programmer for a large, (US) national newspaper chain and screwing up the publication cycle is somewhat more common than you might think

      We had a reporter screw up and drag a folder into the trash instead of the volume it was in (MacOS is absofuckinglutely retarded for having you unmount volumes by dragging them to the trash).

      He went on with his business, and then around 5pm he emptied the trash. He suspected something was wrong when it was taking over 5 minutes to empty the trash.

      Turns out the folder he trashed contained *all* the quark documents for the paper (the next day's stories and advance stories).

      While there were backups, some people had to scramble to rewrite their stories. Paper was a little light the next day.

      That's the problem with OS9 and OSX. The users need permission to delete stories in order to have permission to modify stories.
      • by AnEmbodiedMind (612071) on Tuesday July 20, 2004 @06:57AM (#9746897)
        From the OS X man page for "sticky (8)":

        NAME
        sticky - sticky text and append-only directories

        DESCRIPTION
        A special file mode, called the sticky bit (mode S_ISVTX), is used to
        indicate special treatment for shareable executable files and directo-
        ries. See chmod(2) or the file /usr/include/sys/stat.h for an explana-
        tion of file modes.

        STICKY DIRECTORIES
        A directory whose `sticky bit' is set becomes an append-only directory,
        or, more accurately, a directory in which the deletion of files is
        restricted. A file in a sticky directory may only be removed or renamed
        by a user if the user has write permission for the directory and the user
        is the owner of the file, the owner of the directory, or the super-user.
        This feature is usefully applied to directories such as /tmp which must
        be publicly writable but should deny users the license to arbitrarily
        delete or rename each others' files.

        Any user may create a sticky directory. See chmod(1) for details about
        modifying file modes.
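
        A short sketch of setting that bit from Python on a POSIX system. The temporary directory is a stand-in for a shared stories folder; whether this fits a Quark workflow on a file server is not something this thread establishes.

```python
import os
import stat
import tempfile

# Stand-in for the shared folder; a real one would already exist on the server.
shared = tempfile.mkdtemp()

# rwx for everyone plus the sticky bit, i.e. mode 1777 (the same mode /tmp uses).
os.chmod(shared, stat.S_IRWXU | stat.S_IRWXG | stat.S_IRWXO | stat.S_ISVTX)

mode = os.stat(shared).st_mode
print(oct(stat.S_IMODE(mode)))    # 0o1777
print(bool(mode & stat.S_ISVTX))  # True

os.rmdir(shared)
```

        With the bit set, users can still create files in the folder and edit files they have write permission on, but they can remove or rename only the files they own.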
  • by TheTXLibra (781128) on Tuesday July 20, 2004 @12:02AM (#9745092) Homepage Journal
    True story. I was working an assignment as a tester for Microsoft. I apologize for the use of variables, rather than names, but I don't want to get sued for breaking NDA. There was a deadline on the release, and if we missed it, there was a penalty of $1 per copy shipped. 20 million copies were due to be shipped on date X. The day of date "X", we realize there's a fatal bug that causes Product "Y" to crash after running any segment that lasts longer than "Z" minutes. Somehow, I'd completely missed this bug. I have no idea how, don't ask, but I completely missed it. We even checked back 3 months worth of revs...the bug was still there in each one. Of course, the product was late, costing Microsoft a whopping $20 million. What did I do?

    I was "allowed" to resign gracefully, quietly, and have learned a valuable lesson about software testing: It's not whether you miss something, it's whether or not someone else will find it in time to cost you your job. (nods sagely)

  • by pyrrhonist (701154) on Tuesday July 20, 2004 @12:12AM (#9745185)
    How Would You Handle a $1,000,000 Coding Error?

    As long as I keep checking in my code as someone else, I won't have to.

  • by goon (2774) <goonmail@nOSpAM.netspace.net.au> on Tuesday July 20, 2004 @12:13AM (#9745195) Homepage Journal
    In its 158 years, the Tribune failed to publish only at the time the Great Chicago Fire was destroying much of the city.

    So the paper can deliver every day for 158 yrs using mechanical printing presses ~ except where natural disasters occur ....

    The printing problems at the Chicago Tribune were related to efforts to upgrade computer equipment used to produce the newspaper, Malone said. The Tribune acquired customized software for the upgrade from an outside provider, and it contained a "coding error," he said.

    but as soon as computers are involved their printing press has morphed into a computer system. I wonder what provisions to *test* the upgrade before use were made?

    fail to recognise newspaper as computer system?

    it would be easy to blame the developers and company and there should be some recognition of responsibility for technical accuracy. but what about the newspaper. they have made a fundamental mistake in not recognising that printing press + computer = computer and let their newspaper system fail at the mercy of coding mistake.

    It seems while the paper can handle *mechanical* failure (158 yrs, 1 non delivery) it has yet to grasp *software* failure.

    • by Sycraft-fu (314770) on Tuesday July 20, 2004 @12:43AM (#9745424)
      It truly is a sight to see. The speed at which they print is fantastic. A minimum run on many of them is 20,000 copies; in the time it takes to spin up and spin down, that many will have come off.

      This is necessary too, if we wish to efficiently print the massive quantity we desire. There are a lot of daily newspapers. Even in my small city there are at least 8 I know of. An old mechanical press simply wouldn't be able to keep up. Never mind printing speed or anything else, setup time was a bitch. You had to have plates made to stamp your text on the page. These then had to be loaded and calibrated for each run that was to be done.

      Now it's all electronic. At the minimum, you place the reference prints under a camera, and normally the layout files themselves are loaded in to the press. It then can go to work right away.

      I know it's kind of retro-geek cool to bag on how much harder technology makes everything and how much better it was in "The good ol' days" but that's not usually the case. Old mechanical presses simply cannot compete with the speed of computerised presses, which are necessary to operate with the speed and efficiency that is demanded today.
  • Where I work... (Score:4, Insightful)

    by bmo (77928) on Tuesday July 20, 2004 @12:17AM (#9745231)
    It doesn't make any difference if it's a broken punch or a whole set of dies cracked down the middle ($4000 for a 6 inch section, over 60 inches you do the math)...

    "If you say 'oops', it's OK."

    Did he say Oops?

    Seriously though...shit happens. That's why you don't bill employees directly for the mistakes they make. Suck it up, learn, and move on.

    --
    BMO
  • Bah (Score:5, Interesting)

    by Sandman1971 (516283) on Tuesday July 20, 2004 @12:18AM (#9745238) Homepage Journal
    Bah, this is absolutely nothing compared to the coding error that brought down Canada's Royal Bank last month, leaving millions of customers without paychecks, access to their accounts, etc.... And this too was attributed to human error [globetechnology.com], but had far more drastic repercussions than not getting your morning paper, and cost RBC a heck of a lot more than a million dollars.
  • Testing is Boring (Score:5, Insightful)

    by PingPongBoy (303994) on Tuesday July 20, 2004 @12:48AM (#9745455)
    Software testing is boring boring boring. You have to try things out again and again after each change. Modules that haven't changed gain confidence in the face of changes and might not be tested, but omitting tests can end up being the Achilles heel. There can be an overwhelming desire when a project nears completion to just get things done and over with. After all the hard problems may well be solved and it's all down to seemingly inconsequential details.

    These days programmers have a Sword of Damocles hanging over them. Once they finish a major piece of code they may have a hard time finding new work. The economy has not lived up to forecasts of more jobs. Outsourcing has reduced computer opportunities. Management of many companies do not see new uses for computers. Off-the-shelf programs abound for almost every aspect of computerized work.

    Stress may distract software engineers enough that someone will make a major mistake.
  • Been there (Score:5, Interesting)

    by Inthewire (521207) on Tuesday July 20, 2004 @01:21AM (#9745611)
    I write software for a company that handles $45,000,000+ of client cash every week.
    A mistake I made in May (discovered this very day, by yours truly) had backed up about $400,000 per week.

    Did I get stomped?
    No.

    A bottleneck had been identified, repaired, and eliminated!
    Behold the power of positive thinking.

  • $1mil is nothing (Score:4, Insightful)

    by lkaos (187507) <anthony.codemonkey@ws> on Tuesday July 20, 2004 @01:38AM (#9745701) Homepage Journal
    It's never an individual's fault. It's a breakdown in the QA/FVT/review structure. Is it the fault of the person who coded it? Is it the team that reviewed the code? Is it the author of the FVT tests? Is it the person in charge of QA?

    What's that you say, this is all the same person? No wonder you had the bug to begin with...
  • by geekoid (135745) <dadinportlandNO@SPAMyahoo.com> on Tuesday July 20, 2004 @03:03AM (#9746097) Homepage Journal
    Grab all the eMails where someone in management told you to cut a corner, or replied that they didn't want to spend too much time designing, or authorized fewer QA hours than should have been done, and print it all out, with headers, and forward it all to another account.

    When they come after you, present it as if you were trying to do it right, but somebody wouldn't let you.
    If they fire you, sue.

    Unless:
    a) you work for one of the few companies that actually supports a real team atmosphere, or

    b) Everything was done by the book, and you still screwed up.

    When someone in an industrial field is forced to work 16 hours a day, 7 days a week, and makes a mistake, the company suffers the ramifications, not the worker (or the worker's family).
  • by carldot67 (678632) on Tuesday July 20, 2004 @03:51AM (#9746294)
    "The poor schmuck" will, in my experience, have spent the last 18 months hearing phrases like:

    "Time / Quality / Functionality: Choose Two"
    "You can't test quality into a system"
    "Measure twice, cut once"
    "We need to parallel run the UT system"
    "Engineers shouldn't be testing their own code!"
    "I wouldn't be using NT for that, mate"

    and so on.

    These are the words technical people use to warn management of impending doom. Managers on the other hand have other things to worry about like delivery dates, sales, penalty ratchets and so on. When the "go" decision was made it will have been made by senior managers who get paid the big bucks to take the big decisions and the big sh*t when it all goes pear shaped.

    The question is how the management handled mitigation by way of backups to manual processing, rollbacks to the old system or risk analysis during project planning.
    Automation of an entire printing plant is a big job and it is probable they planned for a failure as a worst case scenario and will just put the 1M loss down to experience.
  • by Scud (1607) on Tuesday July 20, 2004 @05:46AM (#9746635)
    Which time? I'm the guy who (unintentionally) wrecked the first Saturn ever wrecked (job #65). Since then I've wrecked one other (job 2 million and something), so my track record isn't that bad :)

    Most of the time you don't actually break something (be it product or be it equipment), but fixing the bug and getting everything rolling again takes time.

    And since the "value" of the product that is running on the line is about $5000 a minute, time is indeed money.

    I've probably had a couple 1+ hour breakdowns, but this doesn't even compare to the time my buddy's plant went down for three days x 2 shifts per day ($14M).
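
    The $14M figure is consistent with the $5000 a minute quoted above if each shift runs eight hours. The shift length is an assumption; the comment does not state it.

```python
VALUE_PER_MINUTE = 5_000   # dollars per minute, from the comment
DAYS = 3                   # from the comment
SHIFTS_PER_DAY = 2         # from the comment
HOURS_PER_SHIFT = 8        # assumption, not stated in the comment

lost = VALUE_PER_MINUTE * 60 * HOURS_PER_SHIFT * SHIFTS_PER_DAY * DAYS
print(f"${lost:,}")  # $14,400,000
```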

    They were Lear-jetting parts in on a daily basis (they kept blowing up the new stuff and didn't seem to have the sense to order spares). Ron would show up at the service entrance at the airport to pick them up and it got to the point where the guys would just open the gates when he drove up :)

    My most recent one was when we changed the line speed of the skillet line and the thumbwheel switch messed up and opened up the 8's bit in the ten's digit (faulty thumbwheel switch) so that instead of running at 42 jobs an hour it was trying to run at 80 JPH (it would have tried to run at 122 but it's limited in the software to 80 JPH)
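
    The 42 to 122 jump falls out of how a two-digit BCD thumbwheel is read. A sketch of the arithmetic, assuming the failed contact reads as a 1 on the 8's bit of the ten's digit and that the decoder does not reject digits above 9; the function names are made up.

```python
MAX_JPH = 80  # software limit mentioned in the comment


def decode_bcd(tens_bits: int, ones_bits: int) -> int:
    """Decode two 4-bit thumbwheel digits without checking that each is 0-9."""
    return 10 * tens_bits + ones_bits


def line_speed(tens_bits: int, ones_bits: int) -> int:
    """Requested speed after the software clamp."""
    return min(decode_bcd(tens_bits, ones_bits), MAX_JPH)


good_tens, ones = 0b0100, 0b0010  # digits 4 and 2, i.e. 42 JPH
bad_tens = good_tens | 0b1000     # 8's bit stuck on: the digit now reads as 12

print(decode_bcd(good_tens, ones))  # 42
print(decode_bcd(bad_tens, ones))   # 122
print(line_speed(bad_tens, ones))   # 80
```

    Rejecting any digit above 9 would have flagged the bad reading instead of leaving the speed clamp as the only line of defense.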

    Zoom zoom.

    Oh wait, that's the other guys :)

    John

  • by Anonymous Coward on Tuesday July 20, 2004 @06:41AM (#9746824)
    I have worked as a system administrator for a newspaper for 7 years. 5 years ago we were outsourced to another company; my job stayed the same (save for the extra work needed), but the decision paths and cost terms have changed a lot. More management, less money, cutting corners, and less contact with customers have actually led to an increase in costs of 25% for the newspaper.

    For 5 years we have worked on cutting costs instead of doing what we originally did; produce a newspaper. This has led to a lot of cut corners, patchy systems and above all stupid decisions. Now we have to spend most of our time with our hands tied behind our backs because there's no way to prove a _direct_ profit we can put on the price-tag we show to a (non-technical) customer when we are suggesting a change. It's always cost > functionality.

    A company that only sells services to customers has no goal, and that does not work. There has to be something you produce, something to live for instead of just being a money-making machine.

    Management cannot be just management to be management. A good manager is someone involved, working with something they have a passion for. My boss didn't create this newspaper, nor did the boss of the actual newspaper, and they probably don't have a special interest in media; it's just a career-pushing, money-making machine for them.

    Oh, I guess this turned into a rant :)

