How Would You Handle a $1,000,000 Coding Error?
theodp writes "The Chicago Tribune's efforts to upgrade its computer system over the weekend turned into a fiasco when the system crashed, halting all printing operations and leaving about half of the Trib's subscribers without papers. The software contained 'a coding error,' according to a spokesman who estimated the cost to resolve the problem at 'under $1 million.' Any advice for the poor schmuck who's going to get the blame?"
The Coder? Nothing... (Score:1, Insightful)
It's my first week! (Score:5, Insightful)
Now if the cause was insufficient testing, well then QA has to answer for it.
And if there's no QA, well that's management's fault...
Now if it all comes down to dumb circumstances, it's poor planning on the paper's part for not testing it themselves
That said, fess up, worst comes to worst, you now have national infamy, and any fame is good fame, right??
Testing? (Score:5, Insightful)
A good test should have identified some errors, especially if it blew up IMMEDIATELY.
Always blame the coder... (Score:2, Insightful)
Toss his newspaper subscription and egg his car. Other than that, leave the poor geek alone.
How many people here have fucked LILO into the ground the night before a java assignment on a laptop with no floppy? anyone?
yeah. i thought as much.
1 million is not that much (Score:5, Insightful)
The culprit should just admit it. Shit happens, it's unavoidable even if you take all precautions. Don't make the same mistake again, though.
Re:Dogbert Strategy (Score:5, Insightful)
No one person should be at fault (Score:5, Insightful)
Deployment? (Score:5, Insightful)
You don't just change a system like this in a weekend. There WILL be problems, so you have to have ways of dealing with them. Maybe that means flicking the switch back to the old system if it fails, or maybe it means running with degraded capacity for a while, but whatever it is, dead-in-the-water is not your Plan B.
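A minimal sketch of the "flick the switch back" idea, in Python. Everything here is hypothetical: `cut_over`, `start_new`, `health_check`, and `restore_old` are illustrative names, not anything the Tribune or its vendor is known to have used. The only point is that the fallback path is written down before the cutover starts.

```python
from typing import Callable


def cut_over(start_new: Callable[[], None],
             health_check: Callable[[], bool],
             restore_old: Callable[[], None]) -> str:
    """Try the new system; fall back to the old one if anything goes wrong.

    Returns "new" if the new system started and passed its health check,
    otherwise restores the old system and returns "old".
    """
    try:
        start_new()
        if health_check():
            return "new"
    except Exception:
        # Any failure while starting or checking the new system is
        # treated the same as a failed health check.
        pass
    restore_old()
    return "old"
```

The design choice is that failure has exactly one outcome (the old system is running again), so nobody has to improvise a Plan B at press time.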
UAT/QA anyone? (Score:5, Insightful)
The person writing the code can unit test to his or her best ability, but it is really the job of someone else to put it through the wringer, testing thousands of simulated real-world scenarios. Sure, a coder could do this testing. But a QA guy or gal is doing really well if he makes 3/4 the salary of the guy who wrote the code, so a division of labor only makes sense.
Not to mention the person writing the code makes the worst tester in the world. You only test it the way you THOUGHT people would use it. So, while a coder is perhaps the one who created the original problem, the real fault is in whoever let this slip through to production. Assuming, of course, that it wasn't some kind of time-bomb easter egg that would have been impossible to test. Although, good QA testers should alter their system date/time when testing date sensitive routines.
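One way to get the effect of altering the system date without touching the machine's clock is to pass the date into the routine, so a tester can feed it any day they like. A rough sketch in Python; `is_sunday_edition`, `pages_to_print`, and the page counts are made up for illustration.

```python
from datetime import date


def is_sunday_edition(today: date) -> bool:
    """Hypothetical date-sensitive rule: Sunday gets the big edition."""
    return today.weekday() == 6  # date.weekday(): Monday is 0, Sunday is 6


def pages_to_print(today: date) -> int:
    """Hypothetical page counts, chosen only to make the branch visible."""
    return 120 if is_sunday_edition(today) else 48


if __name__ == "__main__":
    # The tester picks the awkward dates instead of waiting for them.
    for d in (date(2004, 7, 18), date(2004, 7, 19), date(2004, 2, 29)):
        print(d.isoformat(), pages_to_print(d))
```

Because the date is a parameter, the leap day and the weekend cases can be exercised on any Tuesday afternoon.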
planning? (Score:5, Insightful)
Good planning would have had an abort procedure, so the show would go on. Everything changed should be undone if it did not work. They could figure it out after the paper was printed.
Errors are inevitable. Good planning and implementation keep you from falling on your face even when you publish seven days a week. It's not the coder's fault.
Nothing to see here (Score:5, Insightful)
No one [in their right mind] orders a brand new paper publishing system from a single consultant. The software was probably priced at several million dollars. Somewhere between the components something broke. For example, the file format that the publisher produced was rev. 2.1, but the software at the presses' side was only aware of rev. 1.7 and below... If the coder only tested his code with the "other" piece of the latest revision, he would never see any problem; and it is not his fault that in real life the real customer uses some obsolete stuff that isn't compatible...
This kind of problem is clearly of administrative nature, of a system design and of checking which pieces work with which other pieces. Clearly, blame should be assigned to non-existent QA procedures, insufficient unit testing and [obviously] inadequate integration of components. The coder is nowhere here, it's all system design and QA stuff, realm of managers.
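The rev. 2.1 versus rev. 1.7 scenario above is only a guess at what went wrong, but it is the kind of mismatch an integration check can catch before press time. A small sketch in Python; `parse_rev` and `can_read` are hypothetical helpers, and dotted numeric revisions are an assumption.

```python
from typing import Tuple


def parse_rev(rev: str) -> Tuple[int, ...]:
    """Turn a dotted revision string such as "2.1" into (2, 1)."""
    return tuple(int(part) for part in rev.split("."))


def can_read(file_rev: str, max_supported_rev: str) -> bool:
    """True if a consumer that supports up to max_supported_rev
    can handle a file written at file_rev."""
    return parse_rev(file_rev) <= parse_rev(max_supported_rev)


if __name__ == "__main__":
    # Publisher writes rev. 2.1, press software knows rev. 1.7 and below.
    if not can_read("2.1", "1.7"):
        print("refuse to deploy: press software cannot read rev. 2.1")
```

Comparing tuples of integers rather than raw strings matters here: as strings, "1.10" would sort before "1.7".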
Re:Testing? (Score:5, Insightful)
Blame the project manager (hopefully there was one) who should have led thorough testing of the services before deployment. Individual coders shouldn't be held to any legal liability.
Any legal action should be directed towards the 'outside provider' (as noted in the article).
Shift Blame to Testers and Project leader. (Score:2, Insightful)
Where was the testing?
Who decided that there would be no testing?
Who decided that they would simply deploy the thing with no plan B?
Obviously the PHB, you just need to point that out to the VP and you can have the PHB's seat.
computer + printing press = computer (Score:5, Insightful)
So the paper can deliver every day for 158 yrs using mechanical printing presses ~ except where natural disasters occur ....
The printing problems at the Chicago Tribune were related to efforts to upgrade computer equipment used to produce the newspaper, Malone said. The Tribune acquired customized software for the upgrade from an outside provider, and it contained a "coding error," he said.
but as soon as computers are involved their printing press has morphed into a computer system. I wonder what provisions to *test* the upgrade before use were made?
fail to recognise newspaper as computer system?
it would be easy to blame the developers and company and there should be some recognition of responsibility for technical accuracy. but what about the newspaper? they have made a fundamental mistake in not recognising that printing press + computer = computer and let their newspaper system fail at the mercy of a coding mistake.
It seems while the paper can handle *mechanical* failure (158 yrs, 1 non delivery) it has yet to grasp *software* failure.
Where I work... (Score:4, Insightful)
"If you say 'oops', it's OK."
Did he say Oops?
Seriously though...shit happens. That's why you don't bill employees directly for the mistakes they do. Suck it up, learn, and move on.
--
BMO
If it ain't broke, don't fix it! (Score:2, Insightful)
Re:McDonald's (Score:2, Insightful)
We are not living in the same world then. If you screw it up bad enough for someone to get injured or - god forbid - die by it, the figures will probably be 10 million times as big as what you are mentioning.
1,000,000 is nothing... (Score:2, Insightful)
Imagine a satellite, nearing completion, bolted down, and ready for final inspection. Joe Blow forgets to write in the change log that he un-bolted the satellite from the base. Workers come in the next day, do some work after checking the logs, and... the satellite tips over. OOPS... a billion dollars well spent. That is 1,000,000,000.
"Uhm, boss, the good news is we finished the satellite yesterday and... I don't know how to say this, but our last two years of work... well, I sort of... well, I tipped it over and it's destroyed.... "
Or how about the contracts guy who forgets one Zero on a contract. Instead of ten million, the contract reads one million. Of course everyone misses the zero... except the people PAYING. Contracts are signed and oops... "we want to start a new contract, we sort of forgot to add a zero." To which they reply, "Fuck off, you signed it..." and promptly save the company 9 million dollars.
Re:I would get drunk. (Score:5, Insightful)
I think I'd rather debug someone else's assembly language than someone else's perl.
Re:Testing? (Score:2, Insightful)
Testing is Boring (Score:5, Insightful)
These days programmers have a Sword of Damocles hanging over them. Once they finish a major piece of code they may have a hard time finding new work. The economy has not lived up to forecasts of more jobs. Outsourcing has reduced computer opportunities. Management of many companies do not see new uses for computers. Off-the-shelf programs abound for almost every aspect of computerized work.
Stress may distract software engineers enough that someone will make a major mistake.
Re:UAT/QA anyone? (Score:1, Insightful)
Re:It's my first week! (Score:2, Insightful)
Deploying good code is a business decision, not QA's decision.
Re:How to handle $1,000,000 coding error? (Score:1, Insightful)
Check the Jobs section soon (Score:4, Insightful)
Re:why wasn't this caught in testing? (Score:3, Insightful)
And that is why you're on the coding end instead of the decision making end - you'd have a compact, bug-free, featureless product that hit the market three years too late that nobody could afford to buy anyway.
It clearly wasn't a coding error (Score:3, Insightful)
The serious error was in switching to a new system with such clearly inadequate testing.
I'd fire the CTO. (Score:3, Insightful)
Start with the CTO and work your way down. If it's a software problem, why wasn't it discovered sooner? Who was in charge of QA? Who was in charge of making sure QA did their jobs? Who said YES WE CAN DO IT!, lying out their ass?
The fun thing about capitalism is greed and/or the desire for profit leads to systems like this being built by the lowest bidder.
$1mil is nothing (Score:4, Insightful)
What's that you say, this is all the same person? No wonder you had the bug to begin with...
Re:Deployment? (Score:5, Insightful)
Probably in the hands of someone who decided:
(1) Cost of catastrophe: $1,000,000.
(2) Chance of catastrophe: 5%
(3) Cost of setting up parallel system, including hardware, software licenses, system administration: $250,000.
If (1) times (2) is less than (3), then it's actually better not to spend the money on (3).
Of course you can argue with the actual numbers in (1) (2) (3). (1) is the Tribune's own estimate. (2) is estimable by looking at the history of past projects, I'm just guessing 5%. And I just pulled (3) out of the air.
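For anyone who wants to plug in their own numbers, the comparison above works out like this. A back-of-the-envelope sketch in Python using the same guessed figures; the function names are just for illustration.

```python
def expected_loss(cost_of_catastrophe: int, chance_percent: int) -> float:
    """(1) times (2): the average loss you accept by skipping the backup."""
    return cost_of_catastrophe * chance_percent / 100


def parallel_system_pays_off(cost_of_catastrophe: int,
                             chance_percent: int,
                             cost_of_parallel_system: int) -> bool:
    """True when the expected loss exceeds the price of (3)."""
    return expected_loss(cost_of_catastrophe, chance_percent) > cost_of_parallel_system


if __name__ == "__main__":
    loss = expected_loss(1_000_000, 5)
    print("expected loss:", loss)
    print("parallel system pays off:",
          parallel_system_pays_off(1_000_000, 5, 250_000))
    # Chance of catastrophe at which (1) * (2) equals (3), as a fraction.
    print("break-even chance:", 250_000 / 1_000_000)
```

With these figures the expected loss is $50,000 against a $250,000 parallel system, so the chance of catastrophe would have to be above 25% before the backup pays for itself on expected cost alone.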
That said, I bet they do have degraded capacity, and that they used it to print half their papers on Monday and all their papers on Tuesday.
grow canabis, stupid morons.... (Score:4, Insightful)
A) grows 10000x faster than trees
B) makes 10x more pulp per acre
C) uses 100x less water.
D) stick it to the govt.
But would they ever do that? NOOOO coz there are no patents in the process to exploit and oh the trouble of the govt wackos like bush n old guys being so anti-cannabis (to protect their buddies' profits)
I guess they wouldn't want 100s of pot heads heading up to the 100000s acres of weed to take a few home, but what is so wrong with that OTOH?
Re:Do as any knee-jerk slashdotter would... (Score:2, Insightful)
Note that I am not against standardization and cost-cutting per se, but in their situation, there are a lot more tradeoffs than in a traditional manufacturing environment. Putting out a newspaper is difficult as it is (ah, the war stories...), but you would think management would understand that people will notice if the paper doesn't go out in the morning. Advertisers must have flipped!
The only way it would have been worse would have been to happen on a Thursday. You don't really think all of those hefty Sunday papers are printed Saturday night, do you? :)
Re:1 Million? That's nothing! (Score:2, Insightful)
Re:Dogbert Strategy (Score:2, Insightful)
I got out. If you're smart and just keep your lips shut, maybe you will too.
Re:grow canabis, stupid morons.... (Score:4, Insightful)
A) grows 10000x faster than trees
B) makes 10x more pulp per acre
C) uses 100x less water.
D) stick it to the govt.
I think you forgot your "...profit" clause, except here it would say
D) Use a bunch of arguments of dubious value to misdirect attention from the fact that what you really want is to get stoned
Chicago Tribune IT -vs- Tribune Company IT. (Score:1, Insightful)
The key issue here is that Chicago Tribune has a very limited IT group, really it's just a "systems" group. All of the other traditional IT functions have been outsourced to Corporate.
That was fine before the Times-Mirror merger, because Chicago Tribune was the top dog, and could bully all the services they wanted out of Corporate IT. But now the LA Times has that spot.
The good news is, the FUBAR situation on Monday morning was strictly a "systems" issue within Chicago Tribune IT, and not just a "systems" issue, but one where "systems" could direct all the blame to their vendor, which might help them keep their jobs a little longer...
You really only have 1 choice (Score:5, Insightful)
When they come after you, present it as if you were trying to do it right, but somebody wouldn't let you.
If they fire you, sue.
Unless:
a) you work for one of the few companies that actually supports a real team atmosphere, or
b) Everything was done by the book, and you still screwed up.
When someone in an industrial field is forced to work 16 hours a day, 7 days a week, and makes a mistake, the company suffers the ramifications, not the worker (or the worker's family).
Re:It's my first week! (Score:5, Insightful)
I work in newspapers, and have for the past 7 years. The blame for this fiasco should be pinned directly on the project manager. Not the coders, not the people trying to get the thing running, but the project manager. Right in the middle of his fucking forehead.
I've torn the guts out of many newspaper networks upgrading or improving them, but never have I ever put anyone in the position of "If the new system doesn't work, we're fucked." I've always made ab-so-fucking-lutely certain there was a fall back position where the paper would hit the press. I actually had this conversation before:
<Management weenie> What happens if this new server fails?
<me> I haven't touched the old server. If the new one hiccups one whit, we fire up the old box and produce product.
<Management weenie> I don't like that - we've spent a million bucks on the new gear. Delays make me look bad.
<me> Well, if you're willing to man the phones when the advertisers call demanding re-prints of their ads because of human error somewhere, I have no problem with it.
<Management weenie> You're an asshole. I could have you fired.
<me> In this instance, I'm paid to be an asshole. You can't fire me for doing my job.
<Management weenie> Heh. OK, we'll go with your plan.
Not planning some way to get the paper on the press is dereliction of duty, and deserves your professional head to be lopped off.
Is there _no_ professionalism anymore? Fuck, I should be paid more. Morons like that burn me - when you blow up a critical system with no backup, it's not just your livelihood, but that of everyone who depends on that system functioning as needed - it's their livelihood as well. Fucking morons.
Soko
Re:One-line CODE ERROR $60 million - AT&T phon (Score:2, Insightful)
This shows why truly redundant systems should be built using a mixture of different hardware and software developed by independent teams. This would reduce the risk of all devices being hit by the same problem.
Re:Do as any knee-jerk slashdotter would... (Score:3, Insightful)
Re:The Coder? Nothing... (Score:2, Insightful)
Ah, yes, but now you are a step higher on the corporate ladder, and while in conversation with colleagues the finger of blame always points up, in conversation with the boss the finger always points down that ladder. Management is never to blame for bugs.
Re:It's my first week! (Score:3, Insightful)
I was doing a much smaller upgrade this weekend - rebuilding a single server. Before I did anything, I removed the drive, imaged it, and placed it in a very safe place far away from coffee spills and clumsy feet.
If anything went wrong during the rebuild and I'd been unable to bring the new system up by Monday morning, I'd simply slip the old drive back in and continue from where we were on Friday afternoon.
Re:I would immediately fire anyone (Score:2, Insightful)
Maybe you meant strcpy and sprintf (use strncpy and snprintf instead!).
Code safety isn't about the use of individual functions, or even languages (I've managed to DoS a Java app by making it allocate strings forever in a loop... the code was written such that the GC never cleaned up). It's about good practice, often learned through bitter experience (OTOH I can't think of a safe use for gets()...)
Re:grow canabis, stupid morons.... (Score:3, Insightful)
Backout plan? (Score:3, Insightful)
Of course, none of that matters if you haven't placed a "production-like" load on the system before attempting to upgrade production. It is about risk management. You can't remove all risks in an upgrade, but you should be able to manage them assuming the risks are all documented and provided to stakeholders. If you've told the stakeholders all the risks clearly and in writing and pointed out that there was no test system, no DR system or no good backout plan that wouldn't impact half the customers **AND** they still decided to go forward, oh well. Their decision.
management stupidity (Score:3, Insightful)
In the past, creating single points of failure was hard: you had lots of men working on lots of printing presses. You couldn't do something as stupid as replacing them all in a single night--it just wasn't physically possible. Computers have just given greedy management the freedom to make more serious mistakes in a shorter amount of time. In this case, the mistake was upgrading a whole infrastructure at once and believing, naively, that that would necessarily go smoothly.
Re:You Slashdotted Illinois (Score:3, Insightful)
He should suffer! Everyone knows that slashdotting Illinois is the job of the highway department.
Re:It's my first week! (Score:5, Insightful)
"We need to reduce spending in non-core areas!" IT usually ends up being defined as non-core (unless you're an IT company).
Suddenly management questions you if you want to buy so much as a network hub (el-cheapo consumer grade at that - not for infrastructure). You have to justify any expenditure, and so the guys on the bottom just stop asking since it is such a pain.
I'm sure anybody on that failed project could have identified steps that would have yielded a fallback. They could have built a new server, and then switched it out with the old server and kept the old one ready to go in an emergency for a couple of weeks. But that would require a $2000 server requisition - or maybe $3000 since the corporate standard was picked by some idiot on the vendor's kickback list.
For the guy on the bottom, they look bad for asking for money, and chances are that the fix would have worked fine with no failsafes at all - the last 15 upgrades probably did. He has to ask for money each time, and will have nothing to show for it.
On the other hand, every person on that project was probably thinking the same thing. Sure, spending $2k is a good business decision, but upper management wouldn't recognize that, so let's just not ask. We won't point out how much we're saving on server hardware by not having backups - we'll just let our overall expenses speak for themselves and not call attention to our negligence. And then we'll get promoted year after year and if something goes wrong we just all look dumb and nobody understands computers anyway so management will just figure that these costs come up any time you use one.
And you know what? This approach usually works in the end.
The real responsible party is the one which made cost-cutting-at-any-cost the corporate line. Oh, sure, the corporate policies usually have exception clauses, but what bottom-rung employee is going to bother running a request up 12 links of the chain of command just to spend an extra $1000 on hardware? The opportunity to use it would pass before it ever got approved.
The problem is the question-everything approach of corporate fiduciary management. Sure, there is waste out there, but it doesn't take many botched migrations to dwarf what you save by pinching pennies...
Re:You forgot... (Score:3, Insightful)
Re:You forgot... (Score:4, Insightful)
My experiences tell me cannabis is a much more desirable drug than alcohol, both from the users and society's point of view.
Use both drugs with some sense and nothing bad will happen. Overdo alcohol and it will make you loud and often aggressive. Overdo cannabis and you will fall asleep (which can be loud but seldom aggressive). Neither is very suitable for driving. (Although I prefer people who smoked over people who drank: they drive more relaxed.)
It is when talking addiction that the large difference arises. Alcohol is a hard drug, you get physically addicted; cannabis is not. Alcohol demolishes you while it degrades you. Cannabis use over long periods is claimed to deteriorate memory. (So, don't drink to forget, smoke! ;) If you smoke the cannabis (instead of eating it) you get the same risks as with tobacco use.
Here in Belgium, cannabis is more or less legal now (we are allowed to carry up to 3.3 grams on the street and use it in private places and such). It is a good thing, because we did that anyway (I live about 40 kilometers from the closest cannabis shop in the Netherlands where I can buy as much as I like legally).
There were no sudden changes in behaviour. No millions extra addicts, no stepping stones, nothing. The people who are inclined to (ab)use drugs usually do not care about legality.
Re:It's my first week! (Score:2, Insightful)
Spending the money to parallel the new system until it's clear it works looks expensive until something like this happens -- then it looks cheap.
I also notice you have a PLAN for what to do if things go wrong. That is also very smart. When things break and everyone panics, it's good to be able to just pick up a procedure worked out in calmer moments and go with it.
Re:McDonald's (Score:2, Insightful)
The difference between dying from obesity, and dying from food poisoning is that the first is your OWN decision
Re:You forgot... (Score:3, Insightful)
The biggest problem of drugs is their connection to organized crime. They have little or no morals and are only interested in profit and making you addicted to the thing... If we accepted the fact that people DO take drugs on a daily basis and that we should help them, things would change. But because of the NIMBY problem people just want the problem to 'go away'...
Someone proposed to open centers (Forgot the English term) where you could come get clean sterilized needles and medical care to take your drug... The argument being that if you don't help those in trouble, they will STILL take the drug with old (infected) needles...
It got a wave of protest because people didn't want this around... In the meantime infections continue to spread and the problem doesn't solve itself...
Of course, the best way would be to legalize every drug, to restrict possession to REALLY small doses and to restrict the sales to the government... If you don't follow those rules, you get fined, and not 5 years prison like in the US...
Of course, I'm just dreaming in technicolor(c) here...
Re:Just one (Score:4, Insightful)
You know.. being involved in such an accident changes you for life, it's not like most people who get involved in this will ever be able to put it aside and forget about it.
Adding social pressure to that is not going to solve much at all, not for the victims either.
No matter how terrible the results, accidents happen, and we'll have to live with that. Yes, we need to deal with the consequences, but an attitude that results in more people paying for the rest of their life as a result of accidents is not going to accomplish that; it is only going to generate more 'guilty' people who are too stuck in solving their guilt issue and can't contribute to society as a whole as a result.
Re:Do as any knee-jerk slashdotter would... (Score:2, Insightful)
Newspapers are 24/7 operations, usually with a mishmash of various vintages of systems all intricately tied together, so it is *very* difficult to schedule and perform upgrades. It would be quite tempting while setting up new servers to update to the latest server software, which requires new desktop software, etc., etc.
The problem with doing this, of course, is that if something breaks, you have precious little idea where to look for the problem. It sounds like everything was tested, and the 'folks who know' were long gone by the time trouble began.
It's not really surprising that problems cropped up - I've been involved in newspaper software/IT before, and that's par for the course with these systems. What does surprise me, however, is their apparent inability to deal with the situation, either by rolling back to a previous system, using a series of workarounds, etc.
At the papers I've dealt with, the attitude of "the show must go on" extends well into the server room. There are thousands of critical functions that can go awry, not just with the publishing system, but with presses, satellite news feeds, etc., yet somehow the paper ALWAYS goes out. (I guess, even in this case, it did get done to some degree.) The level of determination and cleverness this elicits from people is an amazing sight.
It sounds like the Trib has lost some of that sense of "whatever it takes," which is a shame.
So I'd blame it on inadequate investment in staffing and backup/alternate systems (it was standard practice to literally have "two of everything" for exactly this sort of situation), and lack of access to knowledgeable support from the vendor (it is CCI *Europe*, after all.)
I feel for those involved; I really do. It's easy to watch from a distance and say "they should have known! they should have planned ahead!" But the reality is that everyone who runs those systems has their fingers crossed every minute of every day, hoping they're ready when the shit hits the redundant cooling units for the computer room...
Re:It's my first week! (Score:2, Insightful)
One guy alone does not make a mistake like this (Score:2, Insightful)