Date: 2017-06-02
British Airways IT Outage Caused By Contractor Who Accidentally Switched off Power (independent.ie) 79
An anonymous reader shares a report: A contractor doing maintenance work at a British Airways data centre inadvertently switched off the power supply, knocking out the airline's computer systems and leaving 75,000 people stranded last weekend, according to reports. A BA source told The Times the power supply unit that sparked the IT failure was working perfectly but was accidentally shut down by a worker.
Did they try... (Score:3)
Re: Did they try... (Score:1)
text book example of a "career changing event"
Re: (Score:2)
Re: (Score:2)
Not necessarily. I've been in a few of these "career changing events" over the years. If I make a mistake, I step forward, take responsibility and fix the problem (if I can). Managers are less likely to punish someone who comes forward immediately. In other cases where blame must be assigned, I've already documented my actions and sometimes the actions of those around me. If my CYA is stronger than everyone else's, I'm not going to get blamed for something that I didn't do.
Re: (Score:2)
Yes they did and when they did the incoming surge burnt down the system.
Re: (Score:2, Insightful)
Bullshit.
Even a brand new IT graduate knows computers should be plugged into UPS devices that protect against this.
Handling power outages is about as basic an IT task as they come. Basic Lock Out practices that prevent power from accidentally being turned off are also Server Maintenance 101.
For this to actually have been the cause means their IT organization was run by rank amateurs.
Re: (Score:2)
A sort of similar thing happened to me, in 1991, running netmare, at a business location. Conscientious employee made sure she shut the office down at the end of business. I got to run out there the next morning.
I fixed it with duct tape over the power switch ('server' was a desktop, AT power supply). Wrote 'touch this and die' on it in sharpie. Arranged to have 'server closet' locked, which was good for cutting down dust as well.
Can I be BA director of IT now? I'm obviously better qualified, despite be
Re: (Score:2)
Re:Did they try... (Score:4, Interesting)
While there's a lot of mocking of BA going on at the moment, that's actually a pretty easy situation to get into if you've expanded a DC - or even a regular equipment room - over several years without proper power management, and BA is far from the first company to be caught out. So, if you are responsible for some IT equipment rooms, here are two things to consider: what's the combined total power draw of all the equipment in each room on power on (don't forget to include any UPS units topping up their batteries!), and what's the maximum power load that can be supplied to each room? If you can't answer both of those, or at least be certain that the latter exceeds the former in each case, then you've potentially got exactly the same situation as BA.
None of which excuses BA from not having the ability to successfully failover between redundant DCs in the event of a catastrophic outage at one facility, of course.
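The two questions above (combined power-on draw versus maximum supply per room) reduce to a simple headroom check. The following is a minimal sketch, not anything BA or the commenter used: the `EquipmentRoom` type, the room names, and all wattage figures are hypothetical, and it assumes you already have per-device power-on figures that include UPS battery recharge.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class EquipmentRoom:
    """Hypothetical record for one equipment room (all figures in watts)."""
    name: str
    max_supply_watts: int                 # maximum load the room's feed can deliver
    power_on_draw_watts: List[int] = field(default_factory=list)  # per-device draw at
                                          # power-on, including UPS recharge


def headroom_watts(room: EquipmentRoom) -> int:
    """Supply minus combined power-on draw; a negative value means overload."""
    return room.max_supply_watts - sum(room.power_on_draw_watts)


def rooms_at_risk(rooms: List[EquipmentRoom]) -> List[str]:
    """Names of rooms whose combined power-on draw exceeds what can be supplied."""
    return [room.name for room in rooms if headroom_watts(room) < 0]


if __name__ == "__main__":
    # Made-up example numbers, purely for illustration.
    rooms = [
        EquipmentRoom("room-a", 10_000, [3_000, 3_000, 2_500]),
        EquipmentRoom("room-b", 10_000, [4_000, 4_000, 3_000]),
    ]
    for room in rooms:
        print(room.name, headroom_watts(room))
    print("at risk:", rooms_at_risk(rooms))
```

A room that shows up in `rooms_at_risk` is one where restoring power to everything at once could overload the feed, which is the scenario described above.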
Re: (Score:2)
There are two ways to engineer power for a datacenter: 1) you can engineer for maximum efficiency/lowest cost, or 2) you can engineer for redundancy/max safety. Penny Pinchers always choose the former, and IT guys usually want the latter.
Here is the real equation: Cost * likelihood of catastrophic event. If you think 100,000 * a .0000001 chance of catastrophe, you err on the side of savings. On the other hand, if you think $25 * 100.00 chance of catastrophe, you err on the side of cost.
My guess, is that they
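The "Cost * likelihood" rule in the comment above is an expected-loss calculation. Here is a rough sketch of it, assuming probabilities are expressed as fractions between 0 and 1; the function names and the mitigation-cost comparison are illustrative additions, not something the commenter specified.

```python
def expected_loss(cost: float, probability: float) -> float:
    """Expected loss = cost of the event * probability that it happens."""
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must be between 0 and 1")
    return cost * probability


def worth_mitigating(outage_cost: float, probability: float,
                     mitigation_cost: float) -> bool:
    """True when the expected loss exceeds what the safeguard would cost.

    This is a deliberately crude decision rule: it ignores risk aversion,
    reputational damage, and how uncertain the probability estimate is.
    """
    return expected_loss(outage_cost, probability) > mitigation_cost


if __name__ == "__main__":
    # First figure pair from the comment: 100,000 at a .0000001 chance.
    print(expected_loss(100_000, 0.0000001))
    # Hypothetical comparison against a 5,000 safeguard.
    print(worth_mitigating(100_000, 0.0000001, 5_000))
```

The weak point of this rule is the probability estimate: underestimating the likelihood of the outage makes skimping look rational right up until the event happens.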
Re: (Score:2)
Re: (Score:2)
"Holy Mother Of All Single Point of Failures, Batman!"
Well, if the contractor is like some of the ones I know, he will justly say, "I was instructed to turn off the switch . . . not to turn it back on again!"
Which brings us to the obvious point: Which British Airways employee was responsible for the work being done? Blaming the lowly contractor is a complete shift of the blame to someone who obviously couldn't know any better.
Or is British Airways an example of "Contractors . . . all the way down" . . .
Re: (Score:2, Insightful)
Re: (Score:2)
If you can't see the difference in the two articles you have bigger problems than being in the matrix.
Re: (Score:2)
The difference is British Airways shifting the blame to someone else.
Re: (Score:2)
Right, the story has been updated and so news websites (and sites that pretend to be news websites) post a new article about it. Slashdot is great at dupes, this isn't one though.
LOL (Score:2)
Seems like this 'test' to see if the UPS would kick in didn't work.
So the CEO _should_ resign after all.
Re: LOL (Score:4, Insightful)
Right. It's not the poor guy that turned off the power supply. It's the shit-for-brains managers who wouldn't let the engineers put in redundant power supplies and hired cheap labour that had no clue how to architect for fault tolerance.
Re: (Score:2)
who wouldn't let the engineers put in redundant power supplies
That's an interesting assumption. Have you seen anything even remotely indicating that the data centre didn't have redundant power? No amount of redundancy has ever withstood some numbnuts pushing a button. But I'm interested to see your knowledge of the detailed design of this datacentre.
Hell we had an outage on a 6kV dual fed sub the other day thanks to someone in another substation working on a wrong circuit. He was testing intertrips to a completely different substation, applying some power to an intert
Re: (Score:2)
He should resign because he apparently relied on a single UPS.
Re: (Score:2)
That doesn't help if there is one master switch, in case of (for example) fire, and he activated it.
Re: (Score:2)
That doesn't help if there is one master switch, in case of (for example) fire, and he activated it.
More like an extension cord stretched across a busy walkway just waiting for someone to trip on it.
Re: (Score:2)
At the moment, there is no reasonable way to tell between various scenarios.
It could go all the way from 'worker pressed big red button despite being told not to, signs telling him not to, and having signed an agreement not to', to 'worker followed what they believed was procedure and did what 99% of people would have done', to 'worker did precisely as instructed and is being scapegoated'.
Re: (Score:2)
Worker: The sign says "Do not use"
Manager: I don't care what it says, flip the switch
Worker: That's a really stupid idea
Manager: Do it, or you're fired
Worker:
Manager: Well, now you really screwed things up, you're fired!
Re: (Score:2)
We also know one other thing: no one up in management will accept responsibility. All upper managers will be shielded from personal res
Re: (Score:2)
Switches such as that should be locked out, requiring multiple people to allow access.
If you have a switch like that accessible so that just anyone can flick it off, you are an idiot.
Re: (Score:2)
'flick it off' may include 'opened the interlocks and keyed in the code as he believed he was doing the correct thing'.
This could be a personal failure due to stupidity, a training failure, or he was in fact instructed to turn it off, and though he protested, is now getting scapegoated.
That still doesn't explain why (Score:1)
they didn't just switch over to their DR site.
Re: (Score:2)
they didn't just switch over to their DR site.
You forgot the mic drop.
Bright side (Score:3)
Floor got cleaned cheaply and everyone got home early. Long live outsourcing!
Of course I didn't RTFA! With respect to outsourcing there's no difference between strategic and daily tasks like cleaning and strategic planning. Both need to be done short and long term. I can understand outsourcing occasional tasks but daily and strategic stuff will always be needed. Outsourcing of those tasks is a sign of utterly bad management.
Re: (Score:1)
Outsourcing critical functions to experts is good management or when you get sick do you try to do the surgery yourself? At one point company towns used to have company doctors. Now we have hospitals. Much better as a doctor working for the company will try to push you back to work.
Same way in the past we had in house IT techs. Now it's done by companies who specialize in running IT systems.
Stop with your dream of being a company IT Tech in a company town.
I know the feeling (Score:1)
Been there, sort of done that.
Years ago I was in the basement of a 5-star hotel in South Africa, busiest time of the week, everyone was checking out, and I had to install a simple little Novell Netware to internet gateway machine, and there was one spare port on the power strip. Something shouted out in my head, "Don't put it in that one!", but I thought "The machine supplied tests fine, the cable is approved... what could possibly go..." *BLAM*, everything went down and took a few hours to get back up as
Out of band (Score:2)
N+1 guess not (Score:3)
So it was all running in a single DC with a single power bus? Plenty of room at real datacenters; they need to stop running out of a closet somewhere.
Yeah, yeah... blame the contractor... (Score:5, Insightful)
Re: (Score:2)
Design error you probably mean?
Operational error. Backup systems need to be periodically checked to see if they are still working as designed. If the backup system got tested and failed to work, it would then be a design error.
What the heck does this switch do? (Score:3)
Re: (Score:2)
Stephen Stucker unavailable for comment (Score:3)
"Just kidding!"
Root Cause (Score:2)
What they MEANT to say is that. . . (Score:2)
. . . . the power was turned off by a FORMER contractor.
Then again, BA probably promoted him to executive VP.. .
Human Error? Sure but Still... (Score:2)
Human Error accounts for 99% of actual power outages in my experience. It's ALWAYS some idiot throwing the wrong switch, unplugging the wrong thing, yanking the wrong wires or spilling something in the wrong place...
You simply cannot engineer around stupid well enough to fix it, regardless of how hard you try..
That being said... For a mission critical system in a multi-million dollar company like BA where was the backup site in a different geographic location that was configured to take over in the not-so-
This is the ultimate single point of failure. (Score:2)
The first thing I think of is anything happening at that location - flood, bomb, larger grid outage lasting more than a day or so - and BA is finished.
Heck if you were a terrorist now you know exactly where to attack that would truly hose an entire company that brings in a lot of money (and people) to England...
Re: (Score:2)
They had an offsite DR. The DR was set up wrong and did not have the latest data, so when they switched to it they started seeing wrong data and had to switch it off.
Re: (Score:2)
They thought they had DR. They were wrong.
Somebody responsible should have signed off on the plans and routine testing schedule for that. It is a key job responsibility.
Re: (Score:2)
You simply cannot engineer around stupid well enough to fix it, regardless of how hard you try..
"Nothing can be made foolproof . . . because fools are so ingenious."
not the contractor's fault (Score:4, Insightful)
When your business depends on your IT infrastructure like that, turning off the power to a single machine or data center shouldn't bring down your operation; that's just stupid and bad design. Good enterprise software provides resilience, automatic failover, and geographically distributed operations. Companies need to use that.
And they should actually have tests every few months where they do shut down parts of their infrastructure randomly.
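The random-shutdown drill suggested above can be modelled as a simple check before trying it on real infrastructure. This is a toy simulation under stated assumptions, not a real fault-injection tool: the site names, capacity units, and `run_drill` helper are all hypothetical, and the only question it answers is "if one randomly chosen site goes dark, is there still enough capacity left?"

```python
import random
from typing import Dict, Set, Tuple


def surviving_capacity(sites: Dict[str, int], down: Set[str]) -> int:
    """Total capacity of the sites that are still up."""
    return sum(cap for name, cap in sites.items() if name not in down)


def run_drill(sites: Dict[str, int], required_capacity: int,
              rng: random.Random) -> Tuple[str, bool]:
    """Pick one site at random, 'switch it off', and report whether
    the remaining sites can still carry the required load."""
    victim = rng.choice(sorted(sites))
    survived = surviving_capacity(sites, {victim}) >= required_capacity
    return victim, survived


if __name__ == "__main__":
    rng = random.Random()  # seed this for a reproducible drill schedule
    redundant = {"dc-a": 100, "dc-b": 100}   # hypothetical capacity units
    single = {"dc-a": 100}
    print(run_drill(redundant, 100, rng))
    print(run_drill(single, 100, rng))
```

In this model a single-site setup fails every drill, while two sites that can each carry the full load pass regardless of which one is chosen. A real drill also has to verify that the surviving site's data is current, which a capacity check alone does not cover.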
Re: (Score:2)
Systems in a data center should have two different power systems. The contractor shut one of them down to do some work. That should have been fine. I would guess that the work was to replace or repair some of the power infrastructure. The most likely situation here is that the contractor switched off the wrong one, and the correct one was already off (possibly due to the failure for which the contractor was called in the first place, or else someone had already shut it off for him).
Process errors like t
Re: (Score:1)
Shit happens (Score:1)
Simple case of CEO GREED .. (Score:1)
He gutted the knowledgeable staff and replaced with inexperienced outsourced help.
Incoming power would/should have been the first thing checked.
Mental picture from the movie 'Airplane' (Score:3)
of Johnny unplugging the extension cord from the wall and the lights on the runway going out. "Just kidding!"
https://datacenteroverlords.files.wordpress.com/2017/01/airplane.jpg
Yeah, sure (Score:2)
This is just one step up from the cleaner killing a patient because they unplugged the life support machine to vacuum in the room.
Pull the other one, it's got bells on it.
Why the power went out is unimportant (Score:2)
As for the power outage - A UPS test to check if power transferred to battery/generator that failed maybe?
A bigger boy did it and ran away... (Score:2)
Sounds like a load of baloney to me and really explains nothing. Sounds, in fact, like a cover up from someone who doesn't understand the implications of their lie.
It still doesn't explain why everything went down so catastrophically. Why was there only one power source? What about backup servers and other redundant systems? Why was it so easy for a contractor to switch the power off? Was he following procedure? What about redundancy? Why couldn't he just switch it back on again (I know, but if it's such a
How does one DR test in a 24/7 business? (Score:2)
I've worked in banking and real estate businesses where we had the luxury of being able to DR failover test things like redundant databases, WAN connections, power supplies...etc - knowing that if something failed we had time to put it back together - before the business and customers would notice the outage.
How does one actually fail-over test things in production in a 24/7 business - especially one that spans time zones all across the world?
Are lab simulations simply enough? I've never seen a lab environ
Re: (Score:2)
You do it in production because none of it should cause a massive failure. They bought a DR site and failed to test it. Working at some big shops the DR site was prod every other quarter.
Blame the Worker for Management's Incompetence (Score:1)
It was not working perfectly at all - there was a single point of failure, poor design with no redundancy responsible for critical infrastructure, clearly approved by senior management.
So no, it wasn't a contractor responsible for the outage. It was the CEO who did not ensure there were redundancies in place on critical infrastructure, business continuity was not tested and disaster recovery was not a thing.
Did they try to turn it off and on again? (Score:2)
:)
You can only idiot-proof so much (Score:2)
It's good practice to make things so simple that no one could possibly mess them up. It works in programming - look at how many JavaScript frameworks abstract an already sandboxed development environment to a point where "signalling intent" is basically all the developer needs to do. Or in hardware -- we're using HPE servers and there is literally a "don't remove this drive" light that comes on when a drive fails in a RAID set. That had to be a customer-requested change after one too many data-loss events s