Want to read Slashdot from your mobile device? Point it at m.slashdot.org and keep reading!

British Airways IT Outage Caused By Contractor Who Accidentally Switched off Power (independent.ie) 262

Posted by msmash on Friday June 02, 2017 @10:00AM from the getting-to-the-bottom-of-things dept.

An anonymous reader shares a report: A contractor doing maintenance work at a British Airways data centre inadvertently switched off the power supply, knocking out the airline's computer systems and leaving 75,000 people stranded last weekend, according to reports. A BA source told The Times the power supply unit that sparked the IT failure was working perfectly but was accidentally shut down by a worker.

This discussion has been archived. No new comments can be posted.

British Airways IT Outage Caused By Contractor Who Accidentally Switched off Power

Load All Comments

Search 262 Comments Log In/Create an Account

Comments Filter:

Did they try... (Score:5, Funny)

by 110010001000 ( 697113 ) writes: on Friday June 02, 2017 @10:03AM (#54534069) Homepage Journal

...turning it on again?

Share
twitter facebook
- Re: Did they try... (Score:5, Funny)
  
  by Anonymous Coward writes: on Friday June 02, 2017 @10:16AM (#54534189)
  
  text book example of a "career changing event"
  
  Parent Share
  twitter facebook
  - Re: (Score:2)
    
    by DickBreath ( 207180 ) writes:
    
    The individual will merely have to change which contractor employs them. In this world, education is now proportional to the number of contractors you've been employed by. Sad. Terrible.
  - Re: Did they try... (Score:5, Interesting)
    
    by __aaclcg7560 ( 824291 ) writes: on Friday June 02, 2017 @10:57AM (#54534673)
    
    text book example of a "career changing event"
    Not necessarily. I've been in a few of these "career changing event" over the years. If I make a mistake, I step forward, take responsibility and fix the problem (if I can). Mangers are less likely to punish someone who comes forward immediately. In other cases where blame must be assigned, I've already documented my actions and sometimes the action of those around me. If my CYA is stronger than everyone else's, I'm not going to get blame for something that I didn't do.
    
    Parent Share
    twitter facebook
    - Re: Did they try... (Score:5, Insightful)
      
      by LS1 Brains ( 1054672 ) writes: on Friday June 02, 2017 @11:33AM (#54535071)
      
      I've been in a few of these "career changing event" over the years. If I make a mistake, I step forward, take responsibility and fix the problem (if I can).
      
      As an IT Manager/Director, THANK YOU. Everyone screws up at some point, it's what you do after that really matters.
      
      Parent Share
      twitter facebook
      - Re: Did they try... (Score:5, Interesting)
        
        by sabri ( 584428 ) writes: on Friday June 02, 2017 @04:17PM (#54537945)
        
        I pay people to not screw up so if you do I'm terminating you and finding someone competent.
        Which would be stupid. What do you think the chances are that this guy will repeat this mistake?
        
        Here is a story a friend of mine once told me. He was working on an AS migration of a major telco, when he made a big boo-boo causing a huge outage for hundreds of thousands of subscribers, making headline news. The next morning he got called into his boss's office, expecting to be fired. He was not. The reason why?
        
        His boss argued that this mistake made him more valuable, since he would not be making that mistake ever ever again.
        
        Parent Share
        twitter facebook
        
        Re: (Score:3)
        
        by Razed By TV ( 730353 ) writes:
        
        His boss argued that this mistake made him more valuable, since he would not be making that mistake ever ever again.
        I believe there is wisdom in this, but there is a prerequisite.
        The person must have the capacity to learn.
        
        I currently have the pleasure of working with someone who must repeat the same mistakes before he learns from them.
        He breaks off, on average, one screw a month.
        It's always the same.
        *WHIRR*
        *SNAP*
        "Oh crap!"
    - - Re: Did they try... (Score:5, Funny)
        
        by __aaclcg7560 ( 824291 ) writes: on Friday June 02, 2017 @12:18PM (#54535501)
        
        Almost forget... 3) Kiss my shiny metal ass!
        
        Parent Share
        twitter facebook
        
        Re: (Score:3)
        
        by __aaclcg7560 ( 824291 ) writes:
        
        You're a middle-aged man counting Slashdot karma points?
        Nope. I'm a middle-aged man who wrote a Python script to scrap my comment history. I'll run the script and post the stats when I get home.
  - Re: Did they try... (Score:5, Insightful)
    
    by jellomizer ( 103300 ) writes: on Friday June 02, 2017 @01:00PM (#54535919)
    
    You mean for the Executive who didn't approve of the hot offsite fail over solution ?
    You know the stuff that normal large organizations have to make sure their business can be operational.
    
    Parent Share
    twitter facebook
    - Re: Did they try... (Score:4, Insightful)
      
      by lactose99 ( 71132 ) writes: on Friday June 02, 2017 @02:02PM (#54536615)
      
      "we didn't budget for that"
      "well does your budget include a multi-day downtime when the primary site goes offline?"
      "now how could the primary site possibly go offline?"
      Unfortunately I run into this far more than I should in this industry.
      
      Parent Share
      twitter facebook
  - Re: (Score:3)
    
    by lactose99 ( 71132 ) writes:
    
    Resume line-item:
    - Single-handedly tested entire DR operation of British Airways
- Re:Did they try... (Score:5, Interesting)
  
  by Zocalo ( 252965 ) writes: on Friday June 02, 2017 @10:28AM (#54534333) Homepage
  
  Apparently that was what led to the major outage turning into a prolonged major outage. It seems that the sequence of events is now that the contractor turned off the power (presumably killing one of more phases), obviously leading to a large scale hardware shutdown. Someone (the same contractor, most likely) tried to restore power, as you would, only to find that the power surge of all that hardware switching back on at the same time, which often means that they are close to maximum power draw, overloaded the system and caused physical damage to the hardware.
  
  While a there's a lot of mocking of BA going on at the moment, that's actually a pretty easy situation to get into if you've expanded a DC - or even an regular equipment room - over several years without proper power managment, and BA is far from the first company to be caught out. So, if you are responsible for some IT equipment rooms, here's two things to consider; what's the combined total power draw of all the equipment in each room on power on (don't forget to include any UPS units topping up their batteries!), and what the maximum power load that can be supplied to each room? If you can't answer both of those, or at least be certain that the latter exceeds the former in each case, then you've potentially got exactly the same situation as BA.
  
  None of which excuses BA from not having the ability to successfully failover between redundant DCs in the event of a catastrophic outage at one facility, of course.
  
  Parent Share
  twitter facebook
  - Re:Did they try... (Score:5, Interesting)
    
    by Archangel Michael ( 180766 ) writes: on Friday June 02, 2017 @10:42AM (#54534479) Journal
    
    There are two ways to engineer power for a datacenter. 1) You can engineer for maximum efficiency/lowest cost or you can engineer for redundancy/max safety. Penny Pinchers always choose the former, and IT guys usually want the latter.
    Here is the real equation: Cost * likelihood of of catastrophic event. If you think 100,000 * a .0000001 chance of catastrophe, you err on the side of savings. On the other hand, if you think $25 * 100.00 chance of catastrophe, you err on the side of cost.
    My guess, is that they didn't account for business losses when plugging in that (obviously over simplified) formula. This is why you leave penny pinching idiots out of the decision making, because when all you see is cost, and don't properly evaluate the catastrophic losses in event of disaster, then you're just an idiot that nobody should listen to.
    I get that there are budgets and such, but here is my one question I (IT guy) ask the "business" decision makers: If you lost everything, how much would it cost you? Most people undervalue the data inside the databases and documents, because they have no way of quantifying how much all that data is worth.
    Data, is the biggest unaccounted for asset of a business.
    
    Parent Share
    twitter facebook
    - Re:Did they try... (Score:5, Insightful)
      
      by swb ( 14022 ) writes: on Friday June 02, 2017 @11:06AM (#54534751)
      
      I think they also suffer from what I call "efficiency savings hoarding".
      If you have a process that requires 10 labor inputs to achieve and you buy a machine that reduces it to 5 labor inputs, your ongoing savings isn't really 5 labor inputs. You have to spend some of that labor savings in keeping the machine maintained and operational and investing in its replacement when it reaches end of life.
      When I started working for a company in 1993, they had some 40 secretarial positions whose workload was about half spent doing correspondence and scheduling meetings. In 2001, thanks to widely deployed email/calendaring system they had cut about 30 of those positions because internal meetings could be automatically planned via email and the bulk of internal correspondence shifted from paper memos to email.
      Yet when it came time to expand/replace the email system due to growth it was seen as a "cost". I actually got the project approved by arguing that the cost of the replacement was actually being paid for by the savings realized from fewer administrative staff -- they still had ample savings (the project was less than 1 administrative FTE). But the efficiency gain from the project wasn't free on an ongoing basis.
      Too many business gain efficiencies and savings from automation, but assume these are permanent gains whose maintenance incurs no costs.
      I have an existing client with a large, internally developed kind of ERP system that supports a couple of thousand remote workers. The system is aging out (software versions, resources, performance issues all identified by their own internal developer) and of course the owner is balking at investing in it without realizing that the "free money" from reduced in-office staff needed to process faxes, etc, needs to be applied to maintaining the system to keep achieving the savings.
      
      Parent Share
      twitter facebook
      - Re: (Score:3)
        
        by PPH ( 736903 ) writes:
        
        The overarching problem is quantifying the value of the functions being performed. You can compare the costs of a clerical staff versus that of an e-mail/calendaring system. But it's difficult to figure out in a business setting what these functions are worth.
        My boss would wail and cry over the inability to peruse the individual schedules of all of his minions for the purpose of calling yet another self-aggrandizing staff meeting. And he would assign a very high value to this function. But back in the 'old
  - Re: (Score:2)
    
    by interkin3tic ( 1469267 ) writes:
    
    As someone who doesn't work in IT I have to ask, what are the chances of other big organizations learning from this? Are we talking other airlines will make sure they avoid the exact same scenario but don't bother putting any additional resources to other IT disasters, or are we talking other companies laugh at BA's customers and then cut IT support?
  - Re: (Score:2)
    
    by will_die ( 586523 ) writes:
    
    They could easily be well under maximum power draw in normal usage since the switch-on surge can be multiple times larger.
- Re: (Score:2)
  
  by PolygamousRanchKid ( 1290638 ) writes:
  
  "Holy Mother Of All Single Point of Failures, Batman!"
  Well, if the contractor is like some of the ones I know, he will justly say, "I was instructed to turn off the switch . . . not to turn it back on again!"
  Which brings to the obvious point: Which British Airways employee was responsible for the work being done? Blaming the lowly contractor is a complete shift of the blame to someone who obviously couldn't know any better.
  Or is British Airways an example of "Contractors . . . all the way down" . . .
- Re: (Score:2)
  
  by mspohr ( 589790 ) writes:
  
  Isn't that how you fix Windows computers?
- - Re:Did they try... (Score:5, Insightful)
    
    by sycodon ( 149926 ) writes: on Friday June 02, 2017 @10:34AM (#54534399)
    
    Bullshit.
    Even a brand new IT graduates knows computers should be plugged into UPS devices that protect against this.
    Handling power outages is about as basic of an IT task as they come. Basic Lock Out practices that prevent power from accidentally being turned off is also Server Maintenance 101.
    For this to actually have been the cause means their IT organization was run by rank amateurs.
    
    Parent Share
    twitter facebook
    - Re: (Score:2)
      
      by HornWumpus ( 783565 ) writes:
      
      A sort of similar thing happend to me, in 1991, running netmare, at a business location. Conscientious employee made sure she shut the office down at the end of business. I got to run out their the next morning.
      I fixed it with duct tape over the power switch ('server' was a desktop, AT power supply). Wrote 'touch this and die' on it in sharpie. Arranged to have 'server closet' locked, which was good for cutting down dust as well.
      Can I be BA director of IT now? I'm obviously better qualified, despite be
      - Re:Did they try... (Score:5, Interesting)
        
        by zeugma-amp ( 139862 ) writes: on Friday June 02, 2017 @01:22PM (#54536197) Homepage
        
        Many moons ago I was working in a datacenter and we had a crew in the hallway hanging wallpaper. In order to do so, they had to remove the box that normally covered the Emergency Power Cutoff switch (which was actually a Big Red Button) that would instantly drop power to the whole room. I'm sure you can guess where this is going...
        One of the paper hangers bumped into the BRB, and *poof*, there went the power to the room. In the data center, we were in the middle of a shift change. My coworkers and I were standing around discussing handoff, and whatnot. Suddently, we heard a huge Boom as a crapload of switches tripped all around us. Then we heard the drives and fans spinning down.
        As I said, this was a ways back. In the data center we had 12 HP-3000/70 minicomputers, a couple of VAX 11/780s, and a water-cooled IBM 3090 mainframe that were our main systems in the room. The disk drives on the HPs were disk packs of 16" platters sitting in drives the size of a small washtub. They produced a lot of noise. Each HP3K had about 8 or 9 of these things daisy-chained behind the system itself.
        The room was loud. All the time. Well, when the power dropped, all those drives started to spin down. We were all just kind of standing around looking at each other, not knowing what had happened. You could hear the pitch of all those drives winding down, becoming a lower and lower note, until finally - silence.
        Simon and Garfunkle had a song called "The Sound of Silence" many year even further back into the dim reaches of time from when all this was taking place. This was the first instance in my life when I really understood what silence actually sounded like. It was eerie. You never heard silence in the computer room. You have UPS, generators and all kind of other things to make sure you never actually heard silence.
        So, there we were, standing around with our mouths hanging open, and listening to the eerie silence. The moment broke, and we quickly determined what had happened. Rather than just cut the power back on, we went through and powered off all the drives and such so we could slowly bring everything back up in an orderly fashion.
        One thing that I learned that day was that HP-3000 minicomputers contain a battery designed to allow the things to ride through such catastrophes. Out of the twelve HPs, once we had powered back on all of the drives, nine of them just started executing their next instruction and continued on as if nothing whatsoever had happened. Three needed to be coldstarted, which wasn't a really big deal. Within 30 minutes or so of power being brutally disconnected we had all of them running smoothly, or at least on the way up.
        The two DECs weren't quite so resiliant, but after checking their dirty disks, they came back up as well.
        An IBM 3090 does not like to have it's power just cut off. It really doesn't. We ended up having issues and it took about 24 hours to return to normal operational status.
        The entire event was kind of cool to run through. Gave me a new respect for HP engineering. For many of our users, all they experienced was that their terminal froze for about 20 minutes, then continued on where it had stopped.
        I don't know if the paper hanger lost his job, but we lost several thousand user hours of time while they were sitting staring at their frozen terminals.
        It was certainly an interesting experience, and I'll never forget the Sound of Silence in the Computer Room.
        
        Parent Share
        twitter facebook
        
        Re: (Score:3)
        
        by sysrammer ( 446839 ) writes:
        
        Hello darkness my old friend...
    - Re: (Score:2)
      
      by __aaclcg7560 ( 824291 ) writes:
      
      Even a brand new IT graduates knows computers should be plugged into UPS devices that protect against this.
      I took a PC hardware course in college. The instructor had six "server" PCs on a cart plugged into a $5 power strip. He left the cart plugged after the last class for Friday. A electrical storm passed through over the weekend. Monday morning he found a blackened power strip and six dead PCs.
      Remember, kids, don't plug your "servers" into a $5 power strip and hope for the best.
      - Re:Did they try... (Score:5, Funny)
        
        by I'm New Around Here ( 1154723 ) writes: on Friday June 02, 2017 @11:27AM (#54534997)
        
        >
        Remember, kids, don't plug your "servers" into a $5 power strip and hope for the best.
        Yes, buy a Monster power strip for $50. It has gold plated vacuum tubes for effecient power control. ;^)
        
        Parent Share
        twitter facebook
    - Comment removed (Score:4, Insightful)
      
      by account_deleted ( 4530225 ) writes: on Friday June 02, 2017 @11:24AM (#54534959)
      
      Comment removed based on user account deletion
      
      Parent Share
      twitter facebook
    - Re:Did they try... (Score:5, Informative)
      
      by Drakonblayde ( 871676 ) writes: on Friday June 02, 2017 @11:32AM (#54535063)
      
      Not entirely true...
      So it depends on what kind of UPS you're employing. If it's the really big ones, you know, the ones the size of generators, then you don't plug stuff directly into them. They tend to be centralized and distribute power to PDU's that are in the racks themselves. The servers plug into the PDU's in the racks, and those PDU's have on/off switches. My fat ass has bumped the power switch on PDU's more than once trying to squeeze into tight spaces between racks. UPS's aren't employed to protect against human error, they're designed to protect against loss of main power.
      If you're data center is small enough, you can get away with UPS's mounted in the rack and plug your servers directly into them, but when you're talking about scale, that's just not feasible or cost effective.
      Somehow I doubt British Airways data center is of the 'couple cabinents in a colo variety' and they've probably got the big UPS setup
      Most likely the fault lies with whomever architected the data center. I'll bet either there's very little room between the racks, or the PDU's are mounted in a way they can be accidentally bumped (probably either mid rack or at the bottom). I personally have taken to mounting PDU's at the top of the rack on the backside just to minimize any potential human contact with them.
      
      Parent Share
      twitter facebook
      - Re:Did they try... (Score:5, Insightful)
        
        by ghoul ( 157158 ) writes: on Friday June 02, 2017 @01:29PM (#54536291)
        
        Managers get paid to take the blame and the stress while workers get paid to do the work.
        
        Parent Share
        twitter facebook
    - Re:Did they try... (Score:4, Informative)
      
      by lactose99 ( 71132 ) writes: on Friday June 02, 2017 @02:04PM (#54536645)
      
      Even a brand new IT graduates knows computers should be plugged into UPS devices that protect against this.
      And you'd be surprised how many shops think this knowledge is someone else's problem and subsequently don't add it to any server installation docs, then look for a scapegoat when systems go tits-up like this.
      
      Parent Share
      twitter facebook
  - Re: (Score:3)
    
    by DickBreath ( 207180 ) writes:
    
    In Open Compute [opencompute.org] I believe the power supplies wait random amount of time before applying power to the rails to fire up the load. What an idea. You auto-magically spread out the time of the start up load over a short time.
    - Re: (Score:3)
      
      by fuzzywig ( 208937 ) writes:
      
      Many BIOS(/EFI) have an option to delay the harddrive spin up, so they don't all demand spin-up power from the PSU at the same time.
    - Re: (Score:3)
      
      by Dunbal ( 464142 ) * writes:
      
      Many electrical companies bill industrial customers based on PEAK power consumption, so it's in your interest to spread the load as widely as possible.
Am I in the Matrix? (Score:2)

by DontBeAMoran ( 4843879 ) writes:

Because I'm having déjà vu [slashdot.org].
- Re: (Score:2, Insightful)
  
  by Anonymous Coward writes:
  
  The new article has more details.
- Re: (Score:2)
  
  by nedlohs ( 1335013 ) writes:
  
  If you can't see the difference in the two articles you have bigger problems than being in the matrix.
  - Re: (Score:2)
    
    by DontBeAMoran ( 4843879 ) writes:
    
    The difference is British Airways shifting the blame to someone else.
    - Re: (Score:2)
      
      by nedlohs ( 1335013 ) writes:
      
      Right, the story has been updated and so news websites (and sites that pretend to be news websites) post a new article about it. Slashdot is great at dupes, this isn't one though.
LOL (Score:2)

by nospam007 ( 722110 ) * writes:

Seems like this 'test' to see if the UPS would kick in didn't work.
So the CEO _should_ resign after all.
- Re: LOL (Score:5, Insightful)
  
  by haemish ( 28576 ) writes: on Friday June 02, 2017 @10:13AM (#54534161)
  
  Right. It's not the poor guy that turned off the power supply. It's the shit-for-brains managrrs who wouldn't let the engineers put in redundant power supplies and hired cheap lobour that had no clue how to architect for fault tolerance.
  
  Parent Share
  twitter facebook
  - Re: LOL (Score:5, Insightful)
    
    by thegarbz ( 1787294 ) writes: on Friday June 02, 2017 @11:05AM (#54534741)
    
    who wouldn't let the engineers put in redundant power supplies
    That's an interesting assumption. Have you seen anything even remotely indicating that the data centre didn't have redundant power? No amount of redundancy has ever withstood some numbnuts pushing a button. But i'm interested to see your knowledge of the detailed design of this datacentre.
    Hell we had an outage on a 6kV dual fed sub the other day thanks to someone in another substation working on a wrong circuit. He was testing intertrips to a completely different substation, applying some power to an intertrip signal, realising he hit the wrong circuit (A), he immediately moved to the one he was supposed to do (B), both in the wrong cubicle successfully knocking out both redundant feeds to a 6kV sub and taking down a portion of the chemical plant in the process.
    Not sure what's worse, managers who don't put in redundant power, or armchair engineers who just *assume* that they didn't because redundant power can't ever go out.
    
    Parent Share
    twitter facebook
    - Re: (Score:2)
      
      by account_deleted ( 4530225 ) writes:
      
      Comment removed based on user account deletion
    - Re: (Score:2)
      
      by gmack ( 197796 ) writes:
      
      The place I work at work would have been fine with that scenario. They have two separate power feeds per rack going to two separate UPS systems going to two separate generators.
      There is no one switch that would take out the entire facility
    - Re: LOL (Score:4, Insightful)
      
      by chispito ( 1870390 ) writes: on Friday June 02, 2017 @11:50AM (#54535235)
      
      Not sure what's worse, managers who don't put in redundant power, or armchair engineers who just *assume* that they didn't because redundant power can't ever go out.
      It isn't armchair engineering. The CEO should accept full responsibility because that's what it means to be at the top of the reporting chain when such a devastating preventable outage occurs. If he was misled by his direct reports, then he should fire them and take full responsibility for not firing them sooner. Maybe he resigns maybe he doesn't--the point is that he must own the failure, whatever the logical conclusion.
      
      Parent Share
      twitter facebook
      - Re: (Score:3)
        
        by Dunbal ( 464142 ) * writes:
        
        The CEO should accept full responsibility
        Hah, the CEO is probably trying to figure out how to give himself more stock options now that they're cheaper. These greedy fuckers can never think past their multi-million payouts.
- - Re: (Score:2)
    
    by sycodon ( 149926 ) writes:
    
    He should resign because he apparently relied in a single UPS.
Bright side (Score:3)

by SpaghettiPattern ( 609814 ) writes: on Friday June 02, 2017 @10:10AM (#54534127)

Floor got cleaned cheaply and everyone got home early. Long live outsourcing!
Of course I didn't RTFA! With respect to outsourcing there's no difference between strategic and daily tasks like cleaning and strategic planning. Both need to be done short and long term. I can understand outsourcing occasional tasks but daily and strategic stuff will always be needed. Outsourcing of those tasks is a sign of utterly bad management.

Share
twitter facebook
- Re: (Score:2)
  
  by avandesande ( 143899 ) writes:
  
  If you look at the timeline of events it's highly unlikely that outsourced individuals designed this fault.
- Keep your core competencies in house (Score:3)
  
  by raymorris ( 2726007 ) writes:
  
  What definitely needs to be done in-house is whatever your company is supposed to be good at. Ford designs and assembles cars - they shouldn't outsource the design and assembly of cars because that's what they DO - if they stop making cars, they are no longer doing anything and have no reason to exist. Ford is not in the business of making cleaning products, so they probably shouldn't make the cleaning products they use. They should outsource that, buying cleaning products from SC Johnson or someone. Ford i
Out of band (Score:2)

by MikeB0Lton ( 962403 ) writes:

I guess it cost too much to add monitoring and remote management.
N+1 guess not (Score:4, Insightful)

by silas_moeckel ( 234313 ) writes: <silas AT dsminc-corp DOT com> on Friday June 02, 2017 @10:13AM (#54534157) Homepage

So it was all running in a single DC with a single power bus? Plenty of room at real datacenters they need to stop running out of a closet somewhere.

Share
twitter facebook
Yeah, yeah... blame the contractor... (Score:5, Insightful)

by __aaclcg7560 ( 824291 ) writes: on Friday June 02, 2017 @10:14AM (#54534169)

This is human error because a contractor accidentally turned off a power supply that caused a world-wide outage? It should be operational error for allowing such a single-point of failure to exist.

Share
twitter facebook
- - Re: (Score:2)
    
    by __aaclcg7560 ( 824291 ) writes:
    
    Design error you probably mean?
    Operational error. Backup systems need to be periodically checked to see if they still working as designed. If the backup system got tested and failed to work, then it would then be a design error.
What the heck does this switch do? (Score:5, Funny)

by JoeyRox ( 2711699 ) writes: on Friday June 02, 2017 @10:14AM (#54534171)

No sure Bob - just flip it so that we can go get some lunch. I'm starving.

Share
twitter facebook
- Re: (Score:2)
  
  by interkin3tic ( 1469267 ) writes:
  
  Maybe it was more like this? [youtube.com]
- Re: (Score:3)
  
  by LordWabbit2 ( 2440804 ) writes:
  
  Heh, you joke, but we had a server in our server room no one was using any more, it was under powered (ie. old) we had all gotten our stuff off of it and thought we might as well shut it down. So we did. Got a call a couple days later from across the country, "WTF happened to our XYZ?". So we switched it on again. No one knew wtf they were doing on/with the server, and our manager didn't even try to find out, he just said "Well leave it on then". It's probably still sitting there quietly doing whatever
Stephen Stucker unavailable for comment (Score:3)

by RogueWarrior65 ( 678876 ) writes: on Friday June 02, 2017 @10:15AM (#54534183)

"Just kidding!"

Share
twitter facebook
Root Cause (Score:2)

by FerociousFerret ( 533780 ) writes:

I found the culprit: https://youtu.be/9WYGdstEVJQ?t... [youtu.be]
What they MEANT to say is that. . . (Score:2)

by Salgak1 ( 20136 ) writes:

. . . . the power was turned off by a FORMER contractor.
Then again, BA probably promoted him to executive VP.. .
Human Error? Sue but Still... (Score:2)

by bobbied ( 2522392 ) writes:

Human Error accounts for 99% of actual power outages in my experience. It's ALWAYS some idiot throwing the wrong switch, unplugging the wrong thing, yanking the wrong wires or spilling something in the wrong place...
You simply cannot engineer around stupid well enough to fix it, regardless of how hard you try..
That being said... For a mission critical system in a multi-million dollar company like BA where was the backup site in a different geographic location that was configured to take over in the not-so-
- This is the ultimate single point of failure. (Score:2)
  
  by SuperKendall ( 25149 ) writes:
  
  The first thing I think of is anything happening at tat location - flood, bomb, larger grid outage lasting more than a day or so - and BA is finished.
  Heck if you were a terrorist now you know exactly where to attack that would truly hose an entire company that brings in a lot of money (and people) to England...
- Re: (Score:2)
  
  by ghoul ( 157158 ) writes:
  
  They had an offsite DR. The DR was setup wrong and did not have the latest data so when they switched to it they started seeing wrong data and had to switch it off
  - Re: (Score:2)
    
    by HornWumpus ( 783565 ) writes:
    
    They thought they had DR. They were wrong.
    Somebody responsible should have signed off on the plans and routine testing schedule for that. It is a key job responsibility.
  - Re: (Score:2)
    
    by turbidostato ( 878842 ) writes:
    
    "They had an offsite DR. The DR was setup wrong and did not have the latest data so when they switched to it they started seeing wrong data and had to switch it off"
    In the kind of companies that outsource the hell out to save some pennies there are two and only two types of highly available systems:
    1) Active/Passive. When the active goes nuts the passive fails to start for whatever reasons (tightly coupled to the fact that the system was tested exactly once, in the happy path, when given the "operational"
- Re: (Score:2)
  
  by PolygamousRanchKid ( 1290638 ) writes:
  
  You simply cannot engineer around stupid well enough to fix it, regardless of how hard you try..
  Nothing can be made foolproof . . . because fools are so ingenious."
not the contractor's fault (Score:5, Insightful)

by ooloorie ( 4394035 ) writes: on Friday June 02, 2017 @10:20AM (#54534245)

When your business depends on your IT infrastructure like that, turning off the power to a single machine or data center shouldn't bring down your operation; that's just stupid and bad design. Good enterprise software provides resilience, automatic failover, and geographically distributed operations. Companies need to use that.
And they should actually have tests every few months where they do shut down parts of their infrastructure randomly.

Share
twitter facebook
- Re: (Score:2)
  
  by crow ( 16139 ) writes:
  
  Systems in a data center should have two different power systems. The contractor shut one of them down to do some work. That should have been fine. I would guess that the work was to replace or repair some of the power infrastructure. The most likely situation here is that the contractor switched off the wrong one, and the correct one was already off (possibly due to the failure for which the contractor was called in the first place, or else someone had already shut it off for him).
  Process errors like t
- Re: (Score:2)
  
  by h4ck7h3p14n37 ( 926070 ) writes:
  
  And they should actually have tests every few months where they do shut down parts of their infrastructure randomly.
  A place I used to work at had an outage at their main data center because of a scheduled test of the power system.
  The week prior they had an outage because the APC battery-backup system in the server room developed a short. One of the engineers flipped the bypass switch while the APC tech was fixing the problem.
  There was a diesel generator next to the building that was used in case the building itself lost its electrical connection and a power-on test was performed every two weeks.
  Sure enough, the power
Mental picture from the movie 'Airplane' (Score:3)

by DirkDaring ( 91233 ) writes: on Friday June 02, 2017 @10:26AM (#54534303)

of Johnny unplugging the extension cord from the wall and the lights on the runway going out. "Just kidding!"
https://datacenteroverlords.files.wordpress.com/2017/01/airplane.jpg

Share
twitter facebook
Yeah, sure (Score:2)

by Rik Sweeney ( 471717 ) writes:

This is just one step up from the cleaner killing a patient because they unplugged the life support machine to vacuum in the room.
Pull the other one, it's got bells on it.
Why the power went out is unimportant (Score:2)

by minus9 ( 106327 ) writes:

The more important question is why it took the best part of two days to get things up and running again.

As for the power outage - A UPS test to check if power transferred to battery/generator that failed maybe?
- Re: (Score:2)
  
  by HornWumpus ( 783565 ) writes:
  
  Ship a tape to the backup site and wait for it to restore. No online backups at that site.
A bigger boy did it and ran away... (Score:2)

by JonnyCalcutta ( 524825 ) writes:

Sounds like a load of baloney to me and really explains nothing. Sounds, in fact, like a cover up from someone who doesn't understand the implications of their lie.
It still doesn't explain why everything went down so catastrophically. Why was there only one power source? What about back up servers and other redundant systems? Why was it so easy for a contractor to switch the power off? Was he following procedure. What about redundancy? Why couldn't he just switch it back on again (I know, but if its such a
Comment removed (Score:5, Interesting)

by account_deleted ( 4530225 ) writes: on Friday June 02, 2017 @10:45AM (#54534523)

Comment removed based on user account deletion

Share
twitter facebook
- Re:How does one DR test in a 24/7 business? (Score:5, Insightful)
  
  by silas_moeckel ( 234313 ) writes: <silas AT dsminc-corp DOT com> on Friday June 02, 2017 @10:53AM (#54534597) Homepage
  
  You do it in production because none of it should cause a massive failure. They bought a DR site and failed to test it. Working at some big shops the DR site was prod every other quarter.
  
  Parent Share
  twitter facebook
- Re: (Score:3)
  
  by will_die ( 586523 ) writes:
  
  We actually test ours by failing over portions each month and making sure everything works.
  For a smaller place I worked which had a limited DR(not everything failed over) parts were tested on a monthly bases, everything was tested yearly with a planned failure that was also to ensure the users had training.
  Some DR stuff also now is really nice in that when you tell it to self-test it creates a separate network so you can test the installation at the COOP site.
- Re: (Score:3)
  
  by jader3rd ( 2222716 ) writes:
  
  How does one actually fail-over test things in production in a 24/7 business
  You eliminate any distinction between maintenance operations and DR. The redundant systems should behave the same during upgrade/patching of one of the nodes, a disk dying on one of the nodes, a node hosting active client connections has its NIC die, having a rack die, having the WAN cut, having the entire datacenter lose power, etc.
  If the underlying redundancy system doesn't significantly differentiate discretionary failover operations from DR failover situations, you can run a 24/7 system.
  See Exchange Dat
- Re: (Score:3)
  
  by Bob the Super Hamste ( 1152367 ) writes:
  
  Easily. Regularly switching to the backup site should be done as part of the day to day business operations. For example at my job I work with a company that will switch daily between the main and backup system. It doesn't hurt that the main and backup are running in a hot standby configuration and the backup can take over at a moments notice. They also have 2 additional systems for further levels of redundancy. One is a system that they do a system restore to each day (the previous backup of the main syste
Did they try to turn it off and on again? (Score:2)

by grumpy-cowboy ( 4342983 ) writes:

:)
You can only idiot-proof so much (Score:2)

by ErichTheRed ( 39327 ) writes:

It's good practice to make things so simple that no one could possibly mess them up. It works in programming - look at how many JavaScript frameworks abstract an already sandboxed development environment to a point where "signalling intent" is basically all the developer needs to do. Or in hardware -- we're using HPE servers and there is literally a "don't remove this drive" light that comes on when a drive fails in a RAID set. That had to be a customer-requested change after one too many data-loss events s
he is called to the CEO's office (Score:2)

by tommeke100 ( 755660 ) writes:

contractor: "so, I guess I'm pretty much done with this company right?"
CEO: "Not at all! We just spend 1 billion $ educating you!"
contractor in tears: "oh thank you"
CEO: "I was joking, dumbass. This is the real world. You're fired and we're going to sue you for 2 billion $".
Single point of failure? (Score:2)

by mspohr ( 589790 ) writes:

Shit happens and most competent companies plan for it by have redundant live backup systems.
I can't believe that BA didn't have a live backup system at another site to fail over to.
Really, this costs money but these cheap bastards don't seem to have a clue.
No HA? (Score:3)

by elistan ( 578864 ) writes: on Friday June 02, 2017 @11:33AM (#54535069)

Business critical systems should operate in an active/active high-availability scenario in at least two separate locations. That way the loss of any one node has zero effect except perhaps a transaction retry and reduced performance.

Systems of the next lower level of criticality should have real-time replication to a separate location, so that if a node fails the recovery time is simply what it takes to boot the replacement node.

A further lower levels of criticality you start getting into things like virtualization clusters to mitigate hardware failures supported by point-in-time backups to mitigate data failures. The IT department's Minecraft server can just be a spare desktop machine sitting on an admin's desk.

(There are additional considerations for all levels of criticality too, of course, like SAN volume snapshots, and backups too of course.)

Share
twitter facebook
Oops (Score:3)

by TheDarkener ( 198348 ) writes: on Friday June 02, 2017 @12:16PM (#54535487) Homepage

(Hopefully) an honest, albeit very consequential mistake. I've done the same thing when I was working on the backside of a server cabinet - the PDU was right there by my shoulder and I swiped it on accident. No UPS in the cabinet (a mistake not of my own but the ones who built it out). Fortunately everything came back on. Good thing to have BIOS settings to 'stay off' after a power failure (so you can turn them back on individually and not overdraw power). I feel bad for the guy who did this, it was probably his last day working there.

Share
twitter facebook
And we should believe this? (Score:3)

by whitroth ( 9367 ) writes: <whitroth@@@5-cent...us> on Friday June 02, 2017 @12:32PM (#54535619) Homepage

In our secure rooms, we have an EPO button. It's LARGE, red, and inside a cover that you have to lift to turn hit.
And this contractor turned off the *entire* power for an *entire* datacenter? Yep, yep, not our fault, not your fault, it's gotta be the fault of that guy over there pushin' a broom!

Share
twitter facebook
- Re:And we should believe this? (Score:4, Informative)
  
  by markana ( 152984 ) writes: on Friday June 02, 2017 @02:07PM (#54536699)
  
  We had an entire data center shut down this way. Facilities *insisted* that the BRB (Big Red Button) not have any sort of shroud or cover over it. Just in case someone couldn't figure out how to get to the button in a dire emergency.
  So one day, they've got a clueless photographer taking pictures of the racks. He was backing up to frame the perfect framing and... we'll, you can guess the rest.
  Now, the button has a shroud that you have to reach into to hit it, and non-essential personnel are banned from the rooms. Total cost of the outage (even with the geo-redundant systems kicking in) was over $1M.
  Just another day in the life of IT.
  
  Parent Share
  twitter facebook
And _more_ lies! (Score:3)

by gweihir ( 88907 ) writes: on Friday June 02, 2017 @01:00PM (#54535921)

Sure, that may have been the proverbial last drop. But the actual root-cause is that their systems were not able to cope with outages that must be expected. And the responsibility for that is straight with top management. Their utterly dishonest smoke-screen is just more proof that they should be removed immediately for gross incompetence.

Share
twitter facebook
Been there ... (Score:4, Interesting)

by CaptainDork ( 3678879 ) writes: on Friday June 02, 2017 @01:06PM (#54535981)

... I was walking behind the server rack and unknowingly brushed up against the power cord to the Novell 3.1 server.
Later, when my boss asked me for an outage report, I told her, "I wish you hadn't asked that."
I made damned sure that plug was tied to the server after that.

Share
twitter facebook
- Re: (Score:3)
  
  by bill_mcgonigle ( 4333 ) * writes:
  
  they didn't just switch over to their DR site.
  You forgot the mic drop.
- Re: (Score:2)
  
  by queazocotal ( 915608 ) writes:
  
  That doesn't help if there is one master switch, in case of (for example) fire, and he activated it.
  - Re: (Score:2)
    
    by __aaclcg7560 ( 824291 ) writes:
    
    That doesn't help if there is one master switch, in case of (for example) fire, and he activated it.
    More like a extension cord stretched across a busy walkway just waiting for someone to trip on it.
    - Re: (Score:2)
      
      by queazocotal ( 915608 ) writes:
      
      At the moment, there is no reasonable way to tell between various scenarios.
      It could go all the way from 'worker pressed big red button despite being told not to, signs telling him not to, and having signed an agreement not to', to 'worker followed what they believed was procedure and did what 99% of people would have done', to 'worker did precisely as instructed and are being scapegoated'.
      - Re:How is this a thing (Score:4, Funny)
        
        by Archangel Michael ( 180766 ) writes: on Friday June 02, 2017 @10:44AM (#54534517) Journal
        
        Worker: The sign says "Do not use"
        Manager: I don't care what it says, flip the switch
        Worker: That's a really stupid idea
        Manager: Do it, or you're fired
        Worker:
        Manager: Well, now you really screwed things up, you're fired!
        
        Parent Share
        twitter facebook
      - Re: (Score:2)
        
        by Required Snark ( 1702878 ) writes:
        
        There is one thing we do know: ultimately this was a management failure, not a tech/operations failure. A cascade failure from a single point is bad, but it inevitably follows from bad management. Read about the decision making before Fukushima or the Challenger disaster for examples. Someone always speaks at some point before it all goes up in flames, and they are ignored.
        We also know one other thing: no one up in management will accept responsibility. All upper managers will be shielded from personal res
        
        Re: (Score:2)
        
        by queazocotal ( 915608 ) writes:
        
        No, it isn't.
        Sometimes someone, despite proper training, management, and instruction does something that goes against all of that training, to the point that no reasonable person given the same instruction and training would have done the same thing.
        In some cases you actually do need emergency global 'off' switches that are never meant to be used in normal operation.
        
        Re: (Score:2)
        
        by account_deleted ( 4530225 ) writes:
        
        Comment removed based on user account deletion
        
        Re: (Score:3)
        
        by thsths ( 31372 ) writes:
        
        > In some cases you actually do need emergency global 'off' switches that are never meant to be used in normal operation.
        Yes, if you run a simple experiment, and there is the possibility for harm, a single red button is a good idea.
        But if shutting down the server room costs $100 000 000, then a single red button is not a good idea. Instead, you have two parallel power distribution system, with some physical separation, and there are two off switches. Of course there should be sign that explains how to us
  - Re: (Score:2)
    
    by sycodon ( 149926 ) writes:
    
    Switches such as that should be locked out, requiring multiple people to allow access.
    If you have a switch like that accessible so that just anyone can flick it off, you are an idiot.
    - Re: (Score:2)
      
      by queazocotal ( 915608 ) writes:
      
      'flick it off' may include 'opened the interlocks and keyed in the code as he believed he was doing the correct thing'.
      This could be a personal failure due to stupidity, a training failure, or he was in fact instructed to turn it off, and though he protested, is now getting scapegoated.
    - Re: (Score:2)
      
      by nedlohs ( 1335013 ) writes:
      
      If you have a switch like that accessible so that just anyone can flick it off, you are an idiot.
      If you don't you are probably in violation of the local fire codes. Though the 2011 updates to the NEC did remove the "shall be readily accessible at the principal exit door" language from the emergency power off requirements instead allowing "shall be located at approved locations readily accessible in case of fire to authorized personnel and emergency responders", so a bunch of jurisdictions will just be using

There may be more comments in this discussion. Without JavaScript enabled, you might want to turn on Classic Discussion System in your preferences instead.

Did they try... (Score:5, Funny)

Re: Did they try... (Score:5, Funny)

Re: (Score:2)

Re: Did they try... (Score:5, Interesting)

Re: Did they try... (Score:5, Insightful)

Re: Did they try... (Score:5, Interesting)

Re: (Score:3)

Re: Did they try... (Score:5, Funny)

Re: (Score:3)

Re: Did they try... (Score:5, Insightful)

Re: Did they try... (Score:4, Insightful)

Re: (Score:3)

Re:Did they try... (Score:5, Interesting)

Re:Did they try... (Score:5, Interesting)

Re:Did they try... (Score:5, Insightful)

Re: (Score:3)

Re: (Score:2)

Re: (Score:2)

Re: (Score:2)

Re: (Score:2)

Re:Did they try... (Score:5, Insightful)

Re: (Score:2)

Re:Did they try... (Score:5, Interesting)

Re: (Score:3)

Re: (Score:2)

Re:Did they try... (Score:5, Funny)

Comment removed (Score:4, Insightful)

Re:Did they try... (Score:5, Informative)

Re:Did they try... (Score:5, Insightful)

Re:Did they try... (Score:4, Informative)

Re: (Score:3)

Re: (Score:3)

Re: (Score:3)

Am I in the Matrix? (Score:2)

Re: (Score:2, Insightful)

Re: (Score:2)

Re: (Score:2)

Re: (Score:2)

LOL (Score:2)

Re: LOL (Score:5, Insightful)

Re: LOL (Score:5, Insightful)

Re: (Score:2)

Re: (Score:2)

Re: LOL (Score:4, Insightful)

Re: (Score:3)

Re: (Score:2)

Bright side (Score:3)

Re: (Score:2)

Keep your core competencies in house (Score:3)

Out of band (Score:2)

N+1 guess not (Score:4, Insightful)

Yeah, yeah... blame the contractor... (Score:5, Insightful)

Re: (Score:2)

What the heck does this switch do? (Score:5, Funny)

Re: (Score:2)

Re: (Score:3)

Stephen Stucker unavailable for comment (Score:3)

Root Cause (Score:2)

What they MEANT to say is that. . . (Score:2)

Human Error? Sue but Still... (Score:2)

This is the ultimate single point of failure. (Score:2)

Re: (Score:2)

Re: (Score:2)

Re: (Score:2)

Re: (Score:2)

not the contractor's fault (Score:5, Insightful)

Re: (Score:2)

Re: (Score:2)

Mental picture from the movie 'Airplane' (Score:3)

Yeah, sure (Score:2)

Why the power went out is unimportant (Score:2)

Re: (Score:2)

A bigger boy did it and ran away... (Score:2)

Comment removed (Score:5, Interesting)

Re:How does one DR test in a 24/7 business? (Score:5, Insightful)

Re: (Score:3)

Re: (Score:3)

Re: (Score:3)

Did they try to turn it off and on again? (Score:2)

You can only idiot-proof so much (Score:2)