Transportation IT Technology

British Airways IT Outage Caused By Contractor Who Accidentally Switched off Power (independent.ie) 262

An anonymous reader shares a report: A contractor doing maintenance work at a British Airways data centre inadvertently switched off the power supply, knocking out the airline's computer systems and leaving 75,000 people stranded last weekend, according to reports. A BA source told The Times the power supply unit that sparked the IT failure was working perfectly but was accidentally shut down by a worker.
  • Did they try...

    by 110010001000 ( 697113 ) on Friday June 02, 2017 @09:03AM (#54534069) Homepage Journal
    ...turning it on again?
    • by Anonymous Coward on Friday June 02, 2017 @09:16AM (#54534189)

      textbook example of a "career changing event"

      • The individual will merely have to change which contractor employs them. In this world, education is now proportional to the number of contractors you've been employed by. Sad. Terrible.
      • Re: Did they try... (Score:5, Interesting)

        by __aaclcg7560 ( 824291 ) on Friday June 02, 2017 @09:57AM (#54534673)

        textbook example of a "career changing event"

        Not necessarily. I've been through a few of these "career changing events" over the years. If I make a mistake, I step forward, take responsibility, and fix the problem (if I can). Managers are less likely to punish someone who comes forward immediately. In other cases where blame must be assigned, I've already documented my actions and sometimes the actions of those around me. If my CYA is stronger than everyone else's, I'm not going to get blamed for something that I didn't do.

        • by LS1 Brains ( 1054672 ) on Friday June 02, 2017 @10:33AM (#54535071)

          I've been through a few of these "career changing events" over the years. If I make a mistake, I step forward, take responsibility, and fix the problem (if I can).

          As an IT Manager/Director, THANK YOU. Everyone screws up at some point; it's what you do afterward that really matters.

      • by jellomizer ( 103300 ) on Friday June 02, 2017 @12:00PM (#54535919)

        You mean for the executive who didn't approve the hot offsite failover solution?

        You know, the stuff that normal large organizations have in place to make sure their business can stay operational.

      • Resume line-item:

          - Single-handedly tested entire DR operation of British Airways

    • Re:Did they try... (Score:5, Interesting)

      by Zocalo ( 252965 ) on Friday June 02, 2017 @09:28AM (#54534333) Homepage
      Apparently that was what led to the major outage turning into a prolonged major outage. The sequence of events now seems to be that the contractor turned off the power (presumably killing one or more phases), obviously leading to a large-scale hardware shutdown. Someone (most likely the same contractor) tried to restore power, as you would, only to find that the surge of all that hardware switching back on at the same time - which often means running close to maximum power draw - overloaded the system and caused physical damage to the hardware.

      While there's a lot of mocking of BA going on at the moment, that's actually a pretty easy situation to get into if you've expanded a DC - or even a regular equipment room - over several years without proper power management, and BA is far from the first company to be caught out. So, if you are responsible for some IT equipment rooms, here are two things to consider: what is the combined total power draw of all the equipment in each room at power-on (don't forget to include any UPS units topping up their batteries!), and what is the maximum power load that can be supplied to each room? If you can't answer both of those, or at least be certain that the latter exceeds the former in each case, then you've potentially got exactly the same situation as BA (a rough version of that check is sketched below).

      None of which excuses BA for not having the ability to successfully fail over between redundant DCs in the event of a catastrophic outage at one facility, of course.
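
      A minimal back-of-the-envelope version of that check, in Python. Every figure below (the inventory, the inrush multiplier, the room's supply) is a made-up illustration value, not anything from BA's facility:

          # Hypothetical power-headroom check: can this room absorb a cold start
          # of every device at once? All figures are illustrative assumptions.
          inventory = [
              ("rack A servers", 8000),          # (name, steady-state watts)
              ("rack B servers", 7500),
              ("storage array", 3000),
              ("network gear", 1200),
              ("UPS recharging its batteries", 2500),
          ]
          INRUSH_MULTIPLIER = 1.5      # assume cold-start draw ~1.5x steady state
          ROOM_SUPPLY_WATTS = 30_000   # assume what the feed/breakers can deliver

          steady = sum(watts for _, watts in inventory)
          cold_start = steady * INRUSH_MULTIPLIER
          print(f"steady-state draw:   {steady} W")
          print(f"cold-start estimate: {cold_start:.0f} W")
          print(f"room supply:         {ROOM_SUPPLY_WATTS} W")
          if cold_start > ROOM_SUPPLY_WATTS:
              print("WARNING: a simultaneous power-on could overload this room;"
                    " stagger the restart or add capacity.")
          else:
              print("OK: supply exceeds the worst-case cold-start draw.")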
      • Re:Did they try... (Score:5, Interesting)

        by Archangel Michael ( 180766 ) on Friday June 02, 2017 @09:42AM (#54534479) Journal

        There are two ways to engineer power for a datacenter: you can engineer for maximum efficiency/lowest cost, or you can engineer for redundancy/maximum safety. Penny pinchers always choose the former, and IT guys usually want the latter.

        Here is the real equation: expected loss = cost of a catastrophic event * likelihood of that event. If you think it's $100,000 * a 0.0000001 chance of catastrophe, you err on the side of savings. On the other hand, if it's a $25 fix against a near-100% chance of catastrophe, you err on the side of spending (a toy version of this calculation is sketched at the end of this comment).

        My guess is that they didn't account for business losses when plugging numbers into that (obviously oversimplified) formula. This is why you leave penny-pinching idiots out of the decision making: when all you see is cost and you don't properly evaluate the catastrophic losses in the event of a disaster, you're just an idiot that nobody should listen to.

        I get that there are budgets and such, but here is the one question I (the IT guy) ask the "business" decision makers: if you lost everything, how much would it cost you? Most people undervalue the data inside their databases and documents because they have no way of quantifying how much all that data is worth.

        Data is the biggest unaccounted-for asset of a business.
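
        A toy version of that expected-loss arithmetic; every figure below is an illustrative assumption, not BA's actual numbers:

            # Toy expected-loss comparison; all figures are made up for illustration.
            def expected_loss(catastrophe_cost: float, probability: float) -> float:
                """Expected cost of doing nothing = cost of the event * chance it happens."""
                return catastrophe_cost * probability

            # Huge loss but vanishingly rare: expected loss is a penny, so skip the spend.
            print(expected_loss(100_000, 0.0000001))   # ~0.01

            # What the penny pinchers forget: include the full business loss.
            # At an assumed $80M outage and a 1% annual chance, the expected loss
            # dwarfs the cost of proper redundancy.
            print(expected_loss(80_000_000, 0.01))     # 800000.0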

        • Re:Did they try... (Score:5, Insightful)

          by swb ( 14022 ) on Friday June 02, 2017 @10:06AM (#54534751)

          I think they also suffer from what I call "efficiency savings hoarding".

          If you have a process that requires 10 labor inputs to achieve and you buy a machine that reduces it to 5 labor inputs, your ongoing savings isn't really 5 labor inputs. You have to spend some of that labor savings in keeping the machine maintained and operational and investing in its replacement when it reaches end of life.

          When I started working for a company in 1993, they had some 40 secretarial positions whose workload was about half correspondence and scheduling meetings. By 2001, thanks to a widely deployed email/calendaring system, they had cut about 30 of those positions, because internal meetings could be planned automatically via email and the bulk of internal correspondence had shifted from paper memos to email.

          Yet when it came time to expand/replace the email system due to growth, it was seen as a "cost". I got the project approved by arguing that the cost of the replacement was actually being paid for by the savings realized from fewer administrative staff -- they still had ample savings (the project cost less than 1 administrative FTE). But the efficiency gain from the project wasn't free on an ongoing basis.

          Too many businesses gain efficiencies and savings from automation, but assume these are permanent gains whose maintenance incurs no costs.

          I have an existing client with a large, internally developed ERP-type system that supports a couple of thousand remote workers. The system is aging out (software versions, resources, and performance issues all identified by their own internal developer), and of course the owner is balking at investing in it, not realizing that the "free money" from the reduced in-office staff needed to process faxes, etc., needs to be applied to maintaining the system to keep achieving the savings.

          • by PPH ( 736903 )

            The overarching problem is quantifying the value of the functions being performed. You can compare the costs of a clerical staff versus that of an e-mail/calendaring system. But it's difficult to figure out in a business setting what these functions are worth.

            My boss would wail and cry over the inability to peruse the individual schedules of all of his minions for the purpose of calling yet another self-aggrandizing staff meeting. And he would assign a very high value to this function. But back in the 'old

      • As someone who doesn't work in IT I have to ask, what are the chances of other big organizations learning from this? Are we talking other airlines will make sure they avoid the exact same scenario but don't bother putting any additional resources to other IT disasters, or are we talking other companies laugh at BA's customers and then cut IT support?
      • They could easily be well under maximum power draw in normal usage since the switch-on surge can be multiple times larger.
    • "Holy Mother Of All Single Point of Failures, Batman!"

      Well, if the contractor is like some of the ones I know, he will justly say, "I was instructed to turn off the switch . . . not to turn it back on again!"

      Which brings up the obvious point: which British Airways employee was responsible for the work being done? Blaming the lowly contractor is a complete shift of the blame onto someone who obviously couldn't know any better.

      Or is British Airways an example of "Contractors . . . all the way down" . . .

    • by mspohr ( 589790 )

      Isn't that how you fix Windows computers?

  • Because I'm having déjà vu [slashdot.org].

    • Re: (Score:2, Insightful)

      by Anonymous Coward
      The new article has more details.
    • If you can't see the difference between the two articles, you have bigger problems than being in the matrix.

      • The difference is British Airways shifting the blame to someone else.

        • Right, the story has been updated, so news websites (and sites that pretend to be news websites) post a new article about it. Slashdot is great at dupes; this isn't one, though.

  • Seems like this 'test' to see if the UPS would kick in didn't work.

    So the CEO _should_ resign after all.

    • Re: LOL (Score:5, Insightful)

      by haemish ( 28576 ) on Friday June 02, 2017 @09:13AM (#54534161)

      Right. It's not the poor guy that turned off the power supply. It's the shit-for-brains managers who wouldn't let the engineers put in redundant power supplies and hired cheap labour that had no clue how to architect for fault tolerance.

      • Re: LOL (Score:5, Insightful)

        by thegarbz ( 1787294 ) on Friday June 02, 2017 @10:05AM (#54534741)

        who wouldn't let the engineers put in redundant power supplies

        That's an interesting assumption. Have you seen anything even remotely indicating that the data centre didn't have redundant power? No amount of redundancy has ever withstood some numbnuts pushing a button. But I'm interested to see your knowledge of the detailed design of this data centre.

        Hell, we had an outage on a 6 kV dual-fed sub the other day thanks to someone in another substation working on the wrong circuit. He was testing intertrips to a completely different substation. After applying power to an intertrip signal and realising he had hit the wrong circuit (A), he immediately moved to the one he was supposed to do (B) - but both were in the wrong cubicle, successfully knocking out both redundant feeds to the 6 kV sub and taking down a portion of the chemical plant in the process.

        Not sure what's worse, managers who don't put in redundant power, or armchair engineers who just *assume* that they didn't because redundant power can't ever go out.

        • I agree with you that the OP's assumption reads a lot of world-view into a statement that is fairly light on facts. However, there is probably also some truth to it. A single data center failure shouldn't have caused such an outage. There should be a redundant data center. So, yes, it's reasonable for somebody to accidentally shut off all the power to a data center, or for a natural disaster to wipe it out. But it's not reasonable for an operation the size of BA not to have redundant servers somewhere.
        • by gmack ( 197796 )

          The place I work at would have been fine with that scenario. They have two separate power feeds per rack, going to two separate UPS systems, going to two separate generators.

          There is no one switch that would take out the entire facility.

        • Re: LOL (Score:4, Insightful)

          by chispito ( 1870390 ) on Friday June 02, 2017 @10:50AM (#54535235)

          Not sure what's worse, managers who don't put in redundant power, or armchair engineers who just *assume* that they didn't because redundant power can't ever go out.

          It isn't armchair engineering. The CEO should accept full responsibility because that's what it means to be at the top of the reporting chain when such a devastating, preventable outage occurs. If he was misled by his direct reports, then he should fire them and take full responsibility for not firing them sooner. Maybe he resigns, maybe he doesn't -- the point is that he must own the failure, whatever the logical conclusion.

          • by Dunbal ( 464142 ) *

            The CEO should accept full responsibility

            Hah, the CEO is probably trying to figure out how to give himself more stock options now that they're cheaper. These greedy fuckers can never think past their multi-million payouts.

  • by SpaghettiPattern ( 609814 ) on Friday June 02, 2017 @09:10AM (#54534127)

    Floor got cleaned cheaply and everyone got home early. Long live outsourcing!

    Of course I didn't RTFA! With respect to outsourcing, there's no difference between daily tasks like cleaning and strategic tasks like planning: both need to be done, short term and long term. I can understand outsourcing occasional tasks, but daily and strategic work will always be needed. Outsourcing those tasks is a sign of utterly bad management.

    • If you look at the timeline of events it's highly unlikely that outsourced individuals designed this fault.
    • What definitely needs to be done in-house is whatever your company is supposed to be good at. Ford designs and assembles cars - they shouldn't outsource the design and assembly of cars because that's what they DO - if they stop making cars, they are no longer doing anything and have no reason to exist. Ford is not in the business of making cleaning products, so they probably shouldn't make the cleaning products they use. They should outsource that, buying cleaning products from SC Johnson or someone. Ford i

  • I guess it cost too much to add monitoring and remote management.
  • N+1 guess not (Score:4, Insightful)

    by silas_moeckel ( 234313 ) <silas&dsminc-corp,com> on Friday June 02, 2017 @09:13AM (#54534157) Homepage

    So it was all running in a single DC with a single power bus? Plenty of room at real datacenters; they need to stop running out of a closet somewhere.

  • by __aaclcg7560 ( 824291 ) on Friday June 02, 2017 @09:14AM (#54534169)
    This is human error because a contractor accidentally turned off a power supply and caused a world-wide outage? It should be an operational error for allowing such a single point of failure to exist.
  • by JoeyRox ( 2711699 ) on Friday June 02, 2017 @09:14AM (#54534171)
    No sure Bob - just flip it so that we can go get some lunch. I'm starving.
    • Heh, you joke, but we had a server in our server room that no one was using any more. It was underpowered (i.e. old), we had all gotten our stuff off of it, and we figured we might as well shut it down. So we did. Got a call a couple of days later from across the country: "WTF happened to our XYZ?" So we switched it on again. No one knew WTF they were doing on/with the server, and our manager didn't even try to find out; he just said "Well, leave it on then." It's probably still sitting there quietly doing whatever
  • by RogueWarrior65 ( 678876 ) on Friday June 02, 2017 @09:15AM (#54534183)

    "Just kidding!"

  • I found the culprit: https://youtu.be/9WYGdstEVJQ?t... [youtu.be]
  • . . . . the power was turned off by a FORMER contractor.

    Then again, BA probably promoted him to executive VP...

  • Human error accounts for 99% of actual power outages in my experience. It's ALWAYS some idiot throwing the wrong switch, unplugging the wrong thing, yanking the wrong wires, or spilling something in the wrong place...

    You simply cannot engineer around stupid well enough to fix it, regardless of how hard you try.

    That being said... for a mission-critical system in a multi-million dollar company like BA, where was the backup site in a different geographic location that was configured to take over in the not-so-

    • The first thing I think of is anything happening at that location - flood, bomb, larger grid outage lasting more than a day or so - and BA is finished.

      Heck, if you were a terrorist, now you know exactly where to attack to truly hose an entire company that brings a lot of money (and people) into England...

    • by ghoul ( 157158 )

      They had an offsite DR. The DR was set up wrong and did not have the latest data, so when they switched to it they started seeing wrong data and had to switch it off.

      • They thought they had DR. They were wrong.

        Somebody responsible should have signed off on the plans and routine testing schedule for that. It is a key job responsibility.

      • "They had an offsite DR. The DR was setup wrong and did not have the latest data so when they switched to it they started seeing wrong data and had to switch it off"

        In the kind of companies that outsource the hell out of everything to save some pennies, there are two and only two types of highly available systems:

        1) Active/Passive. When the active goes nuts the passive fails to start for whatever reasons (tightly coupled to the fact that the system was tested exactly once, in the happy path, when given the "operational"

    • You simply cannot engineer around stupid well enough to fix it, regardless of how hard you try.

      Nothing can be made foolproof... because fools are so ingenious.

  • by ooloorie ( 4394035 ) on Friday June 02, 2017 @09:20AM (#54534245)

    When your business depends on your IT infrastructure like that, turning off the power to a single machine or data center shouldn't bring down your operation; that's just stupid and bad design. Good enterprise software provides resilience, automatic failover, and geographically distributed operations. Companies need to use that.

    And they should actually have tests every few months where they do shut down parts of their infrastructure randomly.
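
    A minimal sketch of that kind of random-shutdown drill, assuming a hypothetical node list, health URL, and shutdown hook; a real drill would call your own orchestration layer rather than echo:

        # Hypothetical game-day drill: take down one random node, then check that
        # the service as a whole still answers. Node names, health URL, and the
        # shutdown hook are placeholders, not any real BA tooling.
        import random
        import subprocess
        import urllib.request

        NODES = ["dc1-app01", "dc1-app02", "dc2-app01", "dc2-app02"]  # hypothetical
        HEALTH_URL = "https://example.internal/health"                # hypothetical

        def shutdown_node(node: str) -> None:
            # Placeholder: substitute a call to your own orchestration or IPMI tooling.
            subprocess.run(["echo", f"pretend-shutdown {node}"], check=True)

        def service_still_up() -> bool:
            try:
                with urllib.request.urlopen(HEALTH_URL, timeout=5) as resp:
                    return resp.status == 200
            except OSError:
                return False

        victim = random.choice(NODES)
        print(f"game day: shutting down {victim}")
        shutdown_node(victim)
        print("service healthy" if service_still_up()
              else "FAILOVER DID NOT WORK: fix this before it happens for real")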

    • by crow ( 16139 )

      Systems in a data center should have two different power systems. The contractor shut one of them down to do some work. That should have been fine. I would guess that the work was to replace or repair some of the power infrastructure. The most likely situation here is that the contractor switched off the wrong one, and the correct one was already off (possibly due to the failure for which the contractor was called in the first place, or else someone had already shut it off for him).

      Process errors like t

    • And they should actually have tests every few months where they do shut down parts of their infrastructure randomly.

      A place I used to work at had an outage at their main data center because of a scheduled test of the power system.

      The week prior they had an outage because the APC battery-backup system in the server room developed a short. One of the engineers flipped the bypass switch while the APC tech was fixing the problem.

      There was a diesel generator next to the building that was used in case the building itself lost its electrical connection and a power-on test was performed every two weeks.

      Sure enough, the power

  • by DirkDaring ( 91233 ) on Friday June 02, 2017 @09:26AM (#54534303)

    of Johnny unplugging the extension cord from the wall and the lights on the runway going out. "Just kidding!"

    https://datacenteroverlords.files.wordpress.com/2017/01/airplane.jpg

  • This is just one step up from the cleaner killing a patient because they unplugged the life support machine to vacuum in the room.

    Pull the other one, it's got bells on it.

  • The more important question is why it took the best part of two days to get things up and running again.

    As for the power outage - maybe it was a UPS test, checking whether power transferred to battery/generator, that failed?
  • Sounds like a load of baloney to me and really explains nothing. Sounds, in fact, like a cover-up from someone who doesn't understand the implications of their lie.

    It still doesn't explain why everything went down so catastrophically. Why was there only one power source? What about backup servers and other redundant systems? Why was it so easy for a contractor to switch the power off? Was he following procedure? What about redundancy? Why couldn't he just switch it back on again (I know, but if it's such a

  • Comment removed (Score:5, Interesting)

    by account_deleted ( 4530225 ) on Friday June 02, 2017 @09:45AM (#54534523)
    Comment removed based on user account deletion
    • You do it in production because none of it should cause a massive failure. They bought a DR site and failed to test it. At some big shops I've worked at, the DR site was prod every other quarter.

    • We actually test ours by failing over portions each month and making sure everything works.
      At a smaller place I worked, which had a limited DR (not everything failed over), parts were tested on a monthly basis, and everything was tested yearly with a planned failure that also ensured the users had training.
      Some DR tooling is also really nice now in that, when you tell it to self-test, it creates a separate network so you can test the installation at the COOP site.
    • How does one actually fail-over test things in production in a 24/7 business?

      You eliminate any distinction between maintenance operations and DR. The redundant systems should behave the same during upgrade/patching of one of the nodes, a disk dying on one of the nodes, a node hosting active client connections having its NIC die, a rack dying, the WAN being cut, the entire datacenter losing power, etc.

      If the underlying redundancy system doesn't significantly differentiate discretionary failover operations from DR failover situations, you can run a 24/7 system.

      See Exchange Dat

    • Easily. Regularly switching to the backup site should be done as part of day-to-day business operations. For example, at my job I work with a company that switches daily between the main and backup system. It doesn't hurt that the main and backup are running in a hot-standby configuration and the backup can take over at a moment's notice. They also have 2 additional systems for further levels of redundancy. One is a system that they do a system restore to each day (the previous backup of the main syste
  • It's good practice to make things so simple that no one could possibly mess them up. It works in programming - look at how many JavaScript frameworks abstract an already sandboxed development environment to a point where "signalling intent" is basically all the developer needs to do. Or in hardware -- we're using HPE servers and there is literally a "don't remove this drive" light that comes on when a drive fails in a RAID set. That had to be a customer-requested change after one too many data-loss events s

  • contractor: "so, I guess I'm pretty much done with this company, right?"
    CEO: "Not at all! We just spent $1 billion educating you!"
    contractor, in tears: "oh thank you"
    CEO: "I was joking, dumbass. This is the real world. You're fired and we're going to sue you for $2 billion."
  • Shit happens, and most competent companies plan for it by having redundant live backup systems.
    I can't believe that BA didn't have a live backup system at another site to fail over to.
    Really, this costs money, but these cheap bastards don't seem to have a clue.

  • by elistan ( 578864 ) on Friday June 02, 2017 @10:33AM (#54535069)
    Business critical systems should operate in an active/active high-availability scenario in at least two separate locations. That way the loss of any one node has zero effect except perhaps a transaction retry and reduced performance.

    Systems of the next lower level of criticality should have real-time replication to a separate location, so that if a node fails the recovery time is simply what it takes to boot the replacement node.

    At further, lower levels of criticality you start getting into things like virtualization clusters to mitigate hardware failures, supported by point-in-time backups to mitigate data failures. The IT department's Minecraft server can just be a spare desktop machine sitting on an admin's desk.

    (There are additional considerations at every level of criticality, of course, like SAN volume snapshots and backups.)
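
    One way to make those tiers concrete is a small policy table. The tier labels, RPO/RTO targets, and example systems below are hypothetical illustrations of the comment above, not a standard:

        # Hypothetical criticality tiers. RPO = how much data you can afford to
        # lose; RTO = how long you can afford to be down.
        TIERS = {
            "tier 1: active/active across two sites": {
                "rpo": "zero (synchronous)",
                "rto": "seconds (transaction retry)",
                "example": "booking and check-in",
            },
            "tier 2: real-time replication": {
                "rpo": "seconds",
                "rto": "minutes (boot the replica)",
                "example": "internal line-of-business apps",
            },
            "tier 3: virtualization cluster + point-in-time backups": {
                "rpo": "hours",
                "rto": "hours",
                "example": "the IT department's Minecraft server",
            },
        }

        for tier, policy in TIERS.items():
            print(f"{tier}: RPO={policy['rpo']}, RTO={policy['rto']} ({policy['example']})")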
  • by TheDarkener ( 198348 ) on Friday June 02, 2017 @11:16AM (#54535487) Homepage

    (Hopefully) an honest, albeit very consequential, mistake. I've done the same thing when I was working on the back side of a server cabinet - the PDU was right there by my shoulder and I swiped it by accident. No UPS in the cabinet (a mistake not of my own but of the ones who built it out). Fortunately everything came back on. It's a good thing to have the BIOS set to 'stay off' after a power failure (so you can turn machines back on individually and not overdraw power). I feel bad for the guy who did this; it was probably his last day working there.
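
    A minimal sketch of the staggered power-on that setting enables. ipmitool and its "chassis power on" command are real, but the BMC hostnames, credentials, and delay below are hypothetical placeholders:

        # Hypothetical staggered restart: machines configured to stay off after a
        # power failure are brought back a few at a time so inrush stays bounded.
        import subprocess
        import time

        BMC_HOSTS = ["bmc-rack1-01", "bmc-rack1-02", "bmc-rack2-01"]  # hypothetical
        DELAY_SECONDS = 30  # let each machine's inrush settle before the next

        def power_on(bmc: str) -> None:
            subprocess.run(
                ["ipmitool", "-I", "lanplus", "-H", bmc,
                 "-U", "admin", "-P", "changeme", "chassis", "power", "on"],
                check=True,
            )

        for bmc in BMC_HOSTS:
            print(f"powering on via {bmc}")
            power_on(bmc)
            time.sleep(DELAY_SECONDS)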

  • In our secure rooms, we have an EPO button. It's LARGE, red, and inside a cover that you have to lift in order to hit it.

    And this contractor turned off the *entire* power for an *entire* datacenter? Yep, yep, not our fault, not your fault, it's gotta be the fault of that guy over there pushin' a broom!

    • by markana ( 152984 ) on Friday June 02, 2017 @01:07PM (#54536699)

      We had an entire data center shut down this way. Facilities *insisted* that the BRB (Big Red Button) not have any sort of shroud or cover over it. Just in case someone couldn't figure out how to get to the button in a dire emergency.

      So one day, they've got a clueless photographer taking pictures of the racks. He was backing up to get the perfect framing and... well, you can guess the rest.

      Now, the button has a shroud that you have to reach into to hit it, and non-essential personnel are banned from the rooms. Total cost of the outage (even with the geo-redundant systems kicking in) was over $1M.

      Just another day in the life of IT.

  • by gweihir ( 88907 ) on Friday June 02, 2017 @12:00PM (#54535921)

    Sure, that may have been the proverbial last drop. But the actual root cause is that their systems were not able to cope with outages that must be expected. And the responsibility for that lies squarely with top management. Their utterly dishonest smokescreen is just more proof that they should be removed immediately for gross incompetence.

  • Been there ... (Score:4, Interesting)

    by CaptainDork ( 3678879 ) on Friday June 02, 2017 @12:06PM (#54535981)

    ... I was walking behind the server rack and unknowingly brushed up against the power cord to the Novell 3.1 server.

    Later, when my boss asked me for an outage report, I told her, "I wish you hadn't asked that."

    I made damned sure that plug was tied to the server after that.
