Delta Air Lines Grounded Around the World After Computer Outage (cnn.com)
Delta Air Lines says it has suffered a computer outage throughout its system, and is warning of "large-scale" cancellations after passengers were unable to check in and departures were grounded globally. The No. 2 U.S. carrier said in a statement Monday that it had "experienced a computer outage that has impacted flights scheduled for this morning. Flights awaiting departure are currently delayed. Flights en route are operating normally." A power outage in Atlanta at about 2:30 a.m. local time is said to be the cause of the computer outage. CNN reports: "Large-scale cancellations are expected today," Delta said. While flights already in the air were operating normally, just about all flights yet to take off were grounded. The number of flights and passengers affected by the problem was not immediately available. But Delta, on average, operates about 15,000 daily flights, carrying an average of 550,000 daily passengers during the summer. Getting information on the status of flights was particularly frustrating for passengers. "We are aware that flight status systems, including airport screens, are incorrectly showing flights on time," said the airline. "We apologize to customers who are affected by this issue, and our teams are working to resolve the problem as quickly as possible."
Incompetent IT (Score:5, Interesting)
A power outage in Atlanta at about 2:30 a.m. local time is said to be the cause of the computer outage.
Kind of amazing they haven't figured out how to make their system redundant, distributed, and/or robust. It makes zero sense that a power outage in Atlanta should have any effect on a flight going from Salt Lake City to Seattle. If this was the first time something like this had ever happened I could see them being caught off guard but stuff like this is nothing new and multiple airlines have been affected. You would imagine that having a robust network would be job number 1 for their IT people since one failure like this can easily cost tens of millions of dollars.
Re: (Score:2, Insightful)
More than likely an underfunded IT department. IT people often know what's needed for a reliable system, but the higher-ups just see them as a cost center and won't provide them with a sufficient budget.
Re:Incompetent IT (Score:5, Insightful)
Probably the higher-ups who decided that redundancy wasn't required are long gone and doing something different now. They got to show their bosses how nicely they could cut so many costs, and probably collected a big bonus for the two quarters they were employed before moving on to the next job.
Re: (Score:2)
Probably the higher-ups who decided that redundancy wasn't required are long gone and doing something different now. They got to show their bosses how nicely they could cut so many costs, and probably collected a big bonus for the two quarters they were employed before moving on to the next job.
Would that have been before or after they pointed out that all these planes have two engines, and we could cut costs massively by removing one from each?
Re:Incompetent IT (Score:5, Funny)
"Johnson, get in here"
"Yes sir?"
"You said you apped this in the cloud. How does the cloud go down?"
"Well, er... "
"Where are the damn synergies? I was told there would be synergies!"
Re: (Score:3)
"Where are the damn synergies? I was told there would be synergies!"
Johnson: "Sir, the synergies are configured, just as you ordered. When one part of the system goes down, the whole system goes down. They work together that way, just like you asked."
Re: (Score:2)
Totally agree. On the one hand, airlines are not swimming in cash, so everything requires a tedious business case. On the other, it's a fact that many organizations require a major incident before they believe the birds of ill omen in IT.
Re:Incompetent IT (Score:4, Insightful)
On the contrary, after going through bankruptcies in recent years and shedding debt, pensions, etc., plus with the current low fuel prices, most airlines are currently swimming in cash.
Re: (Score:3)
Swimming in cash or not, if your entire enterprise hits the pause button, stranding thousands of people in places they don't want to be because your disaster recovery / business continuity plan failed, that's a universally bad thing, and an abject failure to plan for, or even recognize, the possibility of a multi-hour data center loss.
Someone fucked up.
Re:Incompetent IT (Score:5, Insightful)
A power outage in Atlanta at about 2:30 a.m. local time is said to be the cause of the computer outage.
Kind of amazing they haven't figured out how to make their system redundant, distributed, and/or robust. It makes zero sense that a power outage in Atlanta should have any effect on a flight going from Salt Lake City to Seattle. If this was the first time something like this had ever happened I could see them being caught off guard but stuff like this is nothing new and multiple airlines have been affected. You would imagine that having a robust network would be job number 1 for their IT people since one failure like this can easily cost tens of millions of dollars.
I wouldn't be so fast to lay this at the feet of IT.
I'm certain they wanted to make it robust, distributed and redundant, but that all costs money. When PHBs with MBAs see IT as a cost centre, they see all this redundancy as "waste" to be cut back. Budgets are reduced and so are capabilities.
This is the kind of stupidity I see from American companies all the time. Here in Europe, computer downtime like this for a mere hour costs millions of pounds for an airline as they become liable not just for refunds, but also for extra costs as travel insurers pay large sums of money to get people where they're supposed to go. The reinsurers will then send their lawyers to present the airline with a nice bill.
Re:Incompetent IT (Score:5, Insightful)
For any IT discussion on slashdot, as time T increases, the probability of a neckbeard blaming "MBAs" approaches 1
A version of Godwin's law (Score:4)
For any IT discussion on slashdot, as time T increases, the probability of a neckbeard blaming "MBAs" approaches 1
Yeah, it's sort of a riff on Godwin's law. Once you blame "MBAs" for a problem, you have no fact-based arguments left, so the argument is over and you've lost it. It's basically scapegoating and tribalism at its worst.
Management is a pretty easy target. Management has to make decisions with imperfect information (like playing poker) whereas engineers are used to working with greater certainty (more like playing chess), and it's hard for many of them to wrap their heads around the difference. Engineers who don't actually know any better seem to think MBA is shorthand for management incompetence. Never mind that an MBA is a degree, not a person or even a category of people. It's as stupid and incoherent as saying CS = incompetent programmers. I happen to be an engineer but I'm also a certified accountant. I have degrees in both engineering and business and I use both in my day job running a manufacturing plant. I can say with absolute confidence that there are just as many engineering school graduates who are bad at their jobs as there are business school graduates who are bad at their jobs. I run into both routinely. And just as many who are good at their jobs as well. Just because you may have run into some of the bad ones doesn't grant you the right to paint the rest with the same brush.
Re: (Score:3)
If you want to throw blame around...let's give it to the 1% crowd.
I'll even justify it...watch!
Redundancy and proper backup cost $. Odds of occurrence are quite low, and pointy-haired people have this habit of cutting budgets to meet spending targets and save money, and all that. Why? Oh, because their bosses say so...the execs and board. Why? Because the company can get an extra $xyz in EPS by cutting budgets back and taking the low % risk on themselves in the short-ish term.
So yeah, we close down the
Re: (Score:3)
Re: (Score:3)
Re: (Score:3)
As a mechanical engineer in the construction industry, I can testify that working with imperfect information is the normal situation for us. And that "management" often requires engineers to boil down extremely imperfect and uncertain cost data into a singular "hard" n
Re: (Score:3)
Re: (Score:2)
Sounds like the case of that one place that let their diesel tank run dry just from the every-X-days test runs, and then when they really needed it, it had run dry because no one had set up an auto-refill contract.
Re: (Score:3, Insightful)
From the sound of things, I'd say the cabbie played you. I wouldn't be a bit surprised if this is a scam he runs regularly.
Re: (Score:2)
And if you didn't have a bag, then what would the cab do, call the cops?
What about the rules saying that they must take cards? That's broken because they don't want to pay the fees.
Re:Arguing for resources is part of the job (Score:5, Insightful)
Bullcrap. A boo-boo this massive is BY DEFINITION a management fuck-up. It is management's [only] job to ensure all departments are doing their jobs competently. They don't get to say "well gosh, engineering told us they knew what they were doing". Yeah, it isn't EASY, but it's why they get the obscene compensation levels.
Re:Arguing for resources is part of the job (Score:4, Insightful)
Re: (Score:3)
It couldn't possibly be that they predicted exactly this and presented it clearly to upper management who then decided they could get a really fat bonus for keeping costs down and deploy the golden parachute before the inevitable disaster.
Re: (Score:2, Interesting)
AFAIK pretty much all airlines run scheduling software from a single company (I remember reading an article about how Southwest moved from an in-house system to the same as everyone else due to complexity issues), so it's not so much the airlines but this 3rd party that seems to have somewhat fragile software.
Still though, this begs to be something hosted in a datacenter/cloud with an online shadow in the background of another location replicating everything and ready to take over at a moment's notice, or something similar. Pretty standard these days, but airlines are so tight for money that they end up sometimes shooting their own feet...
Record profits (Score:2)
Still though, this begs to be something hosted in a datacenter/cloud with an online shadow in the background of another location replicating everything and ready to take over at a moment's notice, or something similar. Pretty standard these days, but airlines are so tight for money that they end up sometimes shooting their own feet...
Airlines are making record profits [cnn.com] these days. Arguing that they don't have the money to properly set up the system that runs the whole company is ridiculous.
Re: (Score:2)
Most of the crying and bankruptcies we saw before were actually just a scam to shaft long time employees on their pensions. They're fine now and they were fine then, it's just that now there's more money for executive bonuses and the hookers and blow fund is overflowing.
Re: (Score:2)
Dunno about the scheduling package, but most airlines contract with one of the major providers of reservations management services. At the time I worked in the field (little more than 10 years ago
Re:Incompetent IT *management* (Score:2)
I'll bet you dollars to donuts that the IT folks squealed like stabbed piglets that they needed a backup system alternative.
But the management chain did not want to swallow the costs.
Who knows? Maybe the costs of dealing with this fiasco will be cheaper than having a backup system . . . ?
Either way IT looks bad (Score:2)
I'll bet you dollars to donuts that the IT folks squealed like stabbed piglets that they needed a backup system alternative.
I'll take that bet. I'm betting they either overlooked something technical or they are just really bad at making financial arguments. Since a key part of engineering is being able to justify what you want to do in financial terms, my guess is that they just weren't very good at their job. Justifying equipment to prevent an outage that would cost millions of dollars per minute is trivial.
Who knows? Maybe the costs of dealing with this fiasco will be cheaper than having a backup system . . . ?
Maybe but I doubt it. Given that Delta and other airlines are experiencing record profits, it's hard to see them not un
Re: (Score:2)
Who knows? Maybe the costs of dealing with this fiasco will be cheaper than having a backup system . . . ?
By the time the Bean Counters get done? Depend on it. The books aren't going to show the future revenue lost because people swore off Delta in disgust, and anyone who depends on surveys to obtain intangible data is going to get what they deserve. Even allowing for the fact that many people don't want to waste time on a survey to begin with, you can't survey people who thought "Delta? Those screwups?" and never even considered the company. Well you can, if you're into blanket surveys, but those are worth even
Re:Incompetent IT (Score:5, Insightful)
Kind of amazing they haven't figured out how to make their system redundant, distributed, and/or robust. It makes zero sense that a power outage in Atlanta should have any effect on a flight going from Salt Lake City to Seattle. If this was the first time something like this had ever happened I could see them being caught off guard but stuff like this is nothing new and multiple airlines have been affected. You would imagine that having a robust network would be job number 1 for their IT people since one failure like this can easily cost tens of millions of dollars.
Scaling out is easy if you're Facebook or Google and nobody cares about a perfectly consistent truth. If you run transaction processing like airline tickets, people damn well want to know if their ticket got booked and Delta wants to know it got paid; they want ACID compliance, not "eventual consistency" NoSQL. That usually leads to mainframes and 99.99999% uptime systems with redundant power, network links etc., not clusters and distribution. Maybe also a hot failover next to it hooked up by a fat pipe. But if shit hits the fan big time in the data center, it goes down. Doesn't look like it took them *that* long to scramble what I assume is their cold backup online.
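To make the ACID point concrete, here's a minimal sketch (in Python, with a hypothetical schema and made-up fares, nothing like any airline's actual system) of why a booking wants a real transaction: either the seat count is decremented and the ticket row written, or neither happens.

import sqlite3

# Minimal ACID-style booking sketch: hypothetical schema, illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE flights (flight TEXT PRIMARY KEY, seats_left INTEGER)")
conn.execute("CREATE TABLE tickets (flight TEXT, passenger TEXT, paid_usd REAL)")
conn.execute("INSERT INTO flights VALUES ('DL123', 2)")
conn.commit()

def book(passenger, flight, fare):
    """Reserve a seat and record the payment atomically: both rows or neither."""
    try:
        with conn:  # opens a transaction; commits on success, rolls back on exception
            cur = conn.execute(
                "UPDATE flights SET seats_left = seats_left - 1 "
                "WHERE flight = ? AND seats_left > 0", (flight,))
            if cur.rowcount == 0:
                raise RuntimeError("sold out")
            conn.execute("INSERT INTO tickets VALUES (?, ?, ?)", (flight, passenger, fare))
        return True
    except Exception:
        return False  # nothing was written; the passenger knows the booking failed

print(book("Alice", "DL123", 350.0))  # True
print(book("Bob", "DL123", 350.0))    # True
print(book("Carol", "DL123", 350.0))  # False: sold out, no partial write

An eventually consistent store can't give you that last guarantee cheaply, which is part of why this workload keeps landing on centralized, transactional systems.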
The passengers aren't happy, but hey, sometimes shit happens with planes or crew or airports or whatnot, leading to delays and cancellations. I've had a rescheduled flight and a night in a hotel because KLM got delayed and weren't allowed to lift off because the destination airport was closing; it sucks, but this is a fact of life for airlines. It becomes a big story because it happened to lots of people at once, but over, say, a year, how big a deal is it really? I'm sure they'll do a post mortem, but I'd be surprised if they moved away from a centralized architecture.
Re: (Score:3)
I know people get upset about these kinds of things, and airlines have really high public exposure to failure. But processes do fail. I don't work for Delta and don't have any affiliation with them. But I work in the sector and have felt the sting of system failure. Don't be quick to judge; hindsight is 20/20. An example of what could have caused this is a complex network + storage device failure. It is entirely reasonable for devices that never get turned off to fail to come back up if they ever lose power.
Re: (Score:2)
I have consulted with a large airline, can't remember if they are #1 or #2 right now, and believe me, it is a very very complex system.
We were brought in to reduce the complexity and increase the resiliency in case of a disaster/failure.
The current environment, which is geographically paired mainframes with mid-range "helper" apps and data caches is about the best it can be.
Getting everything coordinated from meals ordered, drinks loaded, fuel, baggage, load balance plan, seating, payments, etc, is incredib
Re: (Score:2)
It doesn't have any impact on the flight that is already in the air. However once the plane lands the computer has the instructions for where the plane is going next, how much fuel to put into it, where to route the luggage and any cargo that was in it, what to load into the plane for the new trip, what supplies to replenish, what passengers are supposed to get on board, who is supposed to work on the plane, etc. Without all that information that plane is grounded.
Re:Incompetent IT (Score:4, Insightful)
One day the power went out. The UPS kicked in. Power usually came back within a couple minutes so I kept working. After about 10 min, the UPS began warning it was nearly drained. So I shut down the desktop and switched to my laptop. Unfortunately I hadn't charged it so I got a low battery warning after about an hour. I lugged out the car battery, clamped on the leads for the inverter, plugged the laptop into the inverter, and fired it up. I was back in business again.
Got on the laptop, logged in to work. 30 seconds later the Internet went down. No cable TV as well. The battery keeping the cable company's equipment powered must've died.
You can make all your systems redundant, distributed, and robust. But unless you control all the network lines between you and all the places you need to communicate with, you're not in total control over the reliability of the system. (And if you're curious, I was without power for 3 days. I had to move my refrigerator's contents outside to keep them cool since it was winter, and use a wood stove to keep the house warm and cook my meals. I dropped plans to buy a generator since there was no point if my Internet connection would only last about 90 minutes.)
Re: (Score:3)
As far as I can tell the surface is a great machine. I've known like 50 people who own them and they've all said they love it.
Now tell me how many of them were aircraft pilots, air stewards or the likes.
Maybe you just have an irrational bias against anything Microsoft because you don't like some things about them?
I have a very rational bias against silly business decisions, mr. Coward.
I am pretty sure that the money spent on those 11,000 tablets could have been better spent on backup servers or other essential IT equipment, not on something that looks like a pure marketing decision.
That's about $100 Million per day in lost revenue (Score:3, Interesting)
Re: (Score:2)
You would think they would have a backup for the backup power. But like someone earlier said, this outage sounds suspicious.
Or if you are down for 2 days ($200 million), and the cost of having a fully redundant system is more than $200 million (equipment, people, process, ...), then from a business standpoint it may make more sense to just accept an occasional outage.
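Of course, the comparison should be against the probability-weighted loss, not against the worst case as if it were certain. A quick break-even sketch (both figures are invented assumptions, building on the parent's 2-day/$200M example):

# Break-even sketch: how likely does a 2-day outage have to be, per year, before
# paying for full redundancy beats just eating the occasional outage?
# Both figures below are invented assumptions for illustration.
loss_per_outage = 200e6            # 2 days at roughly $100M/day, per the parent's example
redundancy_cost_per_year = 30e6    # hypothetical annualized cost of the redundant setup

break_even_prob = redundancy_cost_per_year / loss_per_outage
print(f"Redundancy pays off if P(major outage in a year) > {break_even_prob:.0%}")  # 15%

Whether that yearly probability is plausible is exactly the argument the bean counters and IT end up having.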
Report: Fire destroyed generators (Score:5, Informative)
According to the flight captain of the JFK-SLC flight this morning, a routine scheduled switch to the backup generator at 2:30am caused a fire that destroyed both the backup and the primary. Firefighters took a while to extinguish the fire. Power is now back up and 400 of the 500 servers have rebooted; they're still waiting on the last 100 before the whole system is fully functional.
Re:Report: Fire destroyed generators (Score:5, Insightful)
Here's the thing that amazes me.
500 servers.
The airline runs on 500 servers.
I was part of an early social networking site that, at its peak, had 20M users, with about 10K actively using the site at any given moment. We ran with 200 servers and had really very excellent render time (this was getting on to a decade ago, and if our page loads ever got above 1 second it was considered a near crisis; our email/messaging system, that I wrote, handled 150M messages per day). It just can't be that hard to run an airline site compared to running a web site that peaked at Alexa 100. They need 500 servers? Five HUNDRED servers? And with the resources of a multi-billion dollar company, they're STILL ALL IN ONE LOCATION?
They need a new IT team. Or a new management to give them the support they need.
Re:Report: Fire destroyed generators (Score:4, Insightful)
"with about 10K actively using the site at any given moment."
You actually think Delta only has 10k actively using their systems at any given moment? They probably have that many ticket counter staff logged in, not even counting customers, technicians, pilots, and so on.
Yeah, I get the 'why is your backup in the same building as your primary', but they probably need 500 servers.
Re: (Score:3)
Calm down. Your social media site wasn't flying a half million people around the world in pressurized aluminum cans every day.
Not even counting future travel reservations or queries, how many DB transactions do you think they handle per passenger per day alone? And none of that counts any other potential transactions, such as service info, flight data such as aircraft telemetry, employee data, regulatory information and so on.
500 servers sounds almost too low, especially when you consider that probably mo
Re:Report: Fire destroyed generators (Score:5, Insightful)
Re: (Score:3)
What exactly is "a server" though? A server can be anything from a single processor with a reasonable amount of memory to many multi-proc multi-core beasts with more memory than most people have disk space. Toss in virtualization and does 1 server = 1 physical machine, or 1 server = 1 virtual machine?
Being an older airline, I'd be surprised if there wasn't one or more large mainframes in the mix as well, something your social networking site probably didn't have.
Re: (Score:2)
Re:Report: Fire destroyed generators (Score:5, Insightful)
A well-maintained ATS (automatic transfer switch) should be able to function flawlessly for many, many years (like 20 years). To have faulted so badly that it took out the whole switch (which would definitely make the primary and generator feeds inaccessible) sure sounds like deferred or non-existent maintenance to me.
Re: (Score:2)
At 500 servers, that's not really a whole lot of data center real estate. A well-built DC would not have the A and B bus gen sets next to each other, though all bets are off once you get some overzealous firefighters on site. Having no DR setup in place is laughable.
Re: (Score:2)
Re: (Score:2)
It's common; it's also a bad idea. Granted, getting utilities to feed from 2 substations via diverse paths is a PITA unless the location was picked for that purpose. While I know it's all too common, it's far better to split utility and run a gen set (or sets) per UPS and keep physical separation between them, thus their own ATS gear. None of that matters if the fire trucks roll in and insist on everything being shut down.
Re: (Score:2)
My question would be why the primary and secondary generators were placed close enough that a fire in one would so easily affect the other. I get that there are some serious temptations, including not wanting to run main power feeds very far, or wanting shared fuel storage.
But the kind of proximity that would pose a dual generator fire risk seems like a bad idea.
Ironically, this even crossed my mind at a year old data center I was at last week. Both backup generators were fairly close together and I wondered what kind
Re:Report: Fire destroyed generators (Score:5, Informative)
The failure rate of ATSs is pretty low (when they're maintained), so it often becomes a value engineering decision during design. Yes, you could have each generator connect via its own ATS, thus distributing the risk, but in so doing you increase your construction costs, increase your maintenance costs, etc. The bean counters don't like that, and it becomes hard to convince them that it's worth it when you can't come up with statistical proof that a failure of the ATS is likely.
Re: (Score:2)
I was going on a previous post's claims of a generator fire, rather than an ATS failure.
I would think an ATS failure resulting in fire would be pretty darn hard to recover from in a timely fashion due to what I would expect would be some major electrical rework to replace the ATS, housing, and feeds, and related switchgear.
I would guess that a "modern" data center design would isolate these components enough that even if the ATS melted to slag in place it would be a matter of just replacing the ATS. At a
Re: (Score:3)
Once the fire dept is onsite, all bets are off. They will kill power from the other generators etc. to ensure crew safety. This is where prep is key, so that they feel safe working an electrical fire without killing all power.
Re: (Score:2)
The IL Tollway is loading I-90 up with backup power (Score:2)
The Illinois Tollway is loading I-90 up with backup power all along the new smart highway section; how redundant is that system? If it fails, people can end up with free tolls.
Re: (Score:2)
From the reports I've been seeing, it wasn't the ATS that failed, but rather a generator that caught on fire -- and in order to extinguish the fire safely, they had to cut commercial power.
Freak accidents like that happen. But what also happens is that companies that big invest in redundant systems in geo-redundant locations. What happens if a tornado, sharknado or other natural disaster happens and takes out the physical servers? Does Delta just cancel flights for the next month while they rebuild?
Re: (Score:2)
It's a servernado. Just hope that your data is encrypted, or it may be raining identity theft.
Re: (Score:2)
Re: (Score:2)
I should have seen this before my original posting. It makes perfect sense why they have been down so long. I've experienced this as well only it was at 16:00 instead of 02:00. *sad face*
Re: (Score:2)
Not really. Failing to do thorough testing carries excessive risk. It would be very rare indeed for proper thermal imaging not to have caught this before it became a fire. That takes humans, versus a control panel that runs the gen sets once a week, etc.
Re: (Score:2)
o/~ Because we're Delta airlines... (Score:2)
...and life is a fucking nightmare o/~
(https://youtu.be/vzeOsEkzeA0
John Mulaney's stand up bit on Delta. It's worth it.)
Mega lag: They are back in the air again! (Score:2)
http://www.tagesschau.de/wirts... [tagesschau.de]
Mainframes in the airlines (Score:3)
Last time I worked with the airline industry, they were still heavily reliant upon mainframe systems. That means putting redundant equipment at diverse datacenters is more costly. It's not like spinning up a new rack of x86 VMWare servers.
Re: (Score:3)
Many planes are leased or rotated off of budget after a certain maintenance schedule. Airlines run very thin profit margins despite how it may appear. Think about all the choices you have when flying. The Northwest Airlines portion of Delta used to run mainframes in Minnesota. I don't know what they use in Atlanta. Mainframes can be much more efficient than a bunch of Oracle/Microsoft DBs running on VMware. It isn't a trivial task to fail over to DR for most companies. One of the scariest things is DB syn
TIme to move the servers (Score:2)
Minnesota seems like a good place to house them....
Re: (Score:2)
IIRC, NWA, which was merged into Delta a few years back, had a backhoe outage when a fiber trunk got cut.
For those claiming bad managers and saving money: (Score:4, Interesting)
Most of y'all probably don't know what you're talking about. Here's what's going to happen:
1) Delta will file a loss-of-business / data system failure claim after things are stable again
2) They'll haggle with their insurer long after this little story is forgotten (and yeah, lots o' heartache today, but it's still probably going to be little.)
3) Delta will get a settlement of some dollar amount
4) Some bean counter will eventually tally the cost of that policy versus the payout versus how much all those redundant backups would have cost. The accountant will most likely conclude that it was a smart idea to have bought that insurance policy and NOT paid out the multimillions of dollars IT was asking for in redundant systems.
5) The insurance company will note the payout as a blip on its financials (probably already expected by the actuaries.) Insurance company will keep making profit.
The little air traveller is screwed and blued, but Delta and its insurer will keep flying. Doing business today without a data loss rider on your business insurance would be the really stupid idea, much more so than wasting money on redundant systems that are more expensive than said rider.
Re: (Score:3)
Accountants don't have a good idea of lost business opportunity or lost customers.
So while the basics may make financial sense, that doesn't actually mean it was a good idea.
Surface vs Actual (Score:2)
While on the surface it may appear their IT department is "incompetent", as one person pointed out, other factors could have contributed to the outage: management not approving proper tests, or not approving another datacenter in a completely different location; improper maintenance on the generator(s). And while IT may request that things be done or placed a certain way, that doesn't mean the facilities team cares or understands why; they may just do it their own way anyway. Like why have two generators located right next to each other?
Sounds like a problem with flight planning (Score:5, Informative)
I used to work on one of these systems.
The flight planning system takes inputs from several sources - weather forecasts, notices about airspace closures, etc. (NOTAMs), and booking info - and creates an optimal flight plan for the aircraft.
A modern airline doesn't have enough flight planning staff to take over manually if the system fails, so if your flight planning goes out, your fleet is gradually grounded.
The large number of servers is due to the optimization problem. You need to take into account the flight conditions and fuel costs in different locations in order to decide your route, altitude, and fuel loading. Since fuel is a huge percentage of the operating cost of the airline, it pays to invest a little extra computing power into optimizing these and save a bit of fuel on each flight.
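As a toy illustration of the kind of tradeoff being weighed (the routes, winds, and burn rates below are invented, and a real flight-planning engine optimizes over vastly more variables), a longer routing can still win if the forecast winds favor it:

from itertools import product

# Toy flight-plan optimizer: pick the route/altitude pair with the lowest fuel burn.
routes = {"direct": 1000, "north": 1060}         # route -> distance in nautical miles
burn_kg_per_nm = {330: 3.2, 370: 2.9, 410: 2.8}  # flight level -> fuel burn per nm
tailwind_nm_saved = {                             # forecast winds as effective distance saved
    ("direct", 330): 0, ("direct", 370): 10, ("direct", 410): 5,
    ("north", 330): 20, ("north", 370): 60, ("north", 410): 40,
}

def fuel(route, alt):
    return (routes[route] - tailwind_nm_saved[(route, alt)]) * burn_kg_per_nm[alt]

route, alt = min(product(routes, burn_kg_per_nm), key=lambda ra: fuel(*ra))
print(f"Best plan: route={route}, FL{alt}, est. fuel {fuel(route, alt):,.0f} kg")

If even a few tens of kilograms of fuel per flight are saved, times the roughly 15,000 daily flights mentioned in the summary, the extra compute bill looks cheap.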
Our system had lots of redundancy but, with all the data feeds, there are lots of moving parts. It's not hard to imagine a scenario where, for example, you get everything transferred over to your disaster recovery site, but for some reason the weather feed isn't coming in and you can't make flight plans.
Paperless Tickets (Score:5, Interesting)
This story brought to you courtesy of paperless tickets. Yes they are cheaper, yes it is simpler if people can print their own tickets, but the IT has to be up and running.
I remember an airline IT outage back in September 2004; there was a bug in the OS's error-handling routine for a particular class of error. This had all been tested with this particular OS level and had worked, but they had been forced to change the OS configuration to accommodate some new software and the bug was in place. Moving to new discs required a reboot, and an additional configuration error caused problems. If it had been fixed within (I think) 90 minutes all would have been fine. The outage was 8 hours.
Passengers turned up at the airports with their paper tickets and were allowed to board. Any pre-allocated seating was ignored. People were laughing about flying the way things used to be, a good time was had by most.
Then came paperless tickets. The next outage had effects more like those we see in this case.
Cue the Cloud Consultants in 3, 2, 1.... (Score:2)
I guarantee the cloud infrastructure guys are salivating at the opportunity to convince the MBAs to ditch Delta's data center. What they won't mention is how much it would cost to actually implement instant failover capability in a cloud environment. I'm not anti-cloud, but I do think a business as large as Delta isn't going to see a lot of cost savings over what they're paying now for equipment. Microsoft and Amazon don't give away capacity for free, and you often pay dearly for certain key elements (Iaa
Re: (Score:2)
500 servers is like 10-20 racks. That's a very small datacenter in the company's basement or perhaps 1 mainframe with 500 instances that was recently (in the last decade) converted to a cluster. Either way, if a worldwide system is located in a single datacenter, I'd say the latter is probably the case which is currently IBM's modus operandi when a customer wants to upgrade an old mainframe
Single data centre for critical resource? (Score:2)
Blimey, I wouldn't do that running a bog-standard streaming service, never mind an airline with $100 million a day of revenue.
500 servers is about 50 racks. About 500,000 a year, plus about 2,000,000 for kit, 4,000,000 for software and licenses, and 250,000 for interconnect. So capex roughly 6,000,000 and opex call it 1,000,000 per annum.
I normally rate a major DC failure (more than 10 min) at about once every 5 years.
Easy business case.
Also, generator and UPS failover is tough to test with only one DC. Which hit this
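Working that business case through with the parent's own rough figures (plus the roughly $100M/day revenue number from the story; how much disruption a second site actually avoids is my assumption):

# The parent's "easy business case", worked through. Cost figures are the parent's
# rough estimates; revenue is the ~$100M/day number from the story; the half day
# of disruption avoided is a hypothetical assumption.
capex = 6_000_000                 # kit + software/licenses + interconnect for a second DC
opex_per_year = 1_000_000
failure_every_years = 5           # "major DC failure (more than 10 min) about once every 5 years"
revenue_per_day = 100_000_000
disruption_days_avoided = 0.5

cost_per_failure_window = capex + opex_per_year * failure_every_years   # ~$11M per window
loss_avoided_per_failure = disruption_days_avoided * revenue_per_day    # ~$50M per failure
print(f"Second DC per {failure_every_years}-year window: ${cost_per_failure_window / 1e6:.0f}M")
print(f"Loss avoided per major failure: ${loss_avoided_per_failure / 1e6:.0f}M")

With those assumptions the second site comes out well ahead, which is the parent's point.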
Security (Score:2)
Re: (Score:2)
Re: (Score:2)
Some Lessons Are NEVER Learned (Score:2)
In the summer of 2003, the Great North-East Blackout hit New England and other areas in the U.S. and parts of Canada. My wife and I were in Montreal at the time. When we tried to fly home non-stop to California from Trudeau International Airport (called Dorval International Airport at that time) via Air Canada on an early morning flight, we instead found ourselves flying in the late afternoon to Dulles in Washington, DC, changing planes, and then flying home. We arrived at our house more than 12 hours late.
No,
Re: (Score:2)
They probably did but then tried to shave the costs down again when a new generation of managers came along 2 years later "why do we have a datacenter doing nothing most of the time, let's go to the SAAS/MAAS/PAAS model (which I think was the buzz word for shared hosting 10 years ago)"
Delta is number TWO in more ways than ONE (Score:2)
Namely, they always shovel out heaps of number two whenever something goes terribly wrong. Their response policy is ALWAYS to tell LIES. Its POLICY to SPIN ALL NEGATIVE PRESS ATTENTION AT ALL TIMES. The truth will only make it worse because they know they are prone to major fuck ups and they have lots of enemies. They just don't want any of their cheap, stupid, or dishonest screw ups to look like they are willing and able to constantly screw up service for their customers since it is a calculated risk
Re: (Score:2, Informative)
The auto-install of Windows updates drains your battery and does not stop for battery mode or UPS shutdown commands.
Re: (Score:2)
Ha ha ha...very funny...not......
This was a power issue (cue the 'the IT staff needs to be hung by their scrotums for such shitty power infrastructure' comments).
Re:Shouldn't have upgraded to W10 ! (Score:5, Insightful)
Even IF one of your data centers has a power outage (which should not happen as you should have backup generators and batteries that give power until the generators are spun up), you should always have at least ONE other backup data center to take over if something really fails for you.
Fire when not ready (Score:3)
Even IF one of your data centers has a power outage (which should not happen as you should have backup generators and batteries that give power until the generators are spun up),
Actually, what I'm hearing is that a fire in the backup generator took out the primary generator. So, this is a case in which the backup was the problem, not the solution.
Backup data center? (Score:5, Insightful)
Actually, what I'm hearing is that a fire in the backup generator took out the primary generator.
Shouldn't have any effect on the BACKUP DATA CENTER. One facility can go down. It happens. It should take a thermonuclear war to take out several if they are doing it right.
Insurance (Score:5, Informative)
Without Federal requirements there is no way a corporation is going to spend that kind of money.
A few failures like this one and they'll dig into the couch cushions to find the change for it. Having a backup data center for stuff that will shut the company down is not exactly a tough thing to justify. This shutdown alone would probably justify the cost in a single day.
They have legal protections in place to assure they retain their terminal slots, so while they aren't making money now they won't lose in the long run.
Perhaps but if they managed their IT properly they wouldn't have to lose money now. They can buy the insurance or they can take the risk of serious illness so to speak. Their choice and their funeral. Sounds like they rolled the dice and came up snake eyes today.
The only businesses with total data recovery sites and plans to actually use them are Banks, and that is because they are required by the FDIC.
Not true. Some medical practices have them. Some internet firms have them (at least for the mission critical stuff). Some bits of the military and government have them. Insurance companies have them. Stock exchanges have them. And there are more as well. If it's valuable enough you have a backup data center of some sort.
Re: (Score:2)
I work with BCRS (Business Continuity and Resiliency Services ) at IBM on a regular basis as an IBM Cloud Architect.
ALL KINDS of companies have Recovery plans, but it sure is more likely in the industries you mentioned.
I know of at least one airline competitor that has multiple sites for their business, and came to IBM to get 2 more built because the current two were too close for comfort after Hurricane Sandy proved how large an area a single disaster could impact.
application vs infrastrucure recovery (Score:2)
There is a great deal of difference between recovering certain applications, or having multiple sites running a subset of one facet of your operation, and a full infrastructure recovery, which requires the hardware, staff and FULL data, e.g. full application and user data, available to recover from scratch. That kind of overhead is enormous. Recovering mission critical stuff is par for the course, but recovering everything in a DC needed to do day to day operations in the event of a full infrastructure failure is a different beas
Re: (Score:2)
Yeah, except that they do. Lots of them.
It's called "having a disaster recovery / business continuity plan"
application recovery vs infrastructure recovery (Score:2)
Do they have an entire recovery DC or space in someone else's DC? Most businesses have plans to recover certain applications or move them to run on backup/development hardware. I worked for years in contingency recovery, and most places I've supported have space to recover applications should they fail, but few have the dedicated space or a plan to recover an entire infrastructure should a failure occur, and fewer have a plan to move BACK to the original space when the problem is fixed. The cost to maintain
Re: (Score:3)
Off the top of my head I can name over 20 companies that have full failover to a backup DC. One of them is an Airline that everyone knows the name of.
Hell, I have configured stretch clusters for companies so that in the event of a DC failure the secondary DC is available with 0 downtime and the failover is automatic. So it is done, it is normal operating procedure/best practice, and there is no reason the SECOND LARGEST AIRLINE IN THE US IS NOT DOING IT!!!
If you want to argue that some small company of
Re: Backup data center? (Score:2)
And companies that need their websites to make money?
For some I know of, the IT staff have done everything in their power (identified risks, probability of the risk, impact in the event, possible mitigations, cost and time to implement mitigations etc.). "Business" (aka bean-counters) have decided that they accept the risk of the current status (business continuity plan takes more than a business week to restore basic services in the event of a complete failure in the primary site).
You are making things up completely; my employer and all of our competitors have multiple levels of redundancy (multiple datacenters, multiple availability zones per datacenter). If I want to deploy a little REST API, our policies and systems enforce 9 total VMs: 3 per AZ across 3 AZs. Our competitors are at least as serious about uptime as we are.
And this is the cheapest possible service to provide redundancy for. When you
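For a sense of why that layout is enforced, here's the back-of-the-envelope availability math for 3 VMs in each of 3 AZs (the failure probabilities are invented; real figures depend on the provider and the service, and the independence assumption is generous):

# Independence-assuming availability math for 3 VMs in each of 3 AZs.
# Per-VM and per-AZ failure probabilities below are invented for illustration.
p_vm_down = 0.01    # chance a single VM is unhealthy at any given moment
p_az_down = 0.001   # chance an entire availability zone is out

# An AZ can serve traffic unless the AZ itself is down or all 3 of its VMs are down.
p_az_unavailable = p_az_down + (1 - p_az_down) * p_vm_down ** 3

# The service is down only if all 3 AZs are unavailable at the same time.
p_service_down = p_az_unavailable ** 3

print(f"P(one AZ unavailable) ~ {p_az_unavailable:.2e}")
print(f"P(whole service down) ~ {p_service_down:.2e}")   # on the order of 1e-9 here

Correlated failures (a bad deploy, a regional event) break the independence assumption, which is why some shops go multi-region on top of multi-AZ.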
Re: (Score:2)
Also, with a fire it's more likely that someone hits the red button or some automated system kills the power. And if the firemen on site say to kill all power, it likely happens as a hard power-off.
Re: (Score:3)
Re: (Score:2)
IT folks usually put in the requirements for the power infrastructure, but I've almost never seen them handle it.
Often, it's building/maintenance who handles it.
And as with any project, it's probably that upper management didn't want to pay for the level of redundancy that IT said was required.
Re: (Score:2)
It's fun watching every department point to every other department for blame.
IT - It's upper management's fault. We assumed they took care of it. Or now, it's facilities' fault.
Upper MGMT - It's IT's fault. We assumed if it still needed doing they would have told us.
Just need IT and Upper MGMT to talk first so they can sync the blame on facilities.
Re: (Score:2)
Well, the testing assumes it's there at all.
Re: (Score:2)
Re: (Score:2)