City's IT Infrastructure Brought To Its Knees By Data Center Outage
An anonymous reader writes "On July 11th in Calgary, Canada, a fire and explosion was reported at the Shaw Communications headquarters. This took down a large swath of IT infrastructure, including service for Shaw's telephone and Internet customers, local radio stations, emergency 911 services, provincial services such as Alberta Health Services computers, and Alberta Registries. One news site reports that 'The building was designed with network backups, but the explosion damaged those systems as well.' No doubt this has been a hard lesson on how NOT to host critical public services."
First post! (Score:4, Informative)
Re:First post! (Score:5, Funny)
Re: (Score:1)
Re: (Score:1)
The unofficial offsite backup (the trunk of a certain station wagon) shall henceforth use the Telus office parking lot.
Re: (Score:1)
Re: (Score:2)
Re: (Score:1)
You don't pay big money to IBM for uptime; all IBM does now is resell other companies' services and take a share of the money as a middleman.
Re: (Score:2)
This is true. We just finished doing evaluations, and IBM's quote included subbing out ALL the work to multiple sub-vendors; the only part with IBM's name on it was the quote itself.
Re: (Score:2)
Incorrect. I know many IBMers who have been restoring service to that datacentre for more than 50 hours, on two hours of sleep. It sucks while you're doing it, but it's worth much geek cred in my book.
Re: (Score:1)
Shaw has two major locations in Calgary; the only people affected are those served by the downtown site.
Or... (Score:4, Insightful)
Re: (Score:2)
Re: (Score:3)
Re: (Score:2)
Irrelevant. The fire department became involved and that typically means you shut it all down if they say so (or they'll do it for you), even the redundant stuff that's still running. The only way around that is separate physical buildings.
in some buildings / data centers the fire system (Score:2)
In some buildings / data centers, the fire system can kill most of the power.
Re: (Score:2)
Which is a *good* thing. Fire and live electrical systems don't mix well.
Re: (Score:2)
Re: (Score:2)
Uhh, it is stupidity. Having your DR in the same site as your production servers is monumentally stupid. Most companies I've worked at have rules that state a minimum of 5km distance between production & DR sites in case of catastrophic failure. The Department of Defence here in Australia has 500km between their two production sites & DR. Our biggest service provider has 1000km between production & DR.
The only time I have seen the same building used is for redundancy & then the two comms/ser
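A minimal sketch of how a minimum-separation rule like the ones above could be sanity-checked, assuming made-up coordinates and a hypothetical policy threshold; nothing here reflects any real company's sites or rules:

from math import radians, sin, cos, asin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    # Great-circle distance between two points, in kilometres.
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371 * asin(sqrt(a))

# Placeholder coordinates: roughly downtown Calgary and Edmonton.
production = (51.045, -114.057)
dr_site = (53.546, -113.490)
MIN_SEPARATION_KM = 500  # hypothetical policy, like the DoD example above

dist = haversine_km(*production, *dr_site)
print(f"{dist:.0f} km apart:", "OK" if dist >= MIN_SEPARATION_KM else "TOO CLOSE")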
Re: (Score:1)
Yeah, silly Australian Department of Defence with its poor network redundancy planning -- if a meteor hits Australia and wipes it off the map, there will be no backups available.
Re: (Score:2)
No Site Level Resiliency? (Score:5, Insightful)
Whoever designed this should be smacked in the head. You never have critical services relying on a single location. You should have redundancy at every level, including geographic (i.e., not in the same flood / fault / fire zone).
Re: (Score:2, Informative)
The issue is that IBM runs Alberta Health Services and other infrastructure out of the Shaw building, where IBM has its own datacenter. IBM had no proper backups in place for these services.
911, being the most critical, was also not affected; Shaw VoIP users just couldn't call 911 if their lines were down, obviously (only ~20k people downtown were affected).
Re: (Score:2)
It's IBM's fault that their customers didn't have a DR plan?
Re: (Score:1)
Add 'tornado zone' to that list.
If you host all your cloud services at Rackspace in Texas and a tornado happens to rip apart their datacenter, well, expect a few hours or days of downtime. And you'd better have offsite backups of mission-critical data, or that's a long bet that is getting shorter every day.
Re:No Site Level Resiliency? (Score:4, Insightful)
This is why I do not understand the rush to the cloud. The same types of outages that apply to locally hosted data apply to the cloud providers. You still need the backups, disaster plans, and the ability to access the servers, much the same stuff if not more than you would need if hosting it yourself. Is the cloud that much cheaper or something? Or is it more about marketing hype that talks PHBs and supervisors who want to sound cool into situations like this, where diligence is not necessarily a priority?
Re: (Score:2)
Re: (Score:2, Insightful)
The cloud is not cheaper, unless you're doing things really wrong in the first place, like buying tier 1 servers or running Windows.
It does provide economies of scale, can be somewhat cost-competitive with doing it yourself for at least some things, and you don't have to deal with hardware depreciation and the constant refresh cycle.
The big cloud providers also integrate a lot of services that would be a pain to build internally for small and mid-sized clients.
Hype explains the rest. PHBs are always looking
Re: (Score:2)
Arguably you could still use two different cloud providers after verifying (and continuing to verify over time) that the infrastructure (and connectivity to it) is actually redundant.
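A minimal sketch of that "verify and keep verifying" idea, assuming two hypothetical health-check URLs (placeholders, not real endpoints); the point is just that "redundant" only counts if both copies answer independently:

import urllib.request

PROVIDERS = {
    "provider-a": "https://service.provider-a.example/health",  # placeholder URL
    "provider-b": "https://service.provider-b.example/health",  # placeholder URL
}

def healthy(url, timeout=5):
    # A provider counts as up only if its health endpoint answers with HTTP 200.
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        return False

up = [name for name, url in PROVIDERS.items() if healthy(url)]
if len(up) < 2:
    print("WARNING: redundancy degraded; reachable providers:", up or "none")
else:
    print("Both providers reachable:", ", ".join(up))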
Re: (Score:2)
Texas and a tornado happens to rip apart their datacenter
I've always wondered why stuff wasn't built in more 'secure' locations. Stuff like monolithic domes designed to take F5 tornadoes.
Re: (Score:2)
I've always wondered why stuff wasn't built in more 'secure' locations. Stuff like monolithic domes designed to take F5 tornadoes.
Mostly these days it's built in whatever they have available, because actually building anything costs too much. That's got to be responsible in large part for the rise of the shipping container as a data center... it's a temporary, soft-set structure. You only need a permit for the electrical connection, and maybe a pad.
Re: (Score:2)
I've always wondered why stuff wasn't built in more 'secure' locations. Stuff like monolithic domes designed to take F5 tornadoes.
The cost is impressive.
Re: (Score:3)
Is everybody safe
That is, quite literally, someone else's problem. It sounds callous to say, but seriously, /. isn't a site for first responders; it's for IT and CS types. It's not like we're looking at one thing at the expense of the other here: your data (and 911 access) should work, people shouldn't die in a fire, and your data shouldn't be hosed if it was housed there.
As to your point about universities. As tragic as it might be if someone died in a fire tomorrow at the university I graduated from 10 years ago, I stil
Re: (Score:2)
No one was hurt when a blast in a 13th-floor electrical room on Wednesday brought down Alberta Health Services computers, put three radio stations off the air and affected some banking services.
Because no one was hurt, you fucking Chicken Little. If you gave a shit at all about "the people" you would have RTFA to find out. Go ahead and read it. I hope it brings stuff into perspective for you.
Re: (Score:3, Interesting)
Imagine if the Library of Alexandria had had backup copies of all those books, manuscripts, and other treasures. How about Constantinople? I'm sure there were people who tried to protect that data and believed it was worth more than their lives. I hope that brings stuff into perspective.
Re: (Score:2)
it's boni
Maybe the city/provinces should skip on redundancy (Score:3, Interesting)
The issue with the city/provincial critical services is that they didn't have geographical redundancy, due to the cost. Yes, the building had redundant power and networks, but it was the whole building that was affected. At the end of the day, Shaw did fuck up, but all the essential services were completely fucked up as well.
Re: (Score:3)
The real problem was with the datacenter housed in the same building. 20,000 consumer-class Internet outages is nothing c
Re: (Score:3)
Re: (Score:2)
The issue with the city/provincial critical services is that they didn't have geographical redundancy, due to the cost. Yes, the building had redundant power and networks, but it was the whole building that was affected. At the end of the day, Shaw did fuck up, but all the essential services were completely fucked up as well.
Cost should not be an issue when we're talking about life-or-death critical services provided by some level of government. You spend what you have to spend to get the job done right, not more, not less. We're also not talking about a town with a population of 16 but a province with a population of 3,645,257 (in 2011). I am quite sure that they had the means to do this the right way and just chose not to.
Shaw is an ISP (Score:2)
Re: (Score:3)
Actually, Shaw's a media company - they don't do just internet, but phones as well. Those went down too (Shaw has business packages for phone service over cable, but downtown I'd guess they also have fiber phone service).
And it really isn't a screwup - they were doing somet
Re: (Score:3)
Redundant internet connections are no guarantee against single points of failure.
"Redundant" connections can sometimes wind up on the same fiber somewhere upstream, unbeknownst to the subscriber.
Most telecommunications infrastructure in any area also runs through some very large aggregation points... a telco central office is a single point of failure for every telecommunications service served by that office.
What good is working 911 service if nobody can call in, because all their phones are rendered useless by
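A rough sketch of how a subscriber might check whether two "redundant" uplinks actually diverge upstream, assuming a Linux traceroute that accepts -s to pick the source address; the target and source addresses are placeholders:

import re
import subprocess

TARGET = "8.8.8.8"
UPLINK_SOURCES = ["192.0.2.10", "198.51.100.10"]  # placeholder: one address per uplink

def hops_via(source_ip):
    # Collect the intermediate hop IPs traceroute reports when forced out a given uplink.
    out = subprocess.run(
        ["traceroute", "-n", "-s", source_ip, TARGET],
        capture_output=True, text=True, timeout=120,
    ).stdout
    return set(re.findall(r"\d+\.\d+\.\d+\.\d+", out)) - {source_ip, TARGET}

path_a, path_b = (hops_via(ip) for ip in UPLINK_SOURCES)
shared = path_a & path_b
if shared:
    print("Possible shared upstream hops:", ", ".join(sorted(shared)))
else:
    print("No common hops seen; the two uplinks at least look diverse.")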
Fukushima Daiichi - Anyone? (Score:2)
Putting all of one's eggs in one basket, whether that's all your reactors on one set of backup generators or all your data in one place, is the reason for these catastrophic failures. Some 30-odd years ago we had mainframes (I used to operate an IBM 360); then along came the internet and the distributed computing model, where a company's data wasn't all in one box. There was a box on everyone's desktop. Now that's come full circle with the "Cloud" initiative, where all your data is housed in one place (dat
Re: (Score:1)
Actually, the backup generators were located in the basement, as per GE's original design. Engineers asked to put the generators in a location more secure from tsunamis when the plants were built, but were overruled by upper management.
However, the location of the generators at Fukushima is irrelevant to my point: had the power grid been a distributed system with local batteries or what have you, residents would not all have lost power when the plant was flooded. This is not an argument about t
Limitations (Score:2)
There are limitations to how high your HA can be depending on the volume of data you process and the infrastructure available.
In this case an entire building was knocked out by an exceptional circumstance. You can plan for that by having buildings at multiple sites, but as you get farther apart the connecting infrastructure gets more difficult. Here, Shaw is an ISP (one of the big boys in that part of the country), so you'd expect that access to fast connections should be their forte. One thi
Re: (Score:2, Interesting)
It's kind of funny for me to hear someone call Shaw one of the "big boys" after working on a number of telecom projects in the USA. In the scheme of things, Shaw would only rate as maybe a tier 3 player. Their maximum customer base is maybe 8 million potential people... not households (and I'm presupposing they are in Saskatchewan and Manitoba now; otherwise subtract a couple million). And they compete with Telus and Manitoba Tel and SaskTel (or whoever it is there). That's definitely tier 3 or smaller.
Captain Obvious (Score:3)
> No doubt this has been a hard lesson on how NOT to host critical public services.
And no doubt the lesson was not learned.
Re: (Score:2)
Yeah, if it was entirely government owned, it'd be rock solid and cheaper, too.
Not surprising (Score:3, Informative)
There are buildings all over the US that could have a similar effect, but worse. In Seattle it would be the Westin Tower: take out the two electrical vaults in that building and you'll pretty much knock most phone service, internet service, and various emergency-agency services across the state offline for a while.
What I now consider a classic example is the outage at Fisher Plaza. It not only took down credit card processors, Bing Travel, and a couple of other big online services; it also took out Verizon's FiOS service for western Washington.
http://www.datacenterknowledge.com/archives/2009/07/03/major-outage-at-seattle-data-center/
(apologies, I don't comment a lot and don't know how to link properly)
The big problem is that many services, no matter how redundant they may seem, nowadays have an upstream geographic single point of failure (a la my Westin Tower example).
Transformer fire? (Score:2)
Transformers sometimes fail catastrophically and without warning. Other than keeping transformers outside, such things simply fall under "shit happens". Then once the fire department gets involved you turn off all the power: your backup generators, your UPS, everything.
911 not down (Score:2)
Re: (Score:2)
Not just Shaw's network was affected (Score:1)
Our primary internet connection was Bell, and Shaw was our backup. To our surprise, Bell's downtown network relies on Shaw's backbone and was ultimately affected by this monumental single point of failure.
To get back on the internet without having to fail over to our DR site, we came up with a crazy solution: hooking up a Rogers Rocket Hub. The damn thing worked, without our ~85 employees and 3 remote users noticing a difference.
Over the next few weeks we will be canceling all of our Shaw services, signing up
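A minimal sketch of the kind of failover described above, assuming a Linux box acting as the router, placeholder gateway addresses, and the stock ping and ip commands; a real setup would more likely use the router's own failover features than a script like this:

import subprocess
import time

PRIMARY_GW = "203.0.113.1"   # placeholder: primary ISP gateway
BACKUP_GW = "192.168.8.1"    # placeholder: cellular hub gateway

def reachable(host):
    # Three pings with a short timeout; exit code 0 means at least one reply came back.
    return subprocess.run(
        ["ping", "-c", "3", "-W", "2", host],
        stdout=subprocess.DEVNULL,
    ).returncode == 0

def set_default_route(gateway):
    subprocess.run(["ip", "route", "replace", "default", "via", gateway], check=True)

while True:
    set_default_route(PRIMARY_GW if reachable(PRIMARY_GW) else BACKUP_GW)
    time.sleep(30)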
Re: (Score:1)
Have fun with Enmax; their stability record is pretty awesome. Try getting any kind of service on a weekend: their NOC number goes to a pager and they'll call you back within the hour. I deal with them so often it isn't funny. And watch out: they contract out the last mile in many of their build-outs. They're likely using Telus and Shaw, and the best part is they won't tell you.
Poof! (Score:2)
What really happened... (Score:4, Interesting)
Shaw had a generator overheat and literally blow up, which damaged their other two generators and caused an electrical arc fire. The fire set off the sprinklers, and in turn the water shut down the backup systems.
Yes, it was stupid that Shaw housed all their critical systems, including backups, in one building, but even more stupid was the fact that they used a water-based sprinkler system in a bloody telecom room.
Also, Alberta has this wonderful thing called the Alberta SuperNet, which, if I recall, all the health regions used to use before our government decided to spend hundreds of millions of dollars to merge everything together and then spend even more money to use the Shaw network to connect everything. The SuperNet was specifically designed with government offices in mind, but nooo, why use something you have already paid for when you can spend more money and use something different.
Re: (Score:2)
Halon has been banned for quite some time now. The replacement, Halotron, was just recently restricted.
CO2 would make the most sense.
The problem, from what I understand, was the generator room. A CO2 system would be ideal in this situation; you're dealing with lubricants as well, and foam just makes an awful mess.
What I don't understand is how the sprinkler system was involved at all. When a sprinkler head bursts, it only flows at the affected area. It's not like the movies, where if one pops the whole system goes off.
I
It was so bad.. (Score:5, Funny)
OK, you read the headlines, now some FACTS (Score:1)
'City's IT Infrastructure Brought To Its Knees By Data Center Outage'
Incorrect!! Certain key public and private infrastructure systems were (and still are) housed at the Shaw Court data centre, yes. But the 'City's IT Infrastructure' was certainly ANYTHING BUT 'brought to its knees'. Simply not true, inflated, and blown way out of proportion.
'This took down a large swath of IT infrastructure, including Shaw's telephone and Internet customers'
Grossly overstated. 'Large swath' I can accept as the impact w
It has to be said (Score:1)
I used to be a City until I took an arrow to the knee.
Single Points of Failure (Score:2)
People often walk around with some very bad assumptions about how resilient the Internet or the cloud must be.
You may have a very good internet presence with lots of bandwidth, but it may all be housed in the same building, where the same sprinkler system can bring it all down. You may think ISPs can reroute lots of traffic to other places because it is technically possible, yet there are common failure modes there too.
Cloud computing is often hailed as a very resilient method for infrastructure. Yet, there is a dis
IBM and their customers' fault (Score:1)