This discussion has been archived. No new comments can be posted.

City's IT Infrastructure Brought To Its Knees By Data Center Outage

  • First post! (Score:4, Informative)

    by Svartormr (692822) on Friday July 13, 2012 @05:36PM (#40643819)
    I use Telus. >:)
    • by clarkn0va (807617) <apt.get@gm[ ].com ['ail' in gap]> on Friday July 13, 2012 @05:40PM (#40643867) Homepage
      So Shaw customers get all their disappointment in one fell swoop, while you suffer subclinical abuse on an ongoing basis. Congrats.
      • by Anonymous Coward
        Thanks. I nearly got a hernia I laughed so hard at that. Seriously. You could say Telus are a bunch of cunts, but cunts are useful.
      • by Anonymous Coward

        The unofficial offsite backup (the trunk of a certain station wagon) shall henceforth use the Telus office parking lot.

    • I am in Calgary and use Shaw as my ISP. I did not have any internet downtime. Perhaps the damage was restricted to certain servers while the rest of their network was unaffected?
      • Shaw Court is an IBM datacenter. Many companies lost critical servers. We had a couple dozen there that are still coming back up. This didn't just affect Shaw, but also big customers that pay big money for uptime.
        • by gen0c1de (977481)

          You don't pay big money to IBM for uptime; all IBM does now is resell other companies' services and take a share of the money, acting as a middleman.

          • by BagOBones (574735)

            This is true. We just finished doing evaluations, and IBM's quote included subbing out ALL the work to multiple sub-vendors; the only part with IBM's name on it was the quote itself.

          • by Dr Caleb (121505)

            Incorrect. I know many IBMers that have been restoring service to that datacentre for more than 50 hours, on 2 hours sleep. It sucks when you are doing it, but it's worth much geek cred in my book.

      • by dargon (105684)

        Shaw has 2 major locations in Calgary; the only people affected are those that use the downtown site.

  • Or... (Score:4, Insightful)

    by Transdimentia (840912) on Friday July 13, 2012 @05:40PM (#40643859)
    ... it just points out what should be practical thought: no matter how many redundancies you build, you can never escape the (RMS) Titanic effect. So stop claiming stupidity.
    • by g0es (614709)
      Well, it seems they had the redundant systems in the same building. When designing redundant systems, it's best to avoid common-mode failure whenever possible.
      • The designers AND their managers and their managers should be made redundant.
      • Irrelevant. The fire department became involved and that typically means you shut it all down if they say so (or they'll do it for you), even the redundant stuff that's still running. The only way around that is separate physical buildings.

      • by cusco (717999)
        There's a local municipality whose IT department was very proud of its redundant fiber ring. Then a backhoe pointed out the fact that all of the fibers, prod and redundant, were all in the same conduit. Oops.
    • Uhh, it is stupidity. Having your DR in the same site as your production servers is monumentally stupid. Most companies I've worked at have rules that state a minimum of 5km distance between production & DR sites in case of catastrophic failure. The Department of Defence here in Australia has 500km between their two production sites & DR. Our biggest service provider has 1000km between production & DR.

      The only time I have seen the same building used is for redundancy & then the two comms/ser

      • Two buildings eh? Like those US companies who had their main and backup systems in the two World Trade Centre Towers in NY. It sure helped them a lot...
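A minimum-separation policy like the ones quoted above (5 km, 500 km, 1000 km) is easy to check mechanically. Here is a minimal Python sketch using the haversine great-circle formula; the coordinates (downtown Calgary production, an Edmonton DR site) and the 5 km floor are illustrative assumptions, not anyone's actual policy:

```python
from math import radians, sin, cos, asin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in kilometres."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 \
        + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371 * asin(sqrt(a))  # 6371 km: mean Earth radius

# Hypothetical sites: downtown Calgary production, Edmonton DR.
prod = (51.0447, -114.0719)
dr = (53.5461, -113.4938)
MIN_SEPARATION_KM = 5  # the policy floor one commenter mentions

distance = haversine_km(*prod, *dr)
assert distance >= MIN_SEPARATION_KM, "DR site too close to production"
print(f"prod/DR separation: {distance:.0f} km")
```

Of course, raw distance is only a proxy: the real requirement is that the two sites share no flood plain, fault line, power grid, or fire department shutdown order.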
  • by sociocapitalist (2471722) on Friday July 13, 2012 @05:40PM (#40643861)

    Whoever designed this should be smacked in the head. You never have critical services relying on a single location. There should be redundancy at every level, including geographic (i.e. not in the same flood / fault / fire zone).

    • Re: (Score:2, Informative)

      by Anonymous Coward

      The issue is that IBM runs Alberta Health Services and other infrastructure from the Shaw building, in which IBM has its own datacenter. IBM had no proper backups in place for these services.

      911, being the most critical, was also not affected; Shaw VoIP users just couldn't call 911 if their lines were down -- obviously (only ~20k people downtown were affected).

    • by jtnix (173853)

      Add 'tornado zone' to that list.

      If you host all your cloud services at Rackspace in Texas and a tornado happens to rip apart their datacenter, well, expect a few hours' or days' downtime. And you had better have offsite backups of mission-critical data, or that's a long bet that is getting shorter every day.

      • by sumdumass (711423) on Friday July 13, 2012 @06:37PM (#40644435) Journal

        This is why I do not understand the rush to cloud space. The same types of outages that apply to locally hosted data apply to cloud providers. You still need the backups, disaster plans, and the ability to access the servers -- much of the same stuff, if not more, than you would need if hosting it yourself. Is the cloud that much cheaper or something? Or is it more about marketing hype that talks PHBs and supervisors who want to sound cool into situations like this, where diligence is not necessarily a priority?

        • It's the warm body on the other end of the line who kisses your ass so well you can hardly yell at them for anything.
        • Re: (Score:2, Insightful)

          by ahodgson (74077)

          The cloud is not cheaper, unless you're doing things really wrong in the first place, like buying tier 1 servers or running Windows.

          It does provide economies of scale, can be somewhat cost-competitive with doing it yourself for at least some things, and you don't have to deal with hardware depreciation and the constant refresh cycle.

          The big cloud providers also integrate a lot of services that would be a pain to build internally for small and mid-sized clients.

          Hype explains the rest. PHBs are always looking

        • Arguably you could still use two different cloud providers after verifying (and continuing to verify over time) that the infrastructure (and connectivity to it) is actually redundant.
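The "continuing to verify over time" part above is the key. One simple way to do it is to keep a history of health-check probes against both providers and flag any interval where both were down at once, since correlated failures are evidence of a shared dependency. A sketch with made-up sample data; in practice the history would come from real probes:

```python
from dataclasses import dataclass

@dataclass
class Sample:
    """One health-check interval: True means the endpoint was reachable."""
    provider_a_up: bool
    provider_b_up: bool

def correlated_outages(history):
    """Count intervals where both providers were down simultaneously --
    evidence that the 'redundant' providers share a failure domain."""
    return sum(1 for s in history
               if not s.provider_a_up and not s.provider_b_up)

history = [
    Sample(True, True),
    Sample(False, True),   # A down alone: failover can cover this
    Sample(False, False),  # both down at once: suspicious shared dependency
    Sample(True, True),
]
print("simultaneous outages:", correlated_outages(history))
```

A nonzero count does not prove a shared fiber or power feed, but it is the cheapest possible alarm that the two providers are less independent than the contracts claim.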

      • Texas and a tornado happens to rip apart their datacenter

        I've always wondered why stuff wasn't built in more 'secure' locations. Stuff like monolithic domes designed to take F5 tornadoes.

        • by drinkypoo (153816)

          I've always wondered why stuff wasn't built in more 'secure' locations. Stuff like monolithic domes designed to take F5 tornadoes.

          Mostly these days it's built in whatever they have available, because actually building anything costs too much. That's got to be responsible in large part for the rise of the shipping container as a data center... it's a temporary, soft-set structure. You only need a permit for the electrical connection, and maybe a pad.

        • by Ol Olsoc (1175323)

          I've always wondered why stuff wasn't built in more 'secure' locations. Stuff like monolithic domes designed to take F5 tornadoes.

          The cost is impressive.

  • by Anonymous Coward on Friday July 13, 2012 @05:49PM (#40643943)

    The issue with the city/provincial critical services is that they didn't have geographical redundancy, due to the cost. Yes, the building had redundant power and networks, but it was the whole building that was affected by this. At the end of the day, Shaw did fuck up, but whoever ran the essential servers fucked up too.

    • The issue with the city/provincial critical services is that they didn't have geographical redundancy, due to the cost. Yes, the building had redundant power and networks, but it was the whole building that was affected by this. At the end of the day, Shaw did fuck up, but whoever ran the essential servers fucked up too.

      Cost should not be an issue when we're talking about life or death critical services that are provided by some level of government. You spend what you have to spend to get the job done right, not more, not less. We're also not talking about a town with a population of 16 but a city with a population of 3,645,257 (in 2011). I am quite sure that they had the means to do this the right way and just chose not to.

  • All these other services lost their internet access; that is all. While I am sure in a perfect world all these government services and companies would have had redundant internet connections, that is often prohibitively expensive.
    • by tlhIngan (30335)

      All these other services lost their internet access; that is all. While I am sure in a perfect world all these government services and companies would have had redundant internet connections, that is often prohibitively expensive.

      Actually, Shaw's a media company -- they do not only internet but phones as well, and those went down too (Shaw has business packages for phone service over cable, but downtown, I'd guess they also have fiber phone service).

      And it really isn't a screwup - they were doing somet

    • by mysidia (191772)

      Redundant internet connections are no guarantee of no single points of failure.

      "Redundant" connections can sometimes wind up on the same fiber somewhere upstream, unbeknownst to the subscriber.

      Most telecommunications infrastructure in any area has some very large aggregation points also... Telco Central Offices; a single point of failure for telecommunications services served by that office.

      What good is working 911 service, if nobody can call in, because all their phones are rendered useless by
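The same-fiber-upstream problem described above can at least be probed for: collect the traceroute path over each "redundant" link and look for hops that appear on both, since any overlap beyond your own router is a candidate single point of failure. A minimal sketch with made-up router addresses (RFC 5737 documentation ranges); real use would feed in actual traceroute/mtr output per link:

```python
def shared_hops(path_a, path_b):
    """Return hops that appear on both 'redundant' paths.
    Overlap beyond the local gateway suggests the links converge upstream."""
    return sorted(set(path_a) & set(path_b))

# Hypothetical traceroute results (router IPs) over two uplinks.
link_1 = ["10.0.0.1", "203.0.113.5", "198.51.100.9", "192.0.2.40"]
link_2 = ["10.0.1.1", "203.0.113.77", "198.51.100.9", "192.0.2.40"]

overlap = shared_hops(link_1, link_2)
print("shared upstream hops:", overlap)
```

This only sees layer-3 convergence; two paths with disjoint router IPs can still ride the same conduit, which is exactly the failure mode the backhoe story illustrates. For that, you have to ask the carriers for the physical route.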

  • Putting all of one's eggs in one basket, all your reactors on one backup generator, or all your data in one place is the reason for these catastrophic failures. Back 30-odd years ago we had mainframes (I used to operate an IBM 360); then along came the internet and the distributed computing model, where the system didn't have all of its data in even one box in a company. There was a box on everyone's desktop. Now that's come full circle with the "Cloud" initiative where all your data is housed in one place (dat

  • There are limitations to how high your HA can be depending on the volume of data you process and the infrastructure available.

    In this case an entire building was knocked out by an exceptional circumstance. You can plan for that by having buildings in multiple sites, but as you get farther apart the connecting infrastructure gets more difficult. In this case Shaw is an ISP (one of the big boys in that part of the country), so you'd expect that access to fast connections should be their forte. One thi

    • Re: (Score:2, Interesting)

      It's kind of funny for me to hear someone call Shaw one of the "big boys" after working on a number of Telecom projects in the USA. In the scheme of things Shaw would only rate being maybe a tier 3 player. Their maximum customer base is maybe 8 million potential people .... not households (and I'm presupposing they are in Saskatchewan and Manitoba now otherwise subtract a couple million). And they compete with Telus and Manitoba Tel and Sasktel (or whoever it is there). That's definitely tier 3 or smaller.

  • by roc97007 (608802) on Friday July 13, 2012 @06:01PM (#40644107) Journal

    > No doubt this has been a hard lesson on how NOT to host critical public services.

    And no doubt the lesson was not learned.

  • Not surprising (Score:3, Informative)

    by Anonymous Coward on Friday July 13, 2012 @06:04PM (#40644135)

    There are buildings all over the US that can have a similar effect, but worse. In Seattle it would be the Westin Tower: knock out the two electrical vaults in that building and you'll pretty much take most phone service, internet service, and various emergency-agency services across the state offline for a while.

    What I now consider a classic example is the outage at Fisher Plaza. It not only took down credit card processors, Bing Travel, and a couple other big online services; it also took out Verizon's FiOS service for western Washington.
    http://www.datacenterknowledge.com/archives/2009/07/03/major-outage-at-seattle-data-center/
    (apologies don't comment a lot and don't know how to properly link)

    The big problem is that many services, no matter how redundant they may seem to be, nowadays have an upstream geographic single point of failure (à la my Westin Tower example).

  • Transformers sometimes fail catastrophically and without warning. Other than keeping transformers outside, such things simply fall under "shit happens". Then once the fire department gets involved you turn off all the power: your backup generators, your UPS, everything.

  • 911 service was not down; only customers using Shaw as their phone service provider were unable to reach it via Shaw's phone service. People were asked to use cell phones to call 911 as an alternative. Sounds like the city's emergency plan was activated and followed, prioritizing and assessing critical services and leaving the other non-essentials offline. Very likely the critical ones are also what is deemed to need redundancy (those probably have more than one ISP) while non-essential services don't.
    • by Svartormr (692822)
      Well, I was standing in a hospital emergency room several hours after the initial service loss, watching the staff fall back on paper systems. And many commonly used services, like finding out what medications a patient was on by checking a shared database used by pharmacists, were unavailable. No single event like this outage should have degraded all these services to uselessness.
  • by Anonymous Coward

    Our primary internet connection was Bell and Shaw was our backup. To our surprise Bell's downtown network relies on Shaw's backbone and was ultimately affected by this monumental single point of failure.

    To get back on the internet without having to fail-over to our DR site we came up with a crazy solution of hooking up a Rogers Rocket Hub. The damn thing worked without our ~85 employees and 3 remote users noticing a difference.

    Over the next few weeks we will be canceling all of our Shaw services, signing up

    • by gen0c1de (977481)

      Have fun with Enmax; their stability record is pretty awesome. Try getting any kind of service on a weekend: their NOC number on weekends goes to a pager and they will call you back within the hour. I deal with them so often it isn't funny. And watch out: they contract out the last mile in many of their build-outs. They're like using Telus and Shaw, and the best part is they won't tell you.

  • Sounds like the datacenter heard about this "cloud" thing and decided to give it a try.
  • by Anonymous Coward on Friday July 13, 2012 @06:18PM (#40644297)

    Shaw had a generator overheat and literally blow up, which damaged their other 2 generators and caused an electrical arc fire. This fire set off the sprinklers, and in turn the water shut down the backup systems.

    Yes, it was stupid that Shaw housed all their critical systems, including backups, in one building but even more stupid was the fact that they used a water based sprinkler system in a bloody telecom room.

    Also, Alberta has this wonderful thing called the Alberta SuperNet, which, if I recall, all health regions used to use before our government decided to spend hundreds of millions of dollars to merge everything together and spend even more money to use the Shaw network to connect everything. The SuperNet was specifically designed with government offices in mind, but nooo, why use something you have already paid for when you can spend more money and use something different?

  • by Megahard (1053072) on Friday July 13, 2012 @06:23PM (#40644339)
    It caused a stampede.
  • by Anonymous Coward

    'City's IT Infrastructure Brought To Its Knees By Data Center Outage'
    Incorrect!! Certain key public and private infrastructure systems were (and still are) housed at the Shaw Court data centre, yes. But the 'City's IT Infrastructure' was certainly ANYTHING BUT 'brought to its knees'. Simply not true, inflated, and blown way out of proportion.

    'This took down a large swath of IT infrastructure, including Shaw's telephone and Internet customers'
    Grossly overstated. 'Large swath' I can accept as the impact w

  • I used to be a City until l took an arrow to the knee.

  • People often walk around with some very bad assumptions about how resilient the Internet or a Cloud must be.

    You may have a very good internet presence with lots of bandwidth, but it may all be housed in the same building, where the same sprinkler system can bring it all down. You may think that ISPs can reroute lots of traffic to other places because it is possible. Yet there are common failure modes there too.

    Cloud computing is often hailed as a very resilient method for infrastructure. Yet, there is a dis

  • Yes, Shaw had an issue, and some local neighbourhood services related to Shaw went down, but more to blame for the large outages is IBM, for housing redundant systems in the same building, or the government customers, for buying a less-than-adequate redundancy solution. Always ask for a physical diagram in addition to the logical diagram :)
