This discussion has been archived. No new comments can be posted.

City's IT Infrastructure Brought To Its Knees By Data Center Outage

  • First post! (Score:4, Informative)

    by Svartormr (692822) on Friday July 13, 2012 @05:36PM (#40643819)
    I use Telus. >:)
    • by clarkn0va (807617) <apt.get@gm[ ].com ['ail' in gap]> on Friday July 13, 2012 @05:40PM (#40643867) Homepage
      So Shaw customers get all their disappointment in one fell swoop, while you suffer subclinical abuse on an ongoing basis. Congrats.
      • by Anonymous Coward
        Thanks. I nearly got a hernia I laughed so hard at that. Seriously. You could say Telus are a bunch of cunts, but cunts are useful.
      • by Anonymous Coward

        The unofficial offsite backup (the trunk of a certain station wagon) shall henceforth use the Telus office parking lot.

    • I am in Calgary and use Shaw as my ISP. I did not have any internet downtime. Perhaps the damage was restricted to certain servers while the rest of their network was unaffected?
      • Shaw Court is an IBM datacenter. Many companies lost critical servers. We had a couple dozen there that are still coming back up. This didn't just affect Shaw, but also big customers that pay big money for uptime.
        • by gen0c1de (977481)

          You don't pay big money to IBM for uptime; all IBM does now is resell other companies' services and take a share of the money, acting as a middleman.

          • by BagOBones (574735)

            This is true. We just finished doing evaluations, and IBM's quote included subbing out ALL the work to multiple sub-vendors; the only part with IBM's name on it was the quote itself.

          • by Dr Caleb (121505)

            Incorrect. I know many IBMers that have been restoring service to that datacentre for more than 50 hours, on 2 hours sleep. It sucks when you are doing it, but it's worth much geek cred in my book.

      • by dargon (105684)

        Shaw has 2 major locations in Calgary; the only people affected are those that use the downtown site.

  • Or... (Score:4, Insightful)

    by Transdimentia (840912) on Friday July 13, 2012 @05:40PM (#40643859)
    ... it just points out what should be practical thought: no matter how many redundancies you build, you can never escape the (RMS) Titanic effect. So stop claiming stupidity.
    • by g0es (614709)
      Well, it seems they had the redundant systems in the same building. When designing redundant systems, it's best to avoid common-mode failure whenever possible.
      • The designers AND their managers and their managers should be made redundant.
      • Irrelevant. The fire department became involved and that typically means you shut it all down if they say so (or they'll do it for you), even the redundant stuff that's still running. The only way around that is separate physical buildings.

      • by cusco (717999)
        There's a local municipality whose IT department was very proud of its redundant fiber ring. Then a backhoe pointed out the fact that all of the fibers, prod and redundant, were all in the same conduit. Oops.
    • Uhh, it is stupidity. Having your DR in the same site as your production servers is monumentally stupid. Most companies I've worked at have rules that state a minimum of 5km distance between production & DR sites in case of catastrophic failure. The Department of Defence here in Australia has 500km between their two production sites & DR. Our biggest service provider has 1000km between production & DR.

      The only time I have seen the same building used is for redundancy & then the two comms/ser

      • Two buildings eh? Like those US companies who had their main and backup systems in the two World Trade Centre Towers in NY. It sure helped them a lot...
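A minimum-separation policy like the ones quoted above (5 km, 500 km, 1000 km) is easy to check mechanically. Here is a minimal Python sketch using the haversine great-circle formula; the coordinates (downtown Calgary production, an Edmonton DR site) and the 5 km floor are illustrative assumptions, not anyone's actual policy:

```python
from math import radians, sin, cos, asin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in kilometres."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 \
        + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371 * asin(sqrt(a))  # 6371 km: mean Earth radius

# Hypothetical sites: downtown Calgary production, Edmonton DR.
prod = (51.0447, -114.0719)
dr = (53.5461, -113.4938)
MIN_SEPARATION_KM = 5  # the policy floor one commenter mentions

distance = haversine_km(*prod, *dr)
assert distance >= MIN_SEPARATION_KM, "DR site too close to production"
print(f"prod/DR separation: {distance:.0f} km")
```

Of course, raw distance is only a proxy: the real requirement is that the two sites share no flood plain, fault line, power grid, or fire department shutdown order.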
  • by sociocapitalist (2471722) on Friday July 13, 2012 @05:40PM (#40643861)

    Whoever designed this should be smacked in the head. You never have critical services relying on a single location. There should be redundancy at every level, including geographic (i.e. not in the same flood / fault / fire zone).

    • Re: (Score:2, Informative)

      by Anonymous Coward

      The issue is that IBM runs Alberta Health Services and other infrastructure from the Shaw building, in which IBM has its own datacenter. IBM had no proper backups in place for these services.

      911, being the most critical, was also not affected; Shaw VoIP users just couldn't call 911 if their lines were down -- obviously (only ~20k people downtown were affected).

    • by jtnix (173853)

      Add 'tornado zone' to that list.

      If you host all your cloud services at Rackspace in Texas and a tornado happens to rip apart their datacenter, well, expect a few hours' or days' downtime. And you had better have offsite backups of mission-critical data, or that's a long bet that is getting shorter every day.

      • by sumdumass (711423) on Friday July 13, 2012 @06:37PM (#40644435) Journal

        This is why I do not understand the rush to cloud space. The same types of outages that apply to locally hosted data apply to cloud providers. You still need the backups, disaster plans, and the ability to access the servers -- much of the same stuff, if not more, than you would need if hosting it yourself. Is the cloud that much cheaper or something? Or is it more about marketing hype that talks PHBs and supervisors who want to sound cool into situations like this, where diligence is not necessarily a priority?

        • It's the warm body on the other end of the line who kisses your ass so well you can hardly yell at them for anything.
        • Re: (Score:2, Insightful)

          by ahodgson (74077)

          The cloud is not cheaper, unless you're doing things really wrong in the first place, like buying tier 1 servers or running Windows.

          It does provide economies of scale, can be somewhat cost-competitive with doing it yourself for at least some things, and you don't have to deal with hardware depreciation and the constant refresh cycle.

          The big cloud providers also integrate a lot of services that would be a pain to build internally for small and mid-sized clients.

          Hype explains the rest. PHBs are always looking

        • Arguably you could still use two different cloud providers after verifying (and continuing to verify over time) that the infrastructure (and connectivity to it) is actually redundant.
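The "continuing to verify over time" part above is the key. One simple way to do it is to keep a history of health-check probes against both providers and flag any interval where both were down at once, since correlated failures are evidence of a shared dependency. A sketch with made-up sample data; in practice the history would come from real probes:

```python
from dataclasses import dataclass

@dataclass
class Sample:
    """One health-check interval: True means the endpoint was reachable."""
    provider_a_up: bool
    provider_b_up: bool

def correlated_outages(history):
    """Count intervals where both providers were down simultaneously --
    evidence that the 'redundant' providers share a failure domain."""
    return sum(1 for s in history
               if not s.provider_a_up and not s.provider_b_up)

history = [
    Sample(True, True),
    Sample(False, True),   # A down alone: failover can cover this
    Sample(False, False),  # both down at once: suspicious shared dependency
    Sample(True, True),
]
print("simultaneous outages:", correlated_outages(history))
```

A nonzero count does not prove a shared fiber or power feed, but it is the cheapest possible alarm that the two providers are less independent than the contracts claim.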

      • Texas and a tornado happens to rip apart their datacenter

        I've always wondered why stuff wasn't built in more 'secure' locations. Stuff like monolithic domes designed to take F5 tornadoes.

        • by drinkypoo (153816)

          I've always wondered why stuff wasn't built in more 'secure' locations. Stuff like monolithic domes designed to take F5 tornadoes.

          Mostly these days it's built in whatever they have available, because actually building anything costs too much. That's got to be responsible in large part for the rise of the shipping container as a data center... it's a temporary, soft-set structure. You only need a permit for the electrical connection, and maybe a pad.

        • by Ol Olsoc (1175323)

          I've always wondered why stuff wasn't built in more 'secure' locations. Stuff like monolithic domes designed to take F5 tornadoes.

          The cost is impressive.

  • by Anonymous Coward on Friday July 13, 2012 @05:49PM (#40643943)

    The issue with the city/provincial critical services is that they didn't have geographical redundancy, due to the cost. Yes, the building had redundant power and networks, but it was the whole building that was affected by this. At the end of the day, Shaw did fuck up, but whoever ran the essential servers fucked up too.

    • The issue with the city/provincial critical services is that they didn't have geographical redundancy, due to the cost. Yes, the building had redundant power and networks, but it was the whole building that was affected by this. At the end of the day, Shaw did fuck up, but whoever ran the essential servers fucked up too.

      Cost should not be an issue when we're talking about life or death critical services that are provided by some level of government. You spend what you have to spend to get the job done right, not more, not less. We're also not talking about a town with a population of 16 but a city with a population of 3,645,257 (in 2011). I am quite sure that they had the means to do this the right way and just chose not to.

  • All these other services lost their internet access; that is all. While I am sure in a perfect world all these government services and companies would have had redundant internet connections, that is often prohibitively expensive.
    • by tlhIngan (30335)

      All these other services lost their internet access; that is all. While I am sure in a perfect world all these government services and companies would have had redundant internet connections, that is often prohibitively expensive.

      Actually, Shaw's a media company -- they do not only internet but phones as well, and those went down too (Shaw has business packages for phone service over cable, but downtown, I'd guess they also have fiber phone service).

      And it really isn't a screwup - they were doing somet

    • by mysidia (191772)

      Redundant internet connections are no guarantee of no single points of failure.

      "Redundant" connections can sometimes wind up on the same fiber somewhere upstream, unbeknownst to the subscriber.

      Most telecommunications infrastructure in any area has some very large aggregation points also... Telco Central Offices; a single point of failure for telecommunications services served by that office.

      What good is working 911 service, if nobody can call in, because all their phones are rendered useless by
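The same-fiber-upstream problem described above can at least be probed for: collect the traceroute path over each "redundant" link and look for hops that appear on both, since any overlap beyond your own router is a candidate single point of failure. A minimal sketch with made-up router addresses (RFC 5737 documentation ranges); real use would feed in actual traceroute/mtr output per link:

```python
def shared_hops(path_a, path_b):
    """Return hops that appear on both 'redundant' paths.
    Overlap beyond the local gateway suggests the links converge upstream."""
    return sorted(set(path_a) & set(path_b))

# Hypothetical traceroute results (router IPs) over two uplinks.
link_1 = ["10.0.0.1", "203.0.113.5", "198.51.100.9", "192.0.2.40"]
link_2 = ["10.0.1.1", "203.0.113.77", "198.51.100.9", "192.0.2.40"]

overlap = shared_hops(link_1, link_2)
print("shared upstream hops:", overlap)
```

This only sees layer-3 convergence; two paths with disjoint router IPs can still ride the same conduit, which is exactly the failure mode the backhoe story illustrates. For that, you have to ask the carriers for the physical route.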

  • Putting all of one's eggs in one basket, all your reactors on one backup generator, or all your data in one place is the reason for these catastrophic failures. Back 30-odd years ago we had mainframes (I used to operate an IBM 360); then along came the internet and the distributed computing model, where the system didn't have all of its data in even one box in a company. There was a box on everyone's desktop. Now that's come full circle with the "Cloud" initiative where all your data is housed in one place (dat

  • There are limitations to how high your HA can be depending on the volume of data you process and the infrastructure available.

    In this case an entire building was knocked out by an exceptional circumstance. You can plan for that by having buildings in multiple sites, but as you get farther apart the connecting infrastructure gets more difficult. In this case Shaw is an ISP (one of the big boys in that part of the country), so you'd expect that access to fast connections should be their forte. One thi

    • Re: (Score:2, Interesting)

      It's kind of funny for me to hear someone call Shaw one of the "big boys" after working on a number of Telecom projects in the USA. In the scheme of things Shaw would only rate being maybe a tier 3 player. Their maximum customer base is maybe 8 million potential people .... not households (and I'm presupposing they are in Saskatchewan and Manitoba now otherwise subtract a couple million). And they compete with Telus and Manitoba Tel and Sasktel (or whoever it is there). That's definitely tier 3 or smaller.

  • by roc97007 (608802) on Friday July 13, 2012 @06:01PM (#40644107) Journal

    > No doubt this has been a hard lesson on how NOT to host critical public services.

    And no doubt the lesson was not learned.

  • Not surprising (Score:3, Informative)

    by Anonymous Coward on Friday July 13, 2012 @06:04PM (#40644135)

    There are buildings all over the US that can have a similar effect, but worse. In Seattle it would be the Westin Tower: knock out the two electrical vaults in that building and you'll pretty much take most phone service, internet service, and various emergency-agency services across the state offline for a while.

    What I now consider a classic example is the outage at Fisher Plaza. It not only took down credit card processors, Bing Travel, and a couple other big online services; it also took out Verizon's FiOS service for western Washington.
    http://www.datacenterknowledge.com/archives/2009/07/03/major-outage-at-seattle-data-center/
    (apologies don't comment a lot and don't know how to properly link)

    The big problem is that many services, no matter how redundant they may seem to be, nowadays have an upstream geographic single point of failure (à la my Westin Tower example).

  • Transformers sometimes fail catastrophically and without warning. Other than keeping transformers outside, such things simply fall under "shit happens". Then once the fire department gets involved you turn off all the power: your backup generators, your UPS, everything.

  • 911 service was not down; only customers using Shaw as their phone service provider were unable to reach it via Shaw's phone service. People were asked to use cell phones to call 911 as an alternative. Sounds like the city's emergency plan was activated and followed, prioritizing and assessing critical services and leaving the other non-essentials offline. Very likely the critical ones are also what is deemed to need redundancy (those probably have more than one ISP) while non-essential services don't.
    • by Svartormr (692822)
      Well, I was standing in a hospital emergency room several hours after the initial service loss, watching the staff fall back on paper systems. And many commonly used services, like finding out what medications a patient was on by checking a shared database used by pharmacists, were unavailable. No single event like this outage should have degraded all these services to uselessness.
  • by Anonymous Coward

    Our primary internet connection was Bell and Shaw was our backup. To our surprise Bell's downtown network relies on Shaw's backbone and was ultimately affected by this monumental single point of failure.

    To get back on the internet without having to fail-over to our DR site we came up with a crazy solution of hooking up a Rogers Rocket Hub. The damn thing worked without our ~85 employees and 3 remote users noticing a difference.

    Over the next few weeks we will be canceling all of our Shaw services, signing up

    • by gen0c1de (977481)

      Have fun with Enmax; their stability record is pretty awesome. Try getting any kind of service on a weekend: their NOC number on weekends goes to a pager and they will call you back within the hour. I deal with them so often it isn't funny. And watch out: they contract out the last mile in many of their build-outs. They're like using Telus and Shaw, and the best part is they won't tell you.

  • Sounds like the datacenter heard about this "cloud" thing and decided to give it a try.
  • by Anonymous Coward on Friday July 13, 2012 @06:18PM (#40644297)

    Shaw had a generator overheat and literally blow up, which damaged their other 2 generators and caused an electrical arc fire. This fire set off the sprinklers, and in turn the water shut down the backup systems.

    Yes, it was stupid that Shaw housed all their critical systems, including backups, in one building but even more stupid was the fact that they used a water based sprinkler system in a bloody telecom room.

    Also, Alberta has this wonderful thing called the Alberta SuperNet, which, if I recall, all health regions used to use before our government decided to spend hundreds of millions of dollars to merge everything together and spend even more money to use the Shaw network to connect everything. The SuperNet was specifically designed with government offices in mind, but nooo, why use something you have already paid for when you can spend more money and use something different?

  • by Megahard (1053072) on Friday July 13, 2012 @06:23PM (#40644339)
    It caused a stampede.
  • by Anonymous Coward

    'City's IT Infrastructure Brought To Its Knees By Data Center Outage'
    Incorrect!! Certain key public and private infrastructure systems were (and still are) housed at the Shaw Court data centre, yes. But the 'City's IT Infrastructure' was certainly ANYTHING BUT 'brought to its knees'. Simply not true, inflated, and blown way out of proportion.

    'This took down a large swath of IT infrastructure, including Shaw's telephone and Internet customers'
    Grossly overstated. 'Large swath' I can accept as the impact w

  • I used to be a City until l took an arrow to the knee.

  • People often walk around with some very bad assumptions about how resilient the Internet or a Cloud must be.

    You may have a very good internet presence with lots of bandwidth, but it may all be housed in the same building, where the same sprinkler system can bring it all down. You may think that ISPs can reroute lots of traffic to other places because it is possible. Yet there are common failure modes there too.

    Cloud computing is often hailed as a very resilient method for infrastructure. Yet, there is a dis

  • Yes, Shaw had an issue, and some local neighbourhood services related to Shaw went down, but more to blame for the large outages is IBM, for housing redundant systems in the same building, or the government customers, for buying a less-than-adequate redundancy solution. Always ask for a physical diagram in addition to the logical diagram :)
