Amazon Outage Shows Limits of Failover 'Zones'

Posted by timothy
from the my-cloud-smells-like-cat-food dept.
jbrodkin writes "For cloud customers willing to pony up a little extra cash, Amazon has an enticing proposition: Spread your application across multiple availability zones for a near-guarantee that it won't suffer from downtime. 'By launching instances in separate Availability Zones, you can protect your applications from failure of a single location,' Amazon says in pitching its Elastic Compute Cloud service. But the availability zones are close together and can fail at the same time, as we saw today. The outage and ongoing attempts to restore service call into question the effectiveness of the availability zones, and put a spotlight on Amazon's failure to provide load balancing between the east and west coasts."

    • by calderra (1034658)
      So setting up a server to remote desktop into your home computer is cloud computing? I have a sinking feeling that "cloud computing" is a lot like web2.0, aka "broadband Geocities".
      • Cloud Computing generally implies redundancy and non-locality 24/7. Computer hardware that makes up the cloud would normally be provisioned to act as a resource and not a point of failure for the entire infrastructure. The idea with Cloud Computing is that the Cloud is an organ while the hardware acts as cells. A few could die off and/or be replaced without any disruption to the user.

        Unfortunately, everyone has their own idea and implementation to creating Cloud based content and services. So we end up wit

        • The incident might be eye opening for some people but the cloud cannot theoretically work because it's not a paradigm. Grid computing is a paradigm. Cloud computing is, as you said, marketspeak describing how providers organize their resources internally. Well that's irrelevant because the provider is the single point of failure. Piss off Amazon for whatever reason, your data becomes unavailable no matter how cloudy it was. It's more "cloudy" to simply replicate data locally and on two different providers.

          • the cloud cannot theoretically work because it's not a paradigm.

            What the fuck is that supposed to mean? It's got words in it, but is entirely vacuous.

            • He's wrong about Cloud not being able to work. But I think I understand his POV. If I'm right, he's basically saying what I've stated. That is to say, Cloud computing is a business solution based idea with the word coined for a marketing purpose. However, Cloud computing is not required to use any specific paradigm to achieve that goal. Grid computing is such a paradigm, and I believe it to be the proper one to use for Cloud computing.

              • by sorak (246725)

                Sorry if I'm dumbing it down too much, but are you arguing that Grid computing is a specific implementation to a technical problem, while cloud computing is a marketing solution to a business problem?

        • by dkf (304284)

          Cloud Computing generally implies redundancy and non-locality 24/7.

          No. It generally implies that you can hire resources (cpu, disk) on short notice and for short amounts of time without costing the earth. You can build high-availability systems on top of that, but HA is not trivial to set up and typically requires significant investment at many levels (hardware, system, application) to attain. Pretend that you can get away with less if you want; I don't care.

      • No need to have a sinking feeling, it's always been that way. The "Cloud" is a buzzword, nothing more, nothing less.

  • by Anonymous Coward on Thursday April 21, 2011 @04:08PM (#35898892)

    Amazon should put their cloud in a cloud, so the cloud will have the redundancy of the cloud.

    • by Lehk228 (705449)
      Yo dawg we heard you liked clouds
    • by Anonymous Coward

      That's funny. But this whole thing is funny. What the blazes is the cloud for, if it fails? Is it not a cloud but a lead balloon? Sure, I understand no system is perfect, but to steal a line from Seinfeld (you're a car rental agency - you're supposed to *hold* the reservation): Amazon WS - you are a cloud! You are supposed to have 99.99% uptime! That's *all* you are supposed to do! Especially when mainframes have 99.9999% uptime, I believe.
      Even distributed systems - which your average web s
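
To put the uptime percentages quoted above in perspective, here is a minimal sketch that converts an availability target into a yearly downtime budget. It takes the commenter's figures at face value (they are not verified SLA numbers), and the function name is made up for illustration.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes; leap years ignored


def allowed_downtime_minutes(availability_percent: float) -> float:
    """Minutes per year a service can be down and still meet the target."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


if __name__ == "__main__":
    # Targets mentioned in the comment above, plus 99.9% for comparison.
    for target in (99.9, 99.99, 99.9999):
        minutes = allowed_downtime_minutes(target)
        print(f"{target}% uptime allows about {minutes:.2f} minutes down per year")
```

By this arithmetic, a 99.99% target leaves roughly 52.6 minutes a year, so an outage measured in hours uses up several years of budget at once.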

    • by aled (228417)

      Amazon should put their cloud in a cloud, so the cloud will have the redundancy of the cloud.

      Wrong! You need to put the cloud in a 'virtual' cloud!

  • by stopacop (2042526) on Thursday April 21, 2011 @04:08PM (#35898906) Homepage
    Not ready for the desktop ;-)
  • I lose access to TWC's DNS servers regularly (yes, I will be setting up my own, when it becomes annoying enough). Although you can do a quick-and-dirty load-balancing by setting them up as follows, there's no redundancy for the customers when there's a link failure.

    search socal.rr.com
    nameserver 209.18.47.61
    nameserver 209.18.47.62

    • or just use 4.2.2.1 and 4.2.2.2.. or 8.8.8.8 and 8.8.4.4.

      • by DaftDev (1864598)
        Or OpenDNS: 208.67.222.222 208.67.220.220
        • Re: (Score:3, Informative)

          by samkass (174571)

          ...and get slow performance on anything delivered via Akamai or similar services which try to use regional data centers.

          OpenDNS and Google DNS are hacks that work increasingly badly.

          • by trapnest (1608791) <janusofzeal@gmail.com> on Thursday April 21, 2011 @05:05PM (#35899750)
            Not that you're wrong, but that's not the fault of the DNS servers, Akamai should be using geolocation by IP, not by the location of DNS servers.
            In fact, I'm not sure how they could be doing geolocation by the client's DNS servers... are you sure about that?
            • by Anonymous Coward

              I'm personally completely sure. I run a recursive DNS server at work for DNS lookups, and I get very different answers for www.akamai.com when I manually query 8.8.8.8 vs our own recursive DNS server.

              It doesn't help that a RTT to the IP returned by 8.8.8.8 is over 200 ms, but the latter is around 10 ms. (I'm in New Zealand, 200 ms RTTs to popular, US based, websites is very normal. Heck, I have 228 ms RTT pinging slashdot.org right now on my home DSL. Clearly, using our own recursive DNS, I'm hitting an

            • by The Bean (23214)

              Typically your computer asks your firewall/router for a DNS lookup. It relays that to your ISP's DNS server. Your ISP looks up the DNS server responsible for the domain and contacts that server and sends your original request. That request doesn't include your IP however, so Akamai's DNS servers are returning regional specific servers based on your ISP's DNS server IP/geo-location. That's usually perfectly acceptable, since presumably your ISP's DNS server would be located on a good route with a low pin
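
The lookup path described above can be sketched as a toy simulation. This is not how Akamai or any real CDN is implemented; the tables, region names, and addresses (taken from the RFC 5737 documentation ranges) are all invented for illustration. The point is only that the authoritative server sees the resolver's address, never the client's.

```python
# Hypothetical mapping from resolver address to the region it appears to be in.
RESOLVER_REGION = {
    "192.0.2.53": "nz",     # an ISP resolver close to the user
    "198.51.100.53": "us",  # a public resolver hosted far away
}

# Hypothetical edge node per region, plus a fallback for unknown resolvers.
EDGE_BY_REGION = {"nz": "203.0.113.10", "us": "203.0.113.20"}
DEFAULT_EDGE = "203.0.113.20"


def pick_edge(resolver_ip: str) -> str:
    """Pick an edge node using only the resolver's address.

    The client's own address is deliberately not a parameter: in the flow
    described above, the authoritative server never receives it.
    """
    region = RESOLVER_REGION.get(resolver_ip)
    return EDGE_BY_REGION.get(region, DEFAULT_EDGE)


if __name__ == "__main__":
    # The same client gets different edges depending on which resolver it uses.
    print("via ISP resolver:   ", pick_edge("192.0.2.53"))
    print("via public resolver:", pick_edge("198.51.100.53"))
```

Under these assumptions, switching resolvers changes the answer even though the client has not moved, which matches the different results reported in this thread.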

          • by guruevi (827432)

            Actually, when you're on TWC you might get BETTER performance with OpenDNS than with their own DNS. When using the TWC DNS I can't get a 1080p without 10m of loading time or even a non-stuttering 720p stream from YouTube or Netflix. With OpenDNS or Google DNS I get much better performance. Also, if you're an AT&T Business customer, OpenDNS works much better with DNS-based RBL's like Spamhaus which AT&T blocks.

          • by Anonymous Coward

            No. DNS-based geo-location caching schemes are the culprit. It works off a bad assumption that makes using an alternative DNS server a pain. I don't like my ISP's DNS servers. They hijack domain typos as a revenue stream, so I consider them hostile and ignore them when I can.

            Using google or opendns, however, will cause havoc for a couple of surprisingly common things I've experienced problems with:
            Akamai
            Itunes
            Hotmail
            Rackspace hosted exchange service
            Netflix
            Youtube

            Fortunately you can configure your network's

          • I don't believe what you say is true, at least not anymore.

            I'm in Malaysia, a bandwidth-constrained country 200ms from the USA. Using the wrong CDN node makes a huge difference.

            When I use 8.8.8.8 to find www.apple.com, I get e3191.c.akamaiedge.net, which is 17ms from my house. That's as good as it's going to get.

            Perhaps Google has started using source IPs for its DNS queries that match the client's location?

          • Sorry, I should have included the IP, since that's the location-sensitive part. I get e3191.c.akamaiedge.net as 118.215.101.15.
  • by Dan667 (564390) on Thursday April 21, 2011 @04:13PM (#35898978)
    or use a completely different company for redundancy. I think that is the lesson here.
    • by rudy_wayne (414635) on Thursday April 21, 2011 @04:35PM (#35899316)

      This incident illustrates once again why you need to put your stuff on your own servers and not someone else's. All computer systems will fail occasionally. There's no such thing as 100% uptime. However, when your own servers fail you can get your own people working on it right away and it's their number one priority. When your stuff is on someone else's servers, you're at their mercy. It will get fixed when they get around to it, and, they have more customers than just you, so you might not be first on the priority list. Or second. Or third. Or tenth.

      • by DdJ (10790)

        This incident illustrates once again why you need to put your stuff on your own servers and not someone else's.

        Well. Or put your stuff on your own servers as well as someone else's. Cloning your services into various clouds isn't insane as a tool for handling some types of unplanned scaling requirements or some types of unplanned outages. Relying on those clouds introduces risks that were just demonstrated.

        • by vrmlguy (120854)

          This incident illustrates once again why you need to put your stuff on your own servers and not someone else's.

          Well. Or put your stuff on your own servers as well as someone else's. Cloning your services into various clouds isn't insane as a tool for handling some types of unplanned scaling requirements or some types of unplanned outages. Relying on those clouds introduces risks that were just demonstrated.

          It's probably worth noting that EMC makes a cloud storage product called Atmos with an API essentially identical to Amazon's S3 service. The main difference is that the HTTP headers start with x-emc instead of x-amz, so a properly written application running on non-Amazon servers could switch fairly easily between the two for load balancing or redundancy.
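
A minimal sketch of the header-prefix swap described above, assuming the only difference really is `x-amz` versus `x-emc` as the comment says. Real requests would also differ in endpoints and request signing, which this does not attempt; the function name and the sample header are hypothetical.

```python
def retarget_headers(
    headers: dict[str, str],
    old_prefix: str = "x-amz-",
    new_prefix: str = "x-emc-",
) -> dict[str, str]:
    """Return a copy of headers with vendor-prefixed names rewritten.

    Header names are matched case-insensitively; other headers pass through
    unchanged.
    """
    out = {}
    for name, value in headers.items():
        if name.lower().startswith(old_prefix):
            name = new_prefix + name[len(old_prefix):]
        out[name] = value
    return out


if __name__ == "__main__":
    # Illustrative request headers; the metadata key is made up.
    s3_style = {"x-amz-meta-owner": "ops", "Content-Type": "text/plain"}
    print(retarget_headers(s3_style))
```

Keeping the prefix in one place like this is what would let an application choose its storage backend from configuration rather than code.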

      • ...or, have your stuff on the same servers with their Most Important Customer.

        xD

      • by dkf (304284)

        This incident illustrates once again why you need to put your stuff on your own servers and not someone else's.

        Hosting everything yourself? Can we sell you a contract for us to build you a datacenter? Then there's the ongoing costs of actually operating it.

        Or were you thinking that a scavenged rack in a old closet previously only used by the janitor was a substitute?

      • by hey! (33014)

        Nah. It shows that when you buy a product or service you need to understand what you are paying for, not extrapolate from a buzzword like "cloud".

        You can't make a blanket statement one way or another about using something like EC2 without considering the user's needs and capabilities. There may be users who'd find the recent outage intolerable; they probably shouldn't be using EC2. But if they have good reasons to consider EC2 chances are they are going to spend more money.

      • Pop quiz:
        You're a small company that does software development. You need servers to do deployment testing, basically just apache and the customized package. Uptime is a must, and your budget is limited.

        Do you...
        A) Spend tens of thousands on servers, plus backup power, plus racks, plus redundant switches, plus dual WAN links, plus a backup solution (for 10 servers, so far you're looking at ~$35k, plus a thousand a month on WAN links)
        or
        B) Trust that Amazon will have FAR better uptime than you could EVER dream

        • by Anonymous Coward

          Really, it depends on the financial hit your organization will take by downtime/lost productivity/lost business/lost confidence in the ability of your organization to be able to deliver your product. If being down for 12-15 hours or more will cost you more than $35k then yes it makes sense for you to roll your own solution or to use traditional dedicated hosting providers in a H/A configuration. Every organization needs to perform their own risk analysis. If downtime that is out of your control is accept
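
The risk analysis suggested above can be sketched as simple break-even arithmetic, using the ~$35k upfront and $1,000 a month figures from the pop quiz. The function names are invented, and the model is deliberately crude: it assumes self-hosting has no downtime of its own and ignores staff costs.

```python
def self_hosting_cost(upfront: float, monthly: float, months: int) -> float:
    """Total cost of running your own hardware over the period."""
    return upfront + monthly * months


def break_even_hourly_loss(
    upfront: float, monthly: float, months: int, outage_hours: float
) -> float:
    """Hourly outage cost above which self-hosting pays for itself.

    outage_hours is the downtime you expect from the provider over the same
    period; it must be greater than zero.
    """
    return self_hosting_cost(upfront, monthly, months) / outage_hours


if __name__ == "__main__":
    total = self_hosting_cost(35_000, 1_000, 12)
    per_hour = break_even_hourly_loss(35_000, 1_000, 12, 15)
    print(f"first-year self-hosting cost: ${total:,.0f}")
    print(f"break-even outage cost: ${per_hour:,.0f} per hour over 15 hours")
```

If an hour of downtime costs the business less than the break-even figure, the numbers in this sketch favour staying on the hosted service.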

        • by The Bean (23214)

          Reddit's downtime has been a bit of a running joke for a while now, with most (all?) of it being blamed on Amazon.

          The way they implemented things is one of the big issues. For example, things like setting up RAID volumes across multiple EBS volumes. They just magnified their exposure to any issues in the cloud. Any one machine goes down the system gets hosed and needs recovery. They also are constrained to a single availability zone in order to get the performance they need from their setup. (This is n

          • Basically what you're saying is you can't just throw the "cloud" around like it's a magical fix-all; and that's true. But every time one of these "big company goes down" stories arises, people seem to take that as proof that the cloud is not useful for anything, and I would challenge that assertion. There are a number of times where you need to rapidly expand, or where you need good uptime and scalability but don't have a big budget; and for that, the cloud really shines.

      • No. If a zone goes down in CA, I can have a new server up in Virginia within minutes. I would rather be on ec2 when I go down. I guarantee I will be back up faster than you.

      • by craigbeat (706827)
        The company I work for hosts on another very large company (that had a lot of downtime for another reason a few years back), on dedicated servers. Believe me when I say we have as many problems with them. So far, there have been no problems for us using Amazon. I think it depends on your needs. Multiple redundancy is probably a better solution, but nothing is perfect yet.
    • by gad_zuki! (70830) on Thursday April 21, 2011 @04:47PM (#35899480)

      So wait. The cloud sales pitch is "no more servers-save money-cut IT staff" but now it's:

      1. Virtualized servers in zone 1
      2. Virtualized servers in zone 2
      3. Virtualized servers from a different company altogether.

      So I went from one solid server, good backups, maybe a hot backup, and talented staff running the show to outsourced to 3 different clouds with hour-long hold times with some Amazon support monkey? Genius.

      • One? Lose a raid stack and you're toast. It's always been at least N+1 redundancy for your tier one crap. The cloud stuff is there so you can scale up quickly. Shouldn't be base load or anything.
      • by Artifex (18308)

        So I went from one solid server, good backups, maybe a hot backup, and talented staff running the show to outsourced to 3 different clouds with hour-long hold times with some Amazon support monkey? Genius.

        I hope that one good server is in a disparate geographical location from its hot backup, using a separate transit provider, each server has redundant power supplies, and your talent has a bus factor of (#servers)+1 or more. You're gonna need backup for any load balancing as well, and whether that should be in yet another location is, well, something to consider.

        Cloud services should give you the redundancy you need, as well as being easily scalable. Why are you trying to say the whole concept is bad just be

        • by dhasenan (758719)
          I would expect Amazon's marketing to indicate that these units within a region are a way to get fast communication between them without wholly losing redundancy. As such, it's a middle-tier option, not best at anything (you'd have the machines in the same data center if they really needed the bandwidth, and in separate regions if you really needed the redundancy). If I'm wrong about that, then the marketing people who handled that should be dismissed.
        • just because Amazon's implementation is flawed?

          Which provider's implementation is not flawed?

    • by fermion (181285)
      If one can afford that kind of redundancy, then sure. Two independent lines coming in from two independent providers that individually will adequately handle all traffic for an extended period of time. Independent arrays of pc computers hooked to independent load balancers that will not fall over if something happens to one line or a large number of computers. One could also have big iron with a 6 nine reliability hooked to redundant lines. In any case backup power to keep all the equipment up for a long p
      • If one can afford that kind of redundancy, then sure. Two independent lines coming in from two independent providers that individually will adequately handle all traffic for an extended period of time.

        Why would you do that? It's enough to do things like run two DCs that can each handle 60% load or three that each handle 40% load. Not that much more expensive, and downtime turns into "the site is slow". There are architectural concerns, especially with data replication, but this is definitely doable, and it doesn't cost a mint.
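
The sizing argument above works out as follows. This is a minimal sketch with an invented function name, treating each data center's capacity as a fraction of peak load and assuming load redistributes evenly across survivors.

```python
def capacity_after_failures(
    per_site_capacity: float, sites: int, failed: int = 1
) -> float:
    """Fraction of peak load still servable after `failed` sites go down."""
    surviving = max(sites - failed, 0)
    return surviving * per_site_capacity


if __name__ == "__main__":
    # The two layouts from the comment above: 2 x 60% and 3 x 40%.
    for capacity, sites in ((0.6, 2), (0.4, 3)):
        normal = capacity_after_failures(capacity, sites, failed=0)
        degraded = capacity_after_failures(capacity, sites, failed=1)
        print(f"{sites} sites at {capacity:.0%}: "
              f"{normal:.0%} normally, {degraded:.0%} with one site down")
```

Both layouts provision 120% of peak in total, but losing one of three sites leaves 80% of peak capacity while losing one of two leaves 60%, which is the "site is slow" outcome rather than a full outage.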

  • philosophical POV (Score:2, Insightful)

    by Anonymous Coward

    I'll take the philosophical point of view on this and say failures are the best way to find and diagnose systemic weaknesses. Now Amazon knows the weakness in the AZs and can fix it.

  • and only heard, "Cloud, cloud, cloud! It's new and shiny and cheaper than those annoying internal IT guys so I get a bonus!" learn to pay the stupidity tax.

    Next up, learning just how *much* of your cloud data has been stolen and resold by those trustworthy souls in China and India.

    Cheers!

    • Outsource IT and you outsource responsibility as well. If your own department fucks up, the top brass will come looking for you. However, if you outsource and the service provider messes up, you can shift the blame to them especially in case of big disasters like these. As long as you can show that you've managed the SLA's well and that it's them who didn't keep to their promises, you're good. More likely you'll find that those SLA's were crap to begin with, which is also fine, because it's likely your b
    • by Tackhead (54550)

      And thus the gullible managers who ignored IT... and only heard, "Cloud, cloud, cloud! It's new and shiny and cheaper than those annoying internal IT guys so I get a bonus!" learn to pay the stupidity tax.

      C'mon. All managers love cloud!

      What rolls down stairs, fails over in pairs,
      Leaks data when it's allowed?
      A stupidity tax, it replaces your racks,
      It's cloud, cloud, cloud!

      It's cloud! It's cloud! It's new, it's shiny, it's cheap!
      It's cloud! It's cloud! It's down, and now you'll weep.

      Everything's

    • by xtracto (837672)

      It is a sad joke. Even for sites like Reddit whose administrators are supposed to know better, the Amazon shit hit. And the terrible thing is that it is not the first time that Amazon's service has broken, this has happened quite a lot in the last months, and people still *pay* for the service. Crazy.
       

    • and only heard, "Cloud, cloud, cloud! It's new and shiny and cheaper than those annoying internal IT guys so I get a bonus!" learn to pay the stupidity tax.

      Next up, learning just how *much* of your cloud data has been stolen and resold by those trustworthy souls in China and India.

      Cheers!

      As a developer from India, I can tell you India has absolutely nothing to do with the cloud.. that's a US revolution.. we are still years behind..

  • maybe the failure was on purpose to promote another revenue stream. Hmmmmmmmmm......................
    • Re: (Score:3, Insightful)

      by elohel (1582481)
      Okay, I had to log in simply to comment on the stupidity of this statement. Aside from now being in violation of their own ToS (probably, at least in transgression of up-time guarantees), they're undoubtedly fiscally liable for refunding payment for the period of time in which services were unavailable or degraded. Additionally, this dramatically hurts their brand name - I know if I ever have to host anything on 'the cloud' (I can't believe I said it), this incident will be on my mind when the time comes f
      • by klui (457783)

        Reddit has been down for approx 24 hours--it's been on RO mode for most of the day. Pretty bad PR for Amazon.

  • Do you think Amazon would allow its own sales and services to be impacted for 12 hours (and running) under any circumstances short of the recent disaster in Japan? EC2 customers, on the other hand, appear to be second-class citizens.

  • "Availability Zones", "Failure Domains", etc. must be done with absolute perfection if you do them at all. If your gargantuan application has some single tiny side-feature that is not replicated across domains, your whole app is going down.

    True Story: I was doing some consulting work for a large bank after they had a bunch of problems. Their main website had all the super-available trimmings: Oracle RAC, multi-site server clustering, storage mirroring, all the fancy, expensive, highly-available crap you

    • You could code your application to be tolerant of those kinds of outages. The services backing feature X aren't available? Then don't render the controls for feature X on the page.
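
A minimal sketch of that idea: each feature carries a health check for its backing service, and the page only includes features whose check passes. The names and the shape of the health checks are assumptions for illustration, not any particular framework's API.

```python
from typing import Callable, Dict, List


def visible_features(features: Dict[str, Callable[[], bool]]) -> List[str]:
    """Return the features whose backing service reports healthy.

    A health check that raises is treated the same as one returning False,
    so a broken side feature is hidden instead of taking the page down.
    """
    visible = []
    for name, is_healthy in features.items():
        try:
            ok = is_healthy()
        except Exception:
            ok = False  # an erroring check counts as "unavailable"
        if ok:
            visible.append(name)
    return visible


if __name__ == "__main__":
    def recommendations_check() -> bool:
        # Stand-in for a call to a backing service that is currently down.
        raise ConnectionError("recommendations service unreachable")

    page = visible_features({
        "account_summary": lambda: True,
        "recommendations": recommendations_check,
    })
    print("rendering controls for:", page)
```

In practice the checks would need timeouts and caching so that the page does not stall on a dead dependency; both are omitted here for brevity.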

  • You're supposed to FAILOVER between them, not load balance between them.

    You can't hold amazon accountable for your own stupidity.

    Beyond that, you have to ask yourself the question: how many outages would you have had with your own facility in the past year compared to this outage? Did you apply the same approach to your use of EC2 as you would to your own facility?

    • You can't hold amazon accountable for your own stupidity.

      I'm pretty sure you don't really understand what happened.

      First, they're called Availability Zones. Not to be pedantic, but I just want you to be able to have the correct terminology if you want to read up on this.

      Secondly, a failure in one AZ took out an entire Region. This is NOT supposed to happen. Each AZ is supposed to be considered as a separate datacenter in your application (separate power source, separate facility, separate uplink, etc.) AZs are supposed to be isolated from failures in other AZs.

      Li

  • It's been like 7 years, how's everyone doing? :)

    • I started coming back over here months ago. Reddit is just getting too spammy and filled with... well... digg users.
      • I'm back here because Reddit is down too.

        Comment system is not bad.

        Stories are good. ...but there just aren't very many of them!

        • lol, I know. I'm still in the habit of refreshing the front page expecting something new :p I think the thing I always loved about slashdot was the moderation system though. karma actually *means* something and moderation is rare and not to be wasted.
  • Change.Org says that for the past several days the Chinese have been DDoSing it over a petition they are posting to gather support for Ai WeiWei.

    http://blog.change.org/2011/04/chinese-hackers-attack-change-org-platform-in-reaction-to-ai-weiwei-campaign/ [change.org]

    But if you go to the Change.Org site to sign the petition, you get a message saying that something is wrong with their servers, which are at Amazon.

    http://www.change.org/petitions/call-for-the-release-of-ai-weiwei [change.org]

    http://status.aws.amazon.com/ [amazon.com]

    http://www.comput [computerworld.com]

  • Since apparently no one's actually looked into the issue beyond "ZOMG the cloud is down," here's some info from Amazon:

    8:54 AM PDT We'd like to provide additional color on what we're working on right now (please note that we always know more and understand issues better after we fully recover and dive deep into the post mortem). A networking event early this morning triggered a large amount of re-mirroring of EBS volumes in US-EAST-1. This re-mirroring created a shortage of capacity in one of the US-EAST-1 Availability Zones, which impacted new EBS volume creation as well as the pace with which we could re-mirror and recover affected EBS volumes. Additionally, one of our internal control planes for EBS has become inundated such that it's difficult to create new EBS volumes and EBS backed instances. We are working as quickly as possible to add capacity to that one Availability Zone to speed up the re-mirroring, and working to restore the control plane issue. We're starting to see progress on these efforts, but are not there yet. We will continue to provide updates when we have them.

    So the engineers failed to foresee a potential hazard. Hardly something to get worked up about, especially for a relatively young technology.

  • Downtime comes from people. The more people involved, the more downtime you'll have.

  • I don't necessarily hate the marketing concept of 'The Cloud', but I am fascinated by the business decisions and risk acceptance that organisations are willing to take, i.e. the typical: "Demanding high availability and hot failover, instantaneous incident resolution, and 'we are your primary customer'... but also a low cost." I think that Amazon and their competitors *may* get there with their offerings, but until there is a bit more maturity, I expect to see more incidents like this.

    My wild guess is
  • by kriston (7886) on Thursday April 21, 2011 @11:07PM (#35903164) Homepage Journal

    Amazon and Microsoft have two distinctly different views of "cloud computing."

    When I first learned about "cloud computing" I automatically assumed it meant that there would be an arbitrary number of different services available to an arbitrary number of web servers which would then be served to the user. No one service would depend on the other.

    Amazon's "cloud computing" is centralized upon the virtual machine as the hub of the "cloud." Microsoft Azure, on the other hand, originally offered the approach that I had thought about, where everything is just a service, no VM required.

    Today Amazon still depends heavily on the VM concept. You can't have a web service on Amazon without one. This also makes it excessively difficult to "load balance" or provide "failover" because you are actually expected to stand up new VM instances to scale up and down and need separate VM instances on each "availability zone." In addition it's not easy or affordable to share data between availability zones. This isn't what I thought the cloud was going to be.

    Microsoft eventually added VMs to its Azure service so they could compete with Amazon's VM-centralized concept. I still think the idea of separate, independent services talking to each other was what the "cloud" was supposed to be, and if these services didn't have to depend on these VMs (which they do not have access to because AWS is intermittently down) they would have still been working from the other data centers.

    • `Amazon's "cloud computing" is centralized upon the virtual machine as the hub of the "cloud."'

      I think you hit-the-nail-on-the-head there, a centralized anything is always vulnerable to this kind of failure. For a business with multiple locations a number of servers sited locally in a peer-to-peer configuration would provide a more reliable service. All they rely on is an end-to-end IP connection. If one site goes then the rest can carry on. I do believe this whole cloud computing concept has been over s

  • As what I would consider a medium-weight AWS user (our account is about 4 grand a month) I am still quite happy with AWS. We built our system across multiple availability zones, all in us-east and had zero downtime today as a result. We had a couple of issues where we tried to scale up to meet load levels and couldn't spin up anything in us-east-1a (or if we could, we couldn't attach it successfully to a load balance because of internal connectivity issues), but we spun up a new instance in us-east-1b and a
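
The approach described above (try one availability zone, fall back to the next when a launch fails) can be sketched roughly like this. The `launch` callable and `LaunchError` are hypothetical stand-ins for whatever provisioning call an application actually uses; this is not the AWS API.

```python
from typing import Callable, Iterable, Optional


class LaunchError(Exception):
    """Raised by the (hypothetical) launcher when a zone cannot start an instance."""


def launch_with_fallback(
    launch: Callable[[str], str], zones: Iterable[str]
) -> Optional[str]:
    """Try each zone in order; return the first instance id, or None if all fail."""
    for zone in zones:
        try:
            return launch(zone)
        except LaunchError:
            continue  # this zone is unavailable, move on to the next one
    return None


if __name__ == "__main__":
    def fake_launch(zone: str) -> str:
        # Simulates the situation in the comment: us-east-1a refuses new instances.
        if zone == "us-east-1a":
            raise LaunchError(f"no capacity in {zone}")
        return f"instance-in-{zone}"

    print(launch_with_fallback(fake_launch, ["us-east-1a", "us-east-1b"]))
```

The fallback only helps if the application's data and load balancer are reachable from the second zone too, which is the part the comment above says took deliberate design.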

    • by The Bean (23214)

      I'd give you the good rating. You used the service in a sane manner that exploited the strengths of the system and avoided the weaknesses.

      I suspect many users of EC2 actually end up with less reliability than they'd get with a server in a closet, as they don't realize the true effort it takes to have an effective solution like you do.

  • I have studied high availability systems and I have developed some. It's not too difficult to come up with a decent design that should guarantee extremely high availability. However, there always will remain assumptions and external factors which influence your system. The most tedious bits are systems and components that "never" fail (like simple NICs) for which you will not get any attention whatsoever.

    100% availability is a myth. I know of a case where IBM's zero downtime operating system (z/OS) went d
  • "For cloud customers willing to pony up a little extra cash, Amazon has an enticing proposition: Spread your application across multiple availability zones for a near-guarantee that it won't suffer from downtime

    I would have thought the entire raison d'etre of moving to the Cloud was to eliminate downtime, else why not rent two boxes in different locations and achieve this near-guarantee uptime without the extra expense, not to mention your data totally disappearing when the Cloud goes down ...

  • ... the "Bleeding Edge" for nothing. New technology always comes with teething problems.
