Amazon Outage Shows Limits of Failover 'Zones'
jbrodkin writes "For cloud customers willing to pony up a little extra cash, Amazon has an enticing proposition: Spread your application across multiple availability zones for a near-guarantee that it won't suffer from downtime. 'By launching instances in separate Availability Zones, you can protect your applications from failure of a single location,' Amazon says in pitching its Elastic Compute Cloud service. But the availability zones are close together and can fail at the same time, as we saw today. The outage and ongoing attempts to restore service call into question the effectiveness of the availability zones, and put a spotlight on Amazon's failure to provide load balancing between the east and west coasts."
Re: (Score:3)
For a little extra money, you can get a seat in my biplane, with the extra wings.
Re: (Score:2)
Fools, I have this: http://en.wikipedia.org/wiki/Caproni_Ca.4 [wikipedia.org]
Bow down before the three wings and two engines. Seats are going fast, order a spot today.
Re: (Score:2)
Canadian Host?
http://www.youtube.com/watch?v=LAYMJnO9LBQ [youtube.com]
- Dan.
Re: (Score:2)
Amazon: where failover meets overfail.
It has to be embarrassing that a single incident brought down multiple "availability zones" (at least for EBS, maybe other parts of EC2), as that's just what they were supposed to be safe from. Hmm, "overfail", I like it.
Re: (Score:1)
Yay (Score:2)
Yay, cloud. http://www.youtube.com/watch?v=Lel3swo4RMc [youtube.com]
Re: (Score:1)
Re: (Score:2)
Re: (Score:2)
The incident might be eye opening for some people but the cloud cannot theoretically work because it's not a paradigm. Grid computing is a paradigm. Cloud computing is, as you said, marketspeak describing how providers organize their resources internally. Well that's irrelevant because the provider is the single point of failure. Piss off Amazon for whatever reason, your data becomes unavailable no matter how cloudy it was. It's more "cloudy" to simply replicate data locally and on two different providers.
Re: (Score:2)
the cloud cannot theoretically work because it's not a paradigm.
What the fuck is that supposed to mean? It's got words in it, but is entirely vacuous.
Re: (Score:2)
Re: (Score:2)
Sorry if I'm dumbing it down too much, but are you arguing that Grid computing is a specific implementation to a technical problem, while cloud computing is a marketing solution to a business problem?
Re: (Score:2)
Re: (Score:2)
Re: (Score:3)
Cloud Computing generally implies redundancy and non-locality 24/7.
No. It generally implies that you can hire resources (cpu, disk) on short notice and for short amounts of time without costing the earth. You can build high-availability systems on top of that, but HA is not trivial to set up and typically requires significant investment at many levels (hardware, system, application) to attain. Pretend that you can get away with less if you want; I don't care.
Re: (Score:2)
No need to have a sinking feeling, it's always been that way. The "Cloud" is a buzzword, nothing more, nothing less.
Let us learn from Xzibit (Score:4, Funny)
Amazon should put their cloud in a cloud, so the cloud will have the redundancy of the cloud.
Re: (Score:2)
Re: (Score:1)
That's funny. But this whole thing is funny. What the blazes is the cloud for, if it fails? Is it not a cloud but a lead balloon? Sure, I understand no system is perfect, but to steal a line from Seinfeld (you're a car rental agency - you're supposed to *hold* the reservation): Amazon WS - you are a cloud! You are supposed to have 99.99% uptime! That's *all* you are supposed to do! Especially when mainframes have 99.9999% uptime, I believe.
Even distributed systems - which your average web s
Re: (Score:2)
Amazon should put their cloud in a cloud, so the cloud will have the redundancy of the cloud.
Wrong! You need to put the cloud in a 'virtual' cloud!
Cloud computing (Score:5, Funny)
Re: (Score:1)
Re: (Score:2)
What do you mean?
I've had a cloud in my desktop machine for years.
The path to it is "/"
sounds like TWCs DNS servers (Score:2)
I lose access to TWC's DNS servers regularly (yes, I will be setting up my own, when it becomes annoying enough). Although you can do a quick-and-dirty load-balancing by setting them up as follows, there's no redundancy for the customers when there's a link failure.
search socal.rr.com
nameserver 209.18.47.61
nameserver 209.18.47.62
Re: (Score:3)
or just use 4.2.2.1 and 4.2.2.2.. or 8.8.8.8 and 8.8.4.4.
Re: (Score:1)
Re: (Score:3, Informative)
...and get slow performance on anything delivered via Akamai or similar services which try to use regional data centers.
OpenDNS and Google DNS are hacks that work increasingly badly.
Re:sounds like TWCs DNS servers (Score:4, Interesting)
In fact, I'm not sure how they could be doing geolocation by the client's DNS servers... are you sure about that?
Re: (Score:1)
I'm personally completely sure. I run a recursive DNS server at work for DNS lookups, and I get very different answers for www.akamai.com when I manually query 8.8.8.8 vs our own recursive DNS server.
It doesn't help that a RTT to the IP returned by 8.8.8.8 is over 200 ms, but the latter is around 10 ms. (I'm in New Zealand, 200 ms RTTs to popular, US based, websites is very normal. Heck, I have 228 ms RTT pinging slashdot.org right now on my home DSL. Clearly, using our own recursive DNS, I'm hitting an
Re: (Score:2)
Typically your computer asks your firewall/router for a DNS lookup. It relays that to your ISP's DNS server. Your ISP looks up the DNS server responsible for the domain and contacts that server and sends your original request. That request doesn't include your IP however, so Akamai's DNS servers are returning regional specific servers based on your ISP's DNS server IP/geo-location. That's usually perfectly acceptable, since presumably your ISP's DNS server would be located on a good route with a low pin
Re: (Score:3)
Re: (Score:1)
No. DNS-based geo-location caching schemes are the culprit. It works off a bad assumption that makes using an alternative DNS server a pain. I don't like my ISP's DNS servers. They hijack domain typos as a revenue stream, so I consider them hostile and ignore them when I can.
Using google or opendns, however, will cause havoc for a couple of surprisingly common things I've experienced problems with:
Akamai
Itunes
Hotmail
Rackspace hosted exchange service
Netflix
Youtube
Fortunately you can configure your network's
Re: (Score:2)
I don't believe what you say is true, at least not anymore.
I'm in Malaysia, a bandwidth-constrained country 200ms from the USA. Using the wrong CDN node makes a huge difference.
When I use 8.8.8.8 to find www.apple.com, I get e3191.c.akamaiedge.net, which is 17ms from my house. That's as good as it's going to get.
Perhaps Google has started using source IPs for its DNS queries that match the client's location?
Re: (Score:2)
have your own servers (Score:5, Insightful)
Re:have your own servers (Score:5, Insightful)
This incident illustrates once again why you need to put your stuff on your own servers and not someone else's. All computer systems will fail occasionally. There's no such thing as 100% uptime. However, when your own servers fail you can get your own people working on it right away and it's their number one priority. When your stuff is on someone else's servers, you're at their mercy. It will get fixed when they get around to it, and, they have more customers than just you, so you might not be first on the priority list. Or second. Or third. Or tenth.
Re: (Score:2)
Well. Or put your stuff on your own servers as well as someone else's. Cloning your services into various clouds isn't insane as a tool for handling some types of unplanned scaling requirements or some types of unplanned outages. Relying on those clouds introduces risks that were just demonstrated.
Re: (Score:3)
Well. Or put your stuff on your own servers as well as someone else's. Cloning your services into various clouds isn't insane as a tool for handling some types of unplanned scaling requirements or some types of unplanned outages. Relying on those clouds introduces risks that were just demonstrated.
It's probably worth noting that EMC makes a cloud storage product called Atmos with an API essentially identical to Amazon's S3 service. The main difference is that the HTTP headers start with x-emc instead of x-amz, so a properly written application running on non-Amazon servers could switch fairly easily between the two for load balancing or redundancy.
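A rough sketch of the header swap described above, in Python. It takes the comment's claim at face value: that the two APIs differ mainly in the vendor prefix. Whether every header maps one-to-one, and how request signing differs, is an assumption not checked here, and `translate_headers` is a hypothetical helper, not part of either vendor's SDK.

```python
def translate_headers(headers, src_prefix="x-amz-", dst_prefix="x-emc-"):
    """Return a copy of `headers` with vendor-prefixed names rewritten.

    Assumption: the two services accept the same header semantics and
    differ only in the prefix. Real requests would also need re-signing.
    """
    translated = {}
    for name, value in headers.items():
        lower = name.lower()
        if lower.startswith(src_prefix):
            # Swap the vendor prefix, keep the rest of the header name.
            translated[dst_prefix + lower[len(src_prefix):]] = value
        else:
            # Standard HTTP headers pass through untouched.
            translated[name] = value
    return translated


s3_style = {"x-amz-date": "Thu, 21 Apr 2011 12:00:00 GMT",
            "Content-Type": "text/plain"}
atmos_style = translate_headers(s3_style)
```

An application that builds its requests through one small layer like this could, in principle, point at either backend by changing the prefix and the endpoint.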
Re: (Score:1)
...or, have your stuff on the same servers with their Most Important Customer.
xD
Re: (Score:2)
This incident illustrates once again why you need to put your stuff on your own servers and not someone else's.
Hosting everything yourself? Can we sell you a contract for us to build you a datacenter? Then there's the ongoing costs of actually operating it.
Or were you thinking that a scavenged rack in a old closet previously only used by the janitor was a substitute?
Re: (Score:3)
Nah. It shows that when you buy a product or service you need to understand what you are paying for, not extrapolate from a buzzword like "cloud".
You can't make a blanket statement one way or another about using something like EC2 without considering the user's needs and capabilities. There may be users who'd find the recent outage intolerable; they probably shouldn't be using EC2. But if they have good reasons to consider EC2 chances are they are going to spend more money.
Re: (Score:3)
Pop quiz:
You're a small company that does software development. You need servers to do deployment testing, basically just Apache and the customized package. Uptime is a must, and your budget is limited.
Do you...
A) Spend tens of thousands on servers, plus backup power, plus racks, plus redundant switches, plus dual WAN links, plus a backup solution (for 10 servers, so far you're looking at ~$35k, plus a thousand a month on WAN links)
or
B) Trust that Amazon will have FAR better uptime than you could EVER dream
Re: (Score:1)
Really, it depends on the financial hit your organization will take by downtime/lost productivity/lost business/lost confidence in the ability of your organization to be able to deliver your product. If being down for 12-15 hours or more will cost you more than $35k then yes it makes sense for you to roll your own solution or to use traditional dedicated hosting providers in a H/A configuration. Every organization needs to perform their own risk analysis. If downtime that is out of your control is accept
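The comparison above can be put into a few lines of Python. This is only a back-of-the-envelope sketch: the $35k up-front and $1,000/month figures come from the parent comment, while the hourly downtime cost, the 12-month horizon, and the (generous) assumption that the self-hosted setup has no outages of its own are all made up for illustration.

```python
def self_hosting_pays_off(downtime_cost_per_hour, expected_outage_hours,
                          upfront_cost=35_000, monthly_cost=1_000, months=12):
    """Return True if expected outage losses exceed the cost of self-hosting.

    Assumptions: costs are in dollars, the self-hosted setup is treated as
    having no downtime of its own, and `months` is the planning horizon.
    """
    expected_loss = downtime_cost_per_hour * expected_outage_hours
    self_host_cost = upfront_cost + monthly_cost * months
    return expected_loss > self_host_cost


# A shop losing $500/hour over a 15-hour outage: $7,500 vs $47,000.
small_shop = self_hosting_pays_off(500, 15)
# A shop losing $5,000/hour over the same outage: $75,000 vs $47,000.
big_shop = self_hosting_pays_off(5_000, 15)
```

A real risk analysis would also count staff time, lost confidence, and the outages your own hardware will have, which this sketch ignores.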
Re: (Score:2)
Reddit's downtime has been a bit of a running joke for a while now, with most (all?) of it being blamed on Amazon.
The way they implemented things is one of the big issues. For example, things like setting up RAID volumes across multiple EBS volumes. They just magnified their exposure to any issues in the cloud. Any one machine goes down the system gets hosed and needs recovery. They also are constrained to a single availability zone in order to get the performance they need from their setup. (This is n
Re: (Score:2)
Basically what you're saying is you can't just throw the "cloud" around like it's a magical fix-all; and that's true. But every time one of these "big company goes down" stories arises, people seem to take that as proof that the cloud is not useful for anything, and I would challenge that assertion. There are a number of times where you need to rapidly expand, or where you need good uptime and scalability but don't have a big budget; and for that, the cloud really shines.
Re: (Score:2)
No. If a zone goes down in CA, I can have a new server up in Virginia within minutes. I would rather be on ec2 when I go down. I guarantee I will be back up faster than you.
Re: (Score:1)
Re:have your own servers (Score:5, Informative)
So wait. The cloud sales pitch is "no more servers-save money-cut IT staff" but now it's:
1. Virtualized servers in zone 1
2. Virtualized servers in zone 2
3. Virtualized servers from a different company altogether.
So I went from one solid server, good backups, maybe a hot backup, and talented staff running the show to outsourced to 3 different clouds with hour-long hold times with some Amazon support monkey? Genius.
Re: (Score:2)
Re: (Score:2)
So I went from one solid server, good backups, maybe a hot backup, and talented staff running the show to outsourced to 3 different clouds with hour-long hold times with some Amazon support monkey? Genius.
I hope that one good server is in a disparate geographical location from its hot backup, using a separate transit provider, each server has redundant power supplies, and your talent has a bus factor of (#servers)+1 or more. You're gonna need backup for any load balancing as well, and whether that should be in yet another location is, well, something to consider.
Cloud services should give you the redundancy you need, as well as being easily scalable. Why are you trying to say the whole concept is bad just be
Re: (Score:1)
Re: (Score:2)
Which provider's implementation is not flawed?
Re: (Score:2)
Re: (Score:2)
If one can afford that kind of redundancy, then sure. Two independent lines coming in from two independent providers that individually will adequately handle all traffic for an extended period of time.
Why would you do that? It's enough to do things like run two DCs that can each handle 60% load or three that each handle 40% load. Not that much more expensive, and downtime turns into "the site is slow". There are architectural concerns, especially with data replication, but this is definitely doable, and it doesn't cost a mint.
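The arithmetic behind the 2 x 60% and 3 x 40% figures, as a small Python sketch. The function name is hypothetical, and capacity is expressed as an integer percentage of peak load to keep the numbers exact.

```python
def capacity_after_failures(num_sites, per_site_percent, failed=1):
    """Percent of peak load still servable after `failed` sites go down.

    Assumption: every site can serve `per_site_percent` of peak load and
    traffic can be redirected freely to the surviving sites.
    """
    surviving = max(num_sites - failed, 0)
    return surviving * per_site_percent


two_dcs = capacity_after_failures(2, 60)    # one DC left at 60% of peak
three_dcs = capacity_after_failures(3, 40)  # two DCs left at 80% of peak
```

In both layouts the total provisioned capacity is 120% of peak, but losing one site leaves 60% of peak with two DCs and 80% with three, which is why the outage shows up as a slow site rather than a dead one.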
philosophical POV (Score:2, Insightful)
I'll take the philosophical point of view on this and say failures are the best way to find and diagnose systemic weaknesses. Now Amazon knows the weakness in the AZs and can fix it.
Re: (Score:2)
And thus the gullible managers who ignored IT... (Score:3)
and only heard, "Cloud, cloud, cloud! It's new and shiny and cheaper than those annoying internal IT guys so I get a bonus!" learn to pay the stupidity tax.
Next up, learning just how *much* of your cloud data has been stolen and resold by those trustworthy souls in China and India.
Cheers!
Gullible manager doesn't care (Score:3)
Re: (Score:2)
C'mon. All managers love cloud!
What rolls down stairs, fails over in pairs,
Leaks data when it's allowed?
A stupidity tax, it replaces your racks,
It's cloud, cloud, cloud!
It's cloud! It's cloud! It's new, it's shiny, it's cheap!
It's cloud! It's cloud! It's down, and now you'll weep.
Everything's
Re: (Score:1)
Re: (Score:1)
It is a sad joke. Even for sites like Reddit whose administrators are supposed to know better, the Amazon shit hit. And the terrible thing is that it is not the first time that Amazon's service has broken, this has happened quite a lot in the last months, and people still *pay* for the service. Crazy.
Re: (Score:1)
and only heard, "Cloud, cloud, cloud! It's new and shiny and cheaper than those annoying internal IT guys so I get a bonus!" learn to pay the stupidity tax.
Next up, learning just how *much* of your cloud data has been stolen and resold by those trustworthy souls in China and India.
Cheers!
As a developer from India, I can tell you India has absolutely nothing to do with the cloud.. that's a US revolution.. we are still years behind..
Turning lemons into lemonade or...... (Score:2, Funny)
Re: (Score:3, Insightful)
Re: (Score:2)
Reddit has been down for approx 24 hours--it's been on RO mode for most of the day. Pretty bad PR for Amazon.
What if Amazon was down instead for Reddit, etc.? (Score:1)
Do you think Amazon would allow its own sales and services to be impacted for 12 hours (and running) under any circumstances short of the recent disaster in Japan? EC2 customers, on the other hand, appear to be second-class citizens.
Availability zones must be done PERFECTLY (Score:2)
"Availability Zones", "Failure Domains", etc. must be done with absolute perfection if you do them at all. If your gargantuan application has some single tiny side-feature that is not replicated across domains, your whole app is going down.
True Story: I was doing some consulting work for a large bank after they had a bunch of problems. Their main website had all the super-available trimmings: Oracle RAC, multi-site server clustering, storage mirroring, all the fancy, expensive, highly-available crap you
Re: (Score:2)
You could code your application to be tolerant of those kinds of outages. The services backing feature X aren't available? Then don't render the controls for feature X on the page.
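Something along these lines, sketched in Python. The feature names, backend names, and the `is_available` health check are all hypothetical; the point is only that a dead backend hides one feature instead of taking the whole page down.

```python
def visible_features(features, is_available):
    """Return the names of features whose controls should be rendered.

    `features` maps a feature name to the backend it depends on.
    `is_available` is a caller-supplied health check (hypothetical here)
    that returns True/False or raises if the backend cannot be reached.
    """
    visible = []
    for name, backend in features.items():
        try:
            healthy = is_available(backend)
        except Exception:
            healthy = False  # a failing health check counts as "backend down"
        if healthy:
            visible.append(name)
    return visible


def fake_health_check(backend):
    # Stand-in for a real check: pretend the recommendations service is down.
    if backend == "reco-svc":
        raise ConnectionError("backend unreachable")
    return True


page_features = {"search": "search-svc", "recommendations": "reco-svc"}
to_render = visible_features(page_features, fake_health_check)
```

The hard part in practice is making the health check itself fast and bounded, so a hung backend does not stall the page while you wait to find out it is down.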
Re: (Score:2)
They're called "FAILOVER" zones for a reason... (Score:2)
You're supposed to FAILOVER between them, not load balance between them.
You can't hold amazon accountable for your own stupidity.
Beyond that, you have to ask yourself the question: how many outages would you have had with your own facility in the past year compared to this outage? Did you apply the same approach to your use of EC2 as you would to your own facility?
Re: (Score:2)
You can't hold amazon accountable for your own stupidity.
I'm pretty sure you don't really understand what happened.
First, they're called Availability Zones. Not to be pedantic, but I just want you to be able to have the correct terminology if you want to read up on this.
Secondly, a failure in one AZ took out an entire Region. This is NOT supposed to happen. Each AZ is supposed to be considered as a separate datacenter in your application (separate power source, separate facility, separate uplink, etc.) AZs are supposed to be isolated from failures in other AZs.
Li
Re: (Score:2, Funny)
Re: (Score:1)
Re: (Score:2)
I'm back here because Reddit is down too.
Comment system is not bad.
Stories are good. ...but there just aren't very many of them!
Re: (Score:2)
Is this related to the DDoS of Change.Org? (Score:2)
Change.Org says that for the past several days the Chinese have been DDoSing it over a petition they are posting to gather support for Ai WeiWei.
http://blog.change.org/2011/04/chinese-hackers-attack-change-org-platform-in-reaction-to-ai-weiwei-campaign/ [change.org]
But if you go to the Change.Org site to sign the petition, you get a message saying that something is wrong with their servers, which are at Amazon.
http://www.change.org/petitions/call-for-the-release-of-ai-weiwei [change.org]
http://status.aws.amazon.com/ [amazon.com]
http://www.comput [computerworld.com]
No need to speculate (Score:2)
8:54 AM PDT We'd like to provide additional color on what we're working on right now (please note that we always know more and understand issues better after we fully recover and dive deep into the post mortem). A networking event early this morning triggered a large amount of re-mirroring of EBS volumes in US-EAST-1. This re-mirroring created a shortage of capacity in one of the US-EAST-1 Availability Zones, which impacted new EBS volume creation as well as the pace with which we could re-mirror and recover affected EBS volumes. Additionally, one of our internal control planes for EBS has become inundated such that it's difficult to create new EBS volumes and EBS backed instances. We are working as quickly as possible to add capacity to that one Availability Zone to speed up the re-mirroring, and working to restore the control plane issue. We're starting to see progress on these efforts, but are not there yet. We will continue to provide updates when we have them.
So the engineers failed to foresee a potential hazard. Hardly something to get worked up about, especially for a relatively young technology.
I've been saying (Score:1)
Downtime comes from people. The more people involved, the more downtime you'll have.
Change related? (Score:1)
My wild guess is
Amazon and Microsoft (Score:3)
Amazon and Microsoft have two distinctly different views of "cloud computing."
When I first learned about "cloud computing" I automatically assumed it meant that there would be an arbitrary number of different services available to an arbitrary number of web servers which would then be served to the user. No one service would depend on the other.
Amazon's "cloud computing" is centralized upon the virtual machine as the hub of the "cloud." Microsoft Azure, on the other hand, originally offered the approach that I had thought about, where everything is just a service, no VM required.
Today Amazon still depends heavily on the VM concept. You can't have a web service on Amazon without one. This also makes it excessively difficult to "load balance" or provide "failover" because you are actually expected to stand up new VM instances to scale up and down and need separate VM instances on each "availability zone." In addition it's not easy or affordable to share data between availability zones. This isn't what I thought the cloud was going to be.
Microsoft eventually added VMs to its Azure service so they could compete with Amazon's VM-centralized concept. I still think the idea of separate, independent services talking to each other was what the "cloud" was supposed to be, and if these services didn't have to depend on these VMs (which they do not have access to because AWS is intermittently down) they would have still been working from the other data centers.
centralized cloud computing ?? (Score:2)
`Amazon's "cloud computing" is centralized upon the virtual machine as the hub of the "cloud."'
I think you hit-the-nail-on-the-head there, a centralized anything is always vulnerable to this kind of failure. For a business with multiple locations a number of servers sited locally in a peer-to-peer configuration would provide a more reliable service. All they rely on is an end-to-end IP connection. If one site goes then the rest can carry on. I do believe this whole cloud computing concept has been over s
*Still A Happy, Paying EC2 Customer* (Score:2)
As what I would consider a medium-weight AWS user (our account is about 4 grand a month) I am still quite happy with AWS. We built our system across multiple availability zones, all in us-east and had zero downtime today as a result. We had a couple of issues where we tried to scale up to meet load levels and couldn't spin up anything in us-east-1a (or if we could, we couldn't attach it successfully to a load balancer because of internal connectivity issues), but we spun up a new instance in us-east-1b and a
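The "try us-east-1a, then fall back to us-east-1b" pattern described above could be sketched like this in Python. This is not the poster's code and it calls no real AWS API: `launch` is a hypothetical callable you would wire to whatever provisioning tool you use, and the instance IDs are fake.

```python
def launch_with_fallback(zones, launch):
    """Try each zone in order and return (zone, instance) for the first success.

    `launch` is a caller-supplied function (hypothetical here) that starts
    an instance in the given zone and raises an exception on failure.
    """
    errors = {}
    for zone in zones:
        try:
            return zone, launch(zone)
        except Exception as exc:
            errors[zone] = exc  # remember why this zone failed, then move on
    raise RuntimeError(f"no zone could launch an instance: {errors}")


def fake_launch(zone):
    # Stand-in for a real provisioning call: pretend us-east-1a is out of capacity.
    if zone == "us-east-1a":
        raise RuntimeError("insufficient capacity")
    return "i-" + zone


zone, instance = launch_with_fallback(["us-east-1a", "us-east-1b"], fake_launch)
```

This only helps if the rest of the stack (data, load balancer, configuration) is already usable from the second zone, which is the part that takes the real effort.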
Re: (Score:1)
I'd give you the good rating. You used the service in a sane manner that exploited the strengths of the system and avoided the weaknesses.
I suspect many users of EC2 actually end up with less reliability than they'd get with a server in a closet, as they don't realize the true effort it takes to have an effective solution like you do.
100% availability myth (Score:2)
100% availability is a myth. I know of a case where IBM's zero downtime operating system (z/OS) went d
for a little extra cash ? (Score:2)
"For cloud customers willing to pony up a little extra cash, Amazon has an enticing proposition: Spread your application across multiple availability zones for a near-guarantee that it won't suffer from downtime
I would have thought the entire raison d'etre of moving to the Cloud was to eliminate downtime, else why not rent two boxes in different locations and achieve this near-guarantee uptime without the extra expense not to mention your data totally disappearing when the Cloud goes down ...
They don't call it ... (Score:1)