Cloud Bug

Dark Day In the AWS Cloud: Big Name Sites Go Down

An outage of one company's servers might only affect that company's customers — but when a major data center for Amazon hits kinks, sites that rely on the AWS cloud services all suffer from the downtime. That's what happened today, when several major sites or online services (like Instagram and AirBnB) were knocked temporarily offline, evidently because of problems at an Amazon data center in Northern Virginia. From TechCrunch's coverage of the outage: "The deluge of tweets that accompanied the services’ initial hiccups first started at around 4 p.m. Eastern time, and only increased in intensity as users found they couldn’t share pictures of their food or their meticulously crafted video snippets. Some further poking around on Twitter and beyond revealed that some other services known to rely on AWS — Netflix, IFTTT, Heroku and Airbnb to name a few — have been experiencing similar issues today."


  • Re:Say what you will (Score:5, Interesting)

    by rudy_wayne ( 414635 ) on Sunday August 25, 2013 @08:23PM (#44672679)

    One of the features of AWS was supposed to be the ability to reroute everything to a different datacenter if one goes down. I know I read that somewhere back when AWS was first starting up. You don't think they lied, do you?

  • by JenovaSynthesis ( 528503 ) on Sunday August 25, 2013 @08:26PM (#44672699)

    That went down and I think it ate some files with it. Just before the crash my client reported 103 files being removed. They weren't by me.

  • by MillerHighLife21 ( 876240 ) on Sunday August 25, 2013 @08:40PM (#44672811) Homepage

    I've run servers on both Amazon and Rackspace for several years now and I can't recall a single instance of Rackspace having an outage. On the other hand, Amazon seems to have major issues at least 2 or 3 times a year. Is this stuff tracked anywhere?

  • Re:Say what you will (Score:2, Interesting)

    by AHuxley ( 892839 ) on Sunday August 25, 2013 @09:18PM (#44672985) Journal
    Re you need to pay for resources in each additional region.
    Why the lack of power and real optical links that were regional, power distinct.
    Is this like the idea of linking to a site/city/state/regional 'ring' many times? Very safe from any local cut/drop, cheap, but still very dependent on one geographic provider?
    You also have a submarine communications cable (France to the USA) on the way for that State?? ...the regional services should be good?
  • Re:Say what you will (Score:5, Interesting)

    by Cyberax ( 705495 ) on Sunday August 25, 2013 @09:38PM (#44673079)
    Well, right now I have 500 machines running some heavy calculations in multiple AZs. It works perfectly fine. We noticed the recent problems but simply stopped using the affected region (us-east-1) for the time being and shifted our calculations to other regions.

    AWS is really great at scaling. It's better than anything else on the market, but it does require a lot of work.
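A minimal sketch of the "stop using the affected region" idea from the comment above. It assumes the batch jobs are independent and that the list of unhealthy regions comes from your own monitoring. The function name `assign_jobs` and the job names are hypothetical, and nothing here calls a real AWS API.

```python
from itertools import cycle


def assign_jobs(jobs, regions, unhealthy):
    """Spread jobs round-robin over every region not marked unhealthy."""
    healthy = [region for region in regions if region not in unhealthy]
    if not healthy:
        raise RuntimeError("no healthy regions available")
    # zip stops when the jobs run out; cycle repeats the healthy regions.
    return {job: region for job, region in zip(jobs, cycle(healthy))}


if __name__ == "__main__":
    regions = ["us-east-1", "us-west-2", "eu-west-1"]
    plan = assign_jobs(["job-1", "job-2", "job-3"], regions, {"us-east-1"})
    for job, region in plan.items():
        print(job, "->", region)
```

The point is only that the scheduler, not the provider, decides where work goes. That is part of the "lot of work" the comment mentions.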
  • Re:Say what you will (Score:5, Interesting)

    by Glendale2x ( 210533 ) <[su.yeknomajnin] [ta] [todhsals]> on Sunday August 25, 2013 @09:40PM (#44673091) Homepage

    No, you have to manage your own redundancy and failover on AWS. Look at all the effort Netflix has put into programming failover and stress testing and yet they still have frequent outages with AWS.

  • by bagboy ( 630125 ) <(ten.citcra) (ta) (oen)> on Sunday August 25, 2013 @10:38PM (#44673317)
    public cloud services as "the future". I will never risk my corporate data uptime and reliability to some "location in the cloud". I'll stick to private clouds (VMWare/VCenter) where I have control of both hardware and software and reliable failsafe systems. At least then if I have downtime I also have accountability and predictability. The same cannot be said for cloud providers, and no matter what anyone says, once the data leaves your hardware, you have lost that control.
  • Re:Say what you will (Score:4, Interesting)

    by tnk1 ( 899206 ) on Sunday August 25, 2013 @11:01PM (#44673485)

    Supposedly the load balancer problem did not affect LBs that have backing hosts in two availability zones according to the article. The major question is... who runs everything in one availability zone? You're not supposed to do that for high availability sites.
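As a rough illustration of the point above, here is a sketch that flags load balancers whose backing hosts all sit in a single availability zone. The inventory format (a dict of balancer name to `(host, zone)` pairs) and the name `single_az_balancers` are assumptions made for this example. In practice the data would come from whatever inventory or API you already use.

```python
def single_az_balancers(balancers):
    """Return the names of balancers whose backends cover fewer than two zones.

    `balancers` maps a balancer name to a list of (host, zone) tuples.
    """
    risky = []
    for name, backends in balancers.items():
        zones = {zone for _host, zone in backends}
        if len(zones) < 2:
            risky.append(name)
    return sorted(risky)


if __name__ == "__main__":
    inventory = {
        "web": [("web-1", "us-east-1a"), ("web-2", "us-east-1b")],
        "api": [("api-1", "us-east-1a"), ("api-2", "us-east-1a")],
    }
    print(single_az_balancers(inventory))
```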

  • by l0ungeb0y ( 442022 ) on Sunday August 25, 2013 @11:38PM (#44673729) Homepage Journal

    Depends on which "future" you are talking about. The future where the bulk of personal data is stored on the cloud to be shared across devices and with friends, family and authorized services is one I think is bound to come to fruition.

    The future where Corporations put their core infrastructure into the Cloud is not one I ever recall anyone talking about.

  • Re:Say what you will (Score:4, Interesting)

    by hawguy ( 1600213 ) on Monday August 26, 2013 @12:17AM (#44673977)

    No, they didn't lie. You can set things up that way: simply set up your servers in multiple data centers (AWS availability zones) and load balance between them. It's foolish to just throw things up in the cloud and think, "Magically, I won't ever have to worry about downtime ever again."

    But that was one of the big promises of "the cloud": that you'd never have to worry about the nitty-gritty of network administration again, your provider would handle all that for you.

    There are many different flavors of "cloud" computing - if you throw your app at a cloud provider and blindly expect them to make it highly available, then you'll get what you deserve. There is no end of cloud solution providers that will be happy to help you architect your app for whatever level of redundancy you want. But it's not going to be free.

    Amazon does let you get rid of your network admin and concentrate on managing the servers. No need to worry about BGP, buying bandwidth from multiple redundant providers, buying and administering your own firewalls, network switches, routers, etc.

    But you still have to manage your servers. Amazon will help you with multi-AZ redundancy for things like MySQL.

    If that isn't the case, then you gain nothing and might as well host the data yourself.

    That depends heavily on your use case. If you have a relatively small number of servers, or have large demand spikes, Amazon can be much more cost effective than hosting your own servers. If you have hundreds of servers and keep them busy all the time, you can probably save money by doing it yourself.

    But if you have dozens of servers, then it's likely that you'll save money with Amazon over buying your own servers, network gear, a SAN, backup solution, hardware service contracts, etc.

    But you have to architect your application properly. We have our core servers split across multiple AZs with the database replicated across those AZs. We don't trust our failover/failback scripts enough to make it automatic, so we have a simple web interface to let anyone on the tech team do the failover. The only impact we saw in this outage was higher latency and timeouts to some of our app servers, but our database was not in the affected zone, and Amazon's load balancer correctly routed traffic to the servers in the good AZ.

    Additionally, we have a warm spare running in a different region. The servers are kept up to date with data, but they are running on smaller instance types than we need to run our app. To do a regional failover, we'd have to reboot them into larger instance types (our app startup scripts already tune memory parameters to take advantage of the greater amounts of RAM in the larger instances), then repoint DNS.
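The warm-spare procedure described above can be sketched as an ordered runbook: stop each spare, switch it to a larger instance type, start it again, and only then repoint DNS. This is an in-memory model under stated assumptions, not the poster's actual scripts. `Server`, `regional_failover`, the instance type names, and the `example.com` hostnames are all hypothetical, and no real cloud or DNS API is called.

```python
from dataclasses import dataclass


@dataclass
class Server:
    name: str
    instance_type: str
    running: bool = True


def regional_failover(spares, target_type, dns, record, spare_endpoint):
    """Resize every warm spare, then repoint the DNS record at the spare region.

    Returns a log of the steps in the order they were performed.
    """
    log = []
    for server in spares:
        # An instance has to be stopped before its type can change.
        server.running = False
        log.append(f"stop {server.name}")
        server.instance_type = target_type
        log.append(f"resize {server.name} -> {target_type}")
        server.running = True
        log.append(f"start {server.name}")
    # DNS moves last, so traffic only arrives once the spares are at full size.
    dns[record] = spare_endpoint
    log.append(f"repoint {record} -> {spare_endpoint}")
    return log


if __name__ == "__main__":
    spares = [Server("app-spare", "m1.small"), Server("db-spare", "m1.small")]
    dns = {"app.example.com": "primary.example.com"}
    for step in regional_failover(
        spares, "m1.xlarge", dns, "app.example.com", "spare.example.com"
    ):
        print(step)
```

Keeping the steps in one function that returns a log fits the manual "web interface" approach mentioned earlier in the comment: a person triggers it and can read back exactly what was done.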

  • by Joining Yet Again ( 2992179 ) on Monday August 26, 2013 @04:31AM (#44674883)

    But I thought the whole point of the cloud was that everything included redundancy, so a server, or a cable, or a whole datacentre could go down, and because of real time replication, nothing whatever would be missed.

    Or am I just thinking of VAXclusters from, you know, the 1980s.
