
Amazon EC2 Crash Caused Data Loss 112

Posted by timothy
from the but-that-was-off-site-backup! dept.
Relayman writes "Henry Blodget is reporting that the recent EC2 crash caused permanent data loss. Apparently, the backups that were being made were not sufficient to recover the lost data. Although a small percentage of the total data was lost, any data loss can be bad for a website operator."
  • by Anonymous Coward

    srsly, as in your own

  • by Man On Pink Corner (1089867) on Friday April 29, 2011 @02:25AM (#35972194)

    ... the confusion of ideas that would lead someone to treat their live web server as their primary/master data repository.

    I guess I'm still stuck in Commodore 64 World, or something..

    • well, a data center run by amazon certainly has more rigorous backup and maintenance schedules than anything I could personally come up with, offhand. It took something pretty catastrophic to bring it down and cause data lass. The problem is if someone decided to only have one copy, at amazon. If they had copies on two different servers, success!
      • by Anonymous Coward
        naaa, I've not heard of any meteor swarm hitting amazon servers, so there was not anything catastrophic.

        just business as usual.
      • I'm not so sure about rigorous...
        1- I personally have never lost a single byte of meaningful data
        2- do amazon detail their exact procedures and commitments?
        3- do amazon back up those "commitments" with hard cash? How much will the people whose data they lost be compensated?

        read the sig....

        • by jc2brown (1997958)
          You might want to read this [amazon.com].

          They're crediting all accounts that had any activity in the USA-East region for 10 days of usage, regardless of whether they were affected.

          Remember that it was EC2 that was affected, which is just a virtual machine with volatile storage. Had it been S3 data that was lost, one should expect restitution, but in this case the downtime and data loss are ultimately the fault of the user.
        • 1- I personally have never lost a single byte of meaningful data

          Yep--the moment I accidentally 'rm -rf /', I simply re-classify the drive as 'not containing meaningful data' and my stats are saved.

      • by MichaelSmith (789609) on Friday April 29, 2011 @02:53AM (#35972280) Homepage Journal

        It took something pretty catastrophic to bring it down and cause data lass

        Catastrophic would be an earthquake, tsunami and meltdown, in that order. From my reading of the situation amazon stuffed up their own replication mechanism and it recursively replicated the system to fill up the available hardware. That's just bad design. It's obvious they did no testing under realistic conditions.

        • I wouldn't be too hard on them. Yes, they screwed up - but a bug like that could easily slip through testing, as it might only occur on extreme-sized data sets. Their real screwup was in not noticing right away and reverting to the previous config.
      • by greenbird (859670)

        well, a data center run by amazon certainly has more rigorous backup and maintenance schedules than anything I could personally come up with

        It's funny. Not a single place I've worked at has had backups as good as the ones I have for my personal stuff. And I didn't even spend 6 figures for some useless enterprise backup solution. Some scripting, cp -al, rsync, dmcrypt, ssh and a remote PC at my girlfriend's house and you have an incremental backup solution more secure and more robust than any enterprise solution I've ever seen, and it only cost a couple hundred for the drives.

        • Re: (Score:2, Insightful)

          by zonky (1153039)
          Congrats. Does your GF have a key to your house? Because your "perfect system" has a single point of failure: an insider who could damage both copies, causing loss of data. Best not get on her bad side for now, anyway....
          • Hmmmm? You don't store copies of your data in remote locations? What about a fire? I think a backup scheme must store data remotely. Leave one copy with your parents (upstairs), and one with a friend in australia (I guess across the street...)
          • by renoX (11677)

            He didn't claim that his backup system was perfect, just better than what many enterprises do, which is probably true.

            > Best not get on her bad side for now, anyway....

            Bah, if he is wise his backups are encrypted, so this shouldn't be a big issue (unless he has a bad break-up with his GF and loses data at the same time: Murphy's law in action).

          • by tigersha (151319)

            This is definitely better than Amazon's backup plan. It backs up data and LITERALLY screws you. Amazon just screws you.

        • by jpapon (1877296)
          Until she dumps you and throws your backup drives out her window that is. Tying the security of your backup to the security of your relationship is an interesting gamble. One day you might find yourself lonely AND data-less.

          Unless of course you're one of those people who refers to female friends as "girlfriends", in which case, I hate you.

          • by Yvanhoe (564877) on Friday April 29, 2011 @04:14AM (#35972550) Journal
            That is still better than Amazon's plan actually.
          • by brusk (135896)

            Unless of course you're one of those people who refers to female friends as "girlfriends", in which case, I hate you.

            Women do that too, but this is /.

          • Well, so long as his backups at her place consist only of jpeg pictures of their relationship, you could actually kill two birds with one stone there...
          • by greenbird (859670)

            Until she dumps you and throws your backup drives out her window that is.

            She'd have to come here and trash those also. That'd be a trick though. I have some 10 computers spread over several rooms here and a dozen or more external drives. I never said it was perfect. Just better than any enterprise setup I've seen. The malicious insider is always the toughest hole to cover in any data protection scheme.

        • by jimicus (737525)

          Bigger solutions are invariably more complicated. And when they're more complicated, there's more to go wrong - and when it does go wrong, there's more that can be affected.

          This is why I'm quite wary of people throwing the word "Enterprise" around. IME, it's frequently a codeword meaning "A proprietary vendor has told us their product can be all things to all men - which is technically true but what we're buying needs many more man-hours of work to turn it into anything for anyone than we can hope to dedicate."

        • My backup system is nowhere near the backup system at the studio where I work. But I could deploy something even easier for one simple reason: I have less data.

          Do you know how much it would cost to remote push 10TB of data once a week?

          • That's what differentials are for.
          • by greenbird (859670)

            Do you know how much it would cost to remote push 10TB of data once a week?

            If you're generating 10TB of data weekly you're talking about an extremely rare situation that would require a specialized solution anyway since no backup solution out there can support that.

            If you're talking about having around 10TB of data total, that's what rsync is for. You do fast incrementals on the local system over a high-bandwidth pipe (SATA, SAS). Then you have 2 options depending on your system requirements. You either run a daily rsync of one of the incrementals directly offsite over a separate network

        • by Anonymous Coward

          What's a girlfriend?

        • by Eivind (15695)

          I've pondered the same thing. My workplace spends 6 figures, and as far as I'm able to tell, gets significantly less than I have at home, despite my investment being 2 orders of magnitude less.

          Every 2 hours for the last day, every day for the last week, every week for the last month, every month for the last year, every year forever. Physically backed up to 2 distinct discs in-house (one of which is pretty burglar-proof, living in a safe), and 2 encrypted copies under the care of 2 distinct companies, in diffe

          • by mwvdlee (775178)

            Just curious; what 5TB worth of personal data requires a 4-figure backup spending?

            • by Cyrrus30 (1993140)
              That's what I was wondering. What can be 5 TB of personal data? My "real" worthwhile data (meaning things that I can't get back if something goes wrong) takes like 30 GB. And 29.5 of this is pictures. But things like downloaded movies or MP3s (a certain chunk of it being illegally acquired) are not worth backing up using such a complicated scheme. My house burns and I lose some MP3s I bought on iTunes? No big deal, there would be a lot of other things more important that would bug me (like, you know, my h
            • by MarkGriz (520778)

              Just curious; what 5TB worth of personal data requires a 4-figure backup spending?

              Porn

          • by jimicus (737525)

            Where things start to get more complicated is when the data being stored requires some massaging before you can take a copy - or for that matter if you can only take a copy under specific circumstances or your copy is only useful under specific circumstances.

            For instance:

            Most modern databases store their data in files on the disk. Database transactions are atomic, sure. And (hopefully, assuming a modern FS) so are disk transactions. This does not mean, however, that you can simply copy the underlying files
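That consistency point can be illustrated with SQLite's online backup API, used here as a small stand-in for the dump/snapshot step a bigger database needs; the table and data are invented for the example.

```python
import sqlite3

# Copying a live database's file mid-write risks a torn, inconsistent
# copy; the engine's own backup API takes a transaction-consistent one.
src = sqlite3.connect(":memory:")
src.execute("CREATE TABLE t (x INTEGER)")
src.execute("INSERT INTO t VALUES (1), (2)")
src.commit()

dst = sqlite3.connect(":memory:")  # in practice, a file on the backup medium
src.backup(dst)                    # consistent snapshot via the engine itself

print(dst.execute("SELECT COUNT(*) FROM t").fetchone()[0])  # → 2
```

Server databases expose the same idea as dump tools or snapshot-coordination hooks; the file-level copy alone is not enough.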

        • by arndawg (1468629)
          Wait. Your backup target is ONLINE? I've got news for you buddy. You don't have backups!
        • It doesn't take 6 figures, and if it does and isn't as good as rsync, you need to find a new line of work.

          Any backup solution should include tape, for a very simple reason-- you end up with multiple offline copies of data at about $25 per terabyte. Your backup solution sounds like it gets knocked out if someone introduces bogus data into your system; once the backup occurs, it overwrites all your good backups.

          A good backup system isnt even that expensive; about $4000 will get you an LTO4 autoloader with a fu

          • by greenbird (859670)

            Your backup solution sounds like it gets knocked out if someone introduces bogus data into your system; once the backup occurs, it overwrites all your good backups.

            Ummm...you need to look at what cp -al and rsync -H do. Let's just say you're completely and utterly wrong, to be polite about it. I have exact images of my systems at 4 hour intervals available instantly just by going to my backup drive. No need to go digging through tapes, spending hours looking through dumps and incrementals or trying to figure out if it's on an offsite tape. I suspect you're one of those people who set up the backups for one of the companies I've worked for. Tape is utterly useless in thi

            • I have used (and still do use) rsync -- I missed the -H option in your post, but regardless, all of your history is in one place.

              The biggest problem I have seen with rsync is that in directory structures with tens of thousands of tiny files, it takes a VERY long time to search for changes, which can be a problem if you have, for example, database files which need to be taken out of use for backup (the particular database I'm dealing with doesn't have a "dump" command, as it, itself, is used for remote backup a

              • by greenbird (859670)

                all of your history is in one place.

                No, it's not. It's here and at my offsite location (girlfriend's house atm).

                Also, good luck rotating 24TB of removable storage; with a budget of $6500

                There's no rotation required. Once it's set up the only time you have to touch a drive is to replace a bad one. With well less than half that I could build a backup server with 24TB. For that much I could have RAID 1 on the backup server. Not only that but I can have backup images at, say, 4 hour intervals instantly available at any given moment. No trying to get an offsite tape hoping they get the right one or haven't lost it or get

                  No, it's not. It's here and at my offsite location (girlfriend's house atm).

                  Which means you have 2 copies. The tape system I will be setting up this week (~$5000) will give me 20 copies, on LTO4 tapes. That's 20 entire(ish) backups of our network, with the ability to roll back to any date within the last 3 weeks (our supplemental backup system covers dates further back).

                  With well less than half that I could build a backup server with 24TB. For that much I could have RAID 1 on the backup server.

                  Have fun with your error and drive failure rates. 24 drives @ 2TB each (RAID1) means an awful lot of failures each year. That's also a LOT of arrays to present to your system-- in order to have any reasonable kind

                • Sorry for double post, I also did want to point out that here

                  Also, good luck rotating 24TB of removable storage; with a budget of $6500

                  That $6500 budget went towards both the autoloader, and ~140TB of tape storage-- that is, 5 or so sets of ~24TB each (100 tapes).

                  Tape is currently $16/TB or so. HDD platters, for the absolute cheapest deals (1TB drives), are around $40/TB-- 3x the price-- and require RAID cards to drive them, as well as hotswap cages. You will not be able to match tape prices for a long, long time, if ever.

      well, a data center run by amazon certainly has more rigorous backup and maintenance schedules than anything I could personally come up with, offhand

        And a fat lot of good that did.

      • by CSMoran (1577071)

        bring it down and cause data lass.

        I'm a lad, you insensitive clod!

      • by Khyber (864651)

        "a data center run by amazon certainly has more rigorous backup and maintenance schedules than anything I could personally come up with, offhand."

        Differential backups every five minutes across three different backup systems rolling RAID-6, one local and two remote.

        I don't have data loss issues, EVER.

        And I'm just a low-level tech guy. If Amazon can't get it right and I can, something is wrong.

    • by Anonymous Coward

      I am not rightly able to comprehend how someone whose primary data source is people entering data via their website could have it elsewhere. Sure, they can have copies elsewhere, but those would be the backups.

    • by wvmarle (1070040) on Friday April 29, 2011 @04:20AM (#35972576)

      From a look at the linked article, it seems that one of the issues is data generated by these web sites, such as user statistics or user-uploaded content. That naturally lives primarily on the live web server and is also data that you don't want to lose. Also, as other commenters mentioned, the EC2 service is not a cloud-storage server; it's a web hosting service, and web hosts tend to indeed generate their own data.

      This data of course needs to be backed up actively, and one would expect a web host to include that in its service. That's one of the reasons to pay for such a service, instead of doing it yourself.

      Besides relying on their backups it's of course a good idea to regularly take backups yourself. But even if you do this daily, it means you may lose up to a day's worth of data. And that's (partly) what happened here. It's similar to someone who takes a photo on a digital camera, and subsequently loses that camera and the photo with it. You don't say "they shouldn't use a camera as primary data repository". It isn't. It's a temporary repository, and when the data is generated it's the one and only repository, simply pending copying to backup media.

      • It's not a "web host" even.. it's simply a virtual machine run-time environment. You set up the OS, and configure it... Amazon does not... they provide storage facilities that can be used to back up to, and even mount to your host OS. Also, many virtual machine or virtual host providers don't necessarily provide backup solutions.
    • I guess I'm still stuck in Commodore 64 World, or something..

      Cassette tapes? I'm so very sorry.

  • by DWMorse (1816016) on Friday April 29, 2011 @02:28AM (#35972198) Homepage
    Was the lost data... all the stuff the PSN network lost? I think I see a connection!
    • by mini me (132455) on Friday April 29, 2011 @02:40AM (#35972236)

      Cloud applications hosted on Amazon survived this incident without issue, as expected. Only the regular old hosted applications had problems with the outage. They were never "the cloud" to begin with, so I'm not sure why the term even comes up in this discussion.

      The cloud represents a black box that hides the underlying network topology so that there are no single points of failure. Cloud applications are tolerant because they are spread through different datacenters across multiple points in the world. A catastrophe at one or more datacenters will have no noticeable effect on the availability of a cloud application because it continues to run in many more.

      Amazon offers a few cloud applications: S3 comes to mind. But Amazon's EC2/EBS hosting service is a plain old hosting service like any other. The EC2 topology is not hidden away from you. You have to make active decisions about where you want your EC2 instance to live. That goes against the idea of the cloud. What Amazon does offer in EC2 is the tools necessary for you to build a cloud application, but not everything hosted on EC2 is a cloud application by default.

      • When you put it like that, it's hard to just bash things.

        Stupid facts.

      • You're right, by default EC2 isn't a cloud solution, but Amazon doesn't help alleviate that confusion (from their website for EC2):

        Amazon Elastic Compute Cloud (Amazon EC2) is a web service that provides resizable compute capacity in the cloud. It is designed to make web-scale computing easier for developers.

        It doesn't help when the name of the service (EC2) has the word "Cloud" in it.

  • What is S3? (Score:5, Informative)

    by badran (973386) on Friday April 29, 2011 @02:36AM (#35972224)

    EC2 is not meant to be used for data storage, that is what S3 is designed for. You store data and backups on S3, and use EC2 to serve high bandwidth websites to the masses.

    • by Big_Mamma (663104)

      And that's exactly how EBS is supposed to be backed up - it saves snapshots all the time to S3. Small and cheap incremental backups stored to a 99.999999999% durable [amazon.com] storage area. But apparently, Amazon messed up the backed-up copies as well - instead of producing an outdated but valid snapshot, they replied to affected customers with:

      A few days ago we sent you an email letting you know that we were working on recovering an inconsistent data snapshot of one or more of your Amazon EBS volumes. We are very

      • by Anonymous Coward

        The quote is about the snapshots they took before they started recovery efforts. If you took snapshots of your EBS volumes regularly, these were not affected at all...

    • EC2 is not meant to be used for data storage, that is what S3 is designed for. You store data and backups on S3, and use EC2 to serve high bandwidth websites to the masses.

      I don't think this is a fair criticism of people who lost data.

      S3 isn't designed as an online datastore for live applications. Sure, I can put any content in there that I want, but it can't be up-to-the-millisecond.

      AWS said to consider EBS volumes to be like hard disks, with a similar failure rate to hard disks. I forget the expected failure rate that they posted, but I think it was roughly between 1:100 and 1:1000 EBS volumes should be expected to fail each year. So go ahead and make your usual solutions w

  • Guess Wikileaks feels good about not being hosted there anymore.... their critical information could have been "lost" as well....
  • What is more scary to me is that Amazon tells you they have multiple availability zones in each region, and recommends that you distribute replicated servers across these zones. For example, I have a project with the master database in one zone and the replica in another. Why did both zones fail? Are they not isolated/independent? Amazon charges you for data transfer between zones. As others say, when the servers fail, anyone should have had backups somewhere else (S3, or external to Amazon).
    • by Anonymous Coward

      Availability zones are just one part of the picture for a good software design in data centers. While they are isolated from each other, they may still be in the same location (or within the same general area), and could more easily be hit with a network partition, floods, tornadoes, etc. Instead, as most big businesses know (like Netflix, which didn't suffer from the outage), you need regional separation in addition to availability zones. Amazon does provide this functionality, it just costs more, and most

      • by nereid666 (533498) *
        From: http://aws.amazon.com/es/ec2/ [amazon.com]
        Availability Zones are distinct locations that are engineered to be insulated from failures in other Availability Zones and provide inexpensive, low latency network connectivity to other Availability Zones in the same Region. By launching instances in separate Availability Zones, you can protect your applications from failure of a single location.

        Better than using a different region, I think it is better to have multiple cloud providers...
  • by nereid666 (533498) * <spam@damia.net> on Friday April 29, 2011 @05:17AM (#35972734) Homepage
    Amazon's post-mortem explanation:
    http://aws.amazon.com/message/65648/ [amazon.com]
  • any data loss can be bad for a website operator.

    any data loss is catastrophic, if it's your data. They claim "a small percentage" of data was lost... 1% is a small percentage... 10% is also a small percentage, but it's a huge amount of data.

    Fortunately where I live and work there isn't really sufficient and reliable connectivity to "the cloud" to make it a worthwhile endeavor, so hopefully all the mistakes are learnt from before I have to worry about it.

    • by Rakishi (759894)

      any data loss is catastrophic, if it's your data.

      No, it's only catastrophic if you're an idiot. Then again many website operators seem to be just that given how many need to use google cache to recover data after their web provider's server croaks.

      Anyway, having your data in any single unreliable location is a recipe for disaster. And yes, with a 0.5-1% annual failure rate EBS is unreliable and no one claims otherwise. If you want reliable you use S3 and off-site backups.

      • No, it's only catastrophic if you're an idiot. Then again many website operators seem to be just that given how many need to use google cache to recover data after their web provider's server croaks.

        Anyway, having your data in any single unreliable location is a recipe for disaster. And yes, with a 0.5-1% annual failure rate EBS is unreliable and no one claims otherwise. If you want reliable you use S3 and off-site backups.

        Please explain to me how I can keep my data in S3 and/or offsite backups up-to-the-millisecond.

        I'll wait.

        • by Rakishi (759894)

          And who is forcing you to use EC2 and EBS? Is there a gun to your head? Why in god's name are you using an infrastructure that is clearly not compatible with your needs?

          • And who is forcing you to use EC2 and EBS? Is there a gun to your head? Why in god's name are you using an infrastructure that is clearly not compatible with your needs?

            Was that supposed to be an answer to my question? Is every application that fails to store all of its data immediately in S3 "clearly not compatible" with EC2?

            • by Rakishi (759894)

              If losing the intermittent data is catastrophic then yes they're not compatible. Find a different solution.

              That said, database replication is a very old problem and solutions exist for it. Likewise, some applications simply don't suffer too much from losing a bit of data, so the cost of that is low. Other applications have no data to lose since they're simply acting as data serving platforms. And that's all for web applications, which aren't quite what EC2 was made for; after all it's called "elastic compute cloud"

              • That said, database replication is a very old problem and solutions exist to that.

                Thank you for at least attempting to answer my question.

                While your answer does not involve the use of S3, it is exactly the answer that should have worked, but did not in the case of yesterday's outage. Replicate your database to a different availability zone. Great, except EBS failed region-wide, so your slave database just died, too.

                Not that I'm really even all that upset about the outage. My application was down for a few hours and degraded for a few more hours until I could replay the transactions that

  • Unless you pay extra, they say you can expect to lose data stored in S3 on a regular basis. There's nothing wrong with that per se, but it's something you need to plan for.

    S3:

    Designed to provide 99.99% durability and 99.99% availability of objects over a given year. This durability level corresponds to an average annual expected loss of 0.01% of objects.

    http://aws.amazon.com/s3/ [amazon.com]

    EBS:

    ...Amazon EBS snapshot can expect an annual failure rate (AFR) of between 0.1% - 0.5%, where failure refers to a complete loss of the volume

    • The durability you quote for S3 (99.99%) is for the reduced redundancy option. The standard storage lists 99.999999999% durability.

  • by olau (314197) on Friday April 29, 2011 @06:35AM (#35972952) Homepage

    This is not the first time I've heard about a big hosting centre losing data, even though it "never happens" and they are keeping backups, etc.

    If it's at all manageable, keep one copy safe at your own place in addition to the replication at the hosting centre. You can set up a cheap box at the office with a couple of terabytes of disk space and suck down the data periodically with something like rsync and rdiff-backup. It's not a whole lot of work and can make the difference between having a big problem and total disaster.

    It would help if hosting centres actually told you how exactly they store and back up your data and what they do in case of emergency, instead of throwing meaningless phrases like "99.999% uptime!" and "fully redundant storage backbone!" at you. A fully redundant storage backbone is nothing if it means it's built with some big arse proprietary SAN stuff where the whole array goes down if the main controller goes down. Which it of course does, because it's a flaky embedded thing with 2k memory that has to be programmed in assembler and C with dangling memory pointers all over the place.
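As a sketch, the periodic pull described above could be as small as one cron entry on the office box; the host and paths here are made up for illustration, and rdiff-backup is assumed to be installed on both ends.

```
# m h dom mon dow  command
# Pull the hosted data down nightly, keeping reverse increments so
# older versions stay recoverable after a bad day at the hosting centre.
0 3 * * *  rdiff-backup user@www.example.com::/var/www /srv/backup/www
```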

    • by Junta (36770)

      Very good advice. One issue is a *lot* of their users are commercial companies that viewed this as a way not to sweat the details at all. For many of those, if they have to sweat backup and all that, they might as well do the hosting themselves because the cost delta for them is not particularly large.

    • by dsouza42 (1151071)
      I don't know if it would help if they told you exactly how everything works. I'm sure any company with an infrastructure like Amazon's takes backups and safety very seriously. The availability numbers tell you what you can expect statistically from their services. The service that caused data loss is called EBS and according to Amazon: "Amazon EBS snapshot can expect an annual failure rate (AFR) of between 0.1% – 0.5%, where failure refers to a complete loss of the volume". So if you have your data the
      • Be careful with your quoting from the middle of a sentence. When you quoted, "Amazon EBS snapshot can expect an annual failure rate (AFR) of between 0.1% – 0.5%, where failure refers to a complete loss of the volume," that gave the impression the EBS snapshots had an AFR of 0.1% – 0.5%. Actually, EBS snapshots are stored on S3, so they have a durability rate of 99.999999999% each year.

        It's EBS volumes, themselves, that have 0.1% – 0.5% annual failure rates. I'm sure you already knew this,
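To put the two failure rates discussed in this subthread side by side, a quick back-of-the-envelope calculation; the fleet of 100 volumes is hypothetical, and the rates are the ones quoted above (the top of the 0.1%-0.5% EBS range, and S3's eleven nines).

```python
# Compare the quoted EBS volume AFR with S3's object durability
# for a hypothetical fleet of 100 independent volumes/snapshots.
ebs_afr = 0.005                  # 0.5%/year per EBS volume (upper bound quoted)
s3_durability = 0.99999999999    # "eleven nines" per object per year

n = 100
p_lose_any_ebs = 1 - (1 - ebs_afr) ** n
p_lose_any_s3 = 1 - s3_durability ** n

print(f"EBS: {p_lose_any_ebs:.1%} chance of losing at least one volume/year")
print(f"S3:  {p_lose_any_s3:.2e} chance of losing at least one object/year")
```

With 100 volumes, the chance of losing at least one in a year is roughly 39% for EBS versus about one in a billion for S3, which is why snapshotting EBS volumes to S3 matters.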
