Amazon EC2 Crash Caused Data Loss
Relayman writes "Henry Blodget is reporting that the recent EC2 crash caused permanent data loss. Apparently, the backups being made were not sufficient to recover the lost data. Although only a small percentage of the total data was lost, any data loss can be bad for a website operator."
Um, backups? (Score:1)
srsly, as in your own
I am not rightly able to comprehend... (Score:5, Insightful)
... the confusion of ideas that would lead someone to treat their live web server as their primary/master data repository.
I guess I'm still stuck in Commodore 64 World, or something..
Re: (Score:2)
Re: (Score:1)
just business as usual.
Re: (Score:3)
I'm not so sure about rigorous...
1- I personally have never lost a single byte of meaningful data
2- does Amazon detail their exact procedures and commitments?
3- does Amazon back those "commitments" with hard cash? How much will the people whose data they lost be compensated?
read the sig....
Re: (Score:2)
They're crediting all accounts that had any activity in the USA-East region with 10 days of usage, regardless of whether they were affected.
Remember that it was EC2 that was affected, which is just a virtual machine with volatile storage. Had it been S3 data that was lost one should expect restitution, but in this case downtime and data loss is ultimately the fault of the user.
Re: (Score:2)
Re:I am not rightly able to comprehend... (Score:4, Interesting)
That depends. Only a couple of our servers in that availability zone were actually affected, but we're apparently being compensated as though all of them were. Bonus for us.
Re: (Score:3)
1- I personally have never lost a single byte of meaningful data
Yep--the moment I accidentally 'rm -rf /', I simply re-classify the drive as 'not containing meaningful data' and my stats are saved.
Re:I am not rightly able to comprehend... (Score:5, Informative)
It took something pretty catastrophic to bring it down and cause data lass
Catastrophic would be an earthquake, tsunami and meltdown, in that order. From my reading of the situation, Amazon stuffed up their own replication mechanism and it recursively replicated the system until it filled the available hardware. That's just bad design. It's obvious they did no testing under realistic conditions.
Re: (Score:2)
Re: (Score:2)
It's obvious they did no testing under realistic conditions.
And how do you test under realistic conditions when those realistic conditions are an enormous, ~10 datacenter system that serves a good percentage of the internet?
Like in Contact: you build two of them.
Re: (Score:2)
It's obvious they did no testing under realistic conditions.
And how do you test under realistic conditions when those realistic conditions are an enormous, ~10 datacenter system that serves a good percentage of the internet?
I work on air traffic control systems. In our environment the whole system, including the bit between the keyboard and the seat, is intensively exercised in realistic conditions. Simulation modes are built in. It's expensive, but that's the way to deploy a complex system that works reliably.
Re: (Score:3)
well, a data center run by Amazon certainly has more rigorous backup and maintenance schedules than anything I could personally come up with
It's funny. Not a single place I've worked at has had backups as good as the ones I have for my personal stuff. And I didn't even spend 6 figures on some useless enterprise backup solution. Some scripting, cp -al, rsync, dmcrypt, ssh and a remote PC at my girlfriend's house, and you have an incremental backup solution more secure and more robust than any enterprise solution I've ever seen, and it only cost a couple hundred for the drives.
Re: (Score:2, Insightful)
Re: (Score:2)
Re: (Score:2)
He didn't claim that his backup system was perfect, just better than what many enterprises do, which is probably true.
> Best not get on her bad side for now, anyway....
Bah, if he is wise his backups are encrypted, so this shouldn't be a big issue (unless he has a bad break-up with his GF and loses data at the same time: Murphy's law in action).
Re: (Score:2)
This is definitely better than Amazon's backup plan. It backs up data and LITERALLY screws you. Amazon just screws you.
Re: (Score:3)
Unless of course you're one of those people who refers to female friends as "girlfriends", in which case, I hate you.
Re:I am not rightly able to comprehend... (Score:5, Funny)
Re: (Score:2)
Unless of course you're one of those people who refers to female friends as "girlfriends", in which case, I hate you.
Women do that too, but this is /.
Re: (Score:1)
Re: (Score:2)
Until she dumps you and throws your backup drives out her window that is.
She'd have to come here and trash those also. That'd be a trick though. I have some 10 computers spread over several rooms here and a dozen or more external drives. I never said it was perfect. Just better than any enterprise setup I've seen. The malicious insider is always the toughest hole to cover in any data protection scheme.
Re: (Score:2)
Bigger solutions are invariably more complicated. And when they're more complicated, there's more to go wrong - and when it does go wrong, there's more that can be affected.
This is why I'm quite wary of people throwing the word "Enterprise" around. IME, it's frequently a codeword meaning "A proprietary vendor has told us their product can be all things to all men - which is technically true but what we're buying needs many more man-hours of work to turn it into anything for anyone than we can hope to dedica
Re: (Score:2)
My backup system is nowhere near the backup system at the studio I work at. But I could deploy something even easier for one simple reason: I have less data.
Do you know how much it would cost to remote push 10TB of data once a week?
Re: (Score:2)
Re: (Score:2)
Do you know how much it would cost to remote push 10TB of data once a week?
If you're generating 10TB of data weekly you're talking about an extremely rare situation that would require a specialized solution anyway since no backup solution out there can support that.
If you're talking about having around 10TB of data total, that's what rsync is for. You do fast incrementals on the local system over a high-bandwidth pipe (SATA, SAS). Then you have 2 options depending on your system requirements. You either run a daily rsync of one of the incrementals directly offsite over a separate netw
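As a rough sketch, the two-tier schedule described above might look like this in cron (the hostname, paths, and the `snapshot-local.sh` wrapper script are all made up for illustration):

```shell
# Tier 1: local hardlink incrementals every 4 hours over the fast local bus.
0 */4 * * *  /usr/local/bin/snapshot-local.sh /data /backup/incrementals

# Tier 2: once a day, push the latest incremental offsite over a slower
# link; --link-dest keeps the offsite history space-efficient too.
30 2 * * *   rsync -a --delete --link-dest=/offsite/daily/yesterday \
                 /backup/incrementals/latest/ \
                 backupbox.example.com:/offsite/daily/today/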
..at my girlfriends house (Score:1)
What's a girlfriend?
Re: (Score:1)
"Wonder what capacities they come in."
There's a lot of variance, and bigger isn't necessarily better. Most capacities are specified in the form "x-y-z w/ nX", where x, y, z, and n are numbers, and X is an alphabetic designation that may consist of multiple letters. Many people attracted to women prefer to maximize x and z (while still having them be nearly equal) while minimizing y and n, and want a designation of "C" or "D" used for X.
Re: (Score:2)
I've pondered the same thing. My workplace spends 6 figures, and as far as I'm able to tell, gets significantly less than I have at home, despite my investment being 2 orders of magnitude less.
Every 2 hours for the last day, every day for the last week, every week for the last month, every month for the last year, every year forever. Physically backed up to 2 distinct discs in-house (one of which is pretty burglar-proof, living in a safe), and 2 encrypted copies under the care of 2 distinct companies, in diffe
Re: (Score:2)
Just curious; what 5TB worth of personal data requires a 4-figure backup spending?
Re: (Score:1)
Re: (Score:2)
Just curious; what 5TB worth of personal data requires a 4-figure backup spending?
Porn
Re: (Score:2)
Where things start to get more complicated is when the data being stored requires some massaging before you can take a copy - or for that matter if you can only take a copy under specific circumstances or your copy is only useful under specific circumstances.
For instance:
Most modern databases store their data in files on the disk. Database transactions are atomic, sure. And (hopefully, assuming a modern FS) so are disk transactions. This does not mean, however, that you can simply copy the underlying fil
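To make that point concrete, here is a minimal sketch using SQLite's online-backup command (SQLite is just a stand-in for illustration; every engine has its own mechanism, e.g. pg_dump for PostgreSQL — copying the raw files out from under a live database can capture a half-written state):

```shell
#!/bin/sh
# The safe route is the engine's own backup facility, which produces a
# transactionally consistent copy. Demo path is a throwaway temp file.
set -e

DB="$(mktemp -u).db"

sqlite3 "$DB" "CREATE TABLE t(x INTEGER); INSERT INTO t VALUES (42);"

# .backup uses SQLite's online backup API: consistent even if the
# database is being written to at the same time.
sqlite3 "$DB" ".backup '$DB.bak'"
```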
Re: (Score:1)
Re: (Score:2)
It doesn't take 6 figures, and if it does and isn't as good as rsync, you need to find a new line of work.
Any backup solution should include tape, for a very simple reason: you end up with multiple offline copies of data at about $25 per terabyte. Your backup solution sounds like it gets knocked out if someone introduces bogus data into your system; once the backup occurs, it overwrites all your good backups.
A good backup system isn't even that expensive; about $4000 will get you an LTO4 autoloader with a fu
Re: (Score:2)
Your backup solution sounds like it gets knocked out if someone introduces bogus data into your system; once the backup occurs, it overwrites all your good backups.
Ummm...you need to look at what cp -il and rsync -H do. Let's just say, to be polite about it, that you're completely and utterly wrong. I have exact images of my systems at 4-hour intervals available instantly just by going to my backup drive. No need to go digging through tapes, spending hours looking through dumps and incrementals or trying to figure out if it's on an offsite tape. I suspect you're one of those people who set up the backups for one of the companies I've worked for. Tape is utterly useless in thi
Re: (Score:2)
I have used (and still do use) rsync; I missed the -H option in your post, but regardless, all of your history is in one place.
The biggest problem I have seen with rsync is that in directory structures with tens of thousands of tiny files, it takes a VERY long time to search for changes, which can be a problem if you have, for example, database files which need to be taken out of use for backup (the particular database I'm dealing with doesn't have a "dump" command, as it, itself, is used for remote backup a
Re: (Score:2)
all of your history is in one place.
No, it's not. It's here and at my offsite location (girlfriend's house atm).
Also, good luck rotating 24TB of removable storage; with a budget of $6500
There's no rotation required. Once it's set up the only time you have to touch a drive is to replace a bad one. With well less than half that I could build a backup server with 24TB. For that much I could have RAID 1 on the backup server. Not only that but I can have backup images at, say, 4 hour intervals instantly available at any given moment. No trying to get an offsite tape hoping they get the right one or haven't lost it or get
Re: (Score:2)
No, it's not. It's here and at my offsite location (girlfriend's house atm).
Which means you have 2 copies. The tape system I will be setting up this week (~$5000) will give me 20 copies, on LTO4 tapes. That's 20 entire(ish) backups of our network, with the ability to roll back to any date within the last 3 weeks (our supplemental backup system covers dates further back).
With well less than half that I could build a backup server with 24TB. For that much I could have RAID 1 on the backup server.
Have fun with your error and drive failure rates. 24 drives @ 2TB each (RAID 1) means an awful lot of failures each year. That's also a LOT of arrays to present to your system; in order to have any reasonable kind
Re: (Score:2)
Sorry for the double post, but I also wanted to point out:
Also, good luck rotating 24TB of removable storage; with a budget of $6500
That $6500 budget went towards both the autoloader, and ~140TB of tape storage-- that is, 5 or so sets of ~24TB each (100 tapes).
Tape is currently $16/TB or so. HDD platters, for the absolute cheapest deals (1TB drives), are around $40/TB-- 3x the price-- and require RAID cards to drive them, as well as hotswap cages. You will not be able to match tape prices for a long, long time, if ever.
Re: (Score:2)
And a fat lot of good that did.
Re: (Score:1)
bring it down and cause data lass.
I'm a lad, you insensitive clod!
Re: (Score:1)
"a data center run by Amazon certainly has more rigorous backup and maintenance schedules than anything I could personally come up with, offhand."
Differential backups every five minutes across three different backup systems rolling RAID-6, one local and two remote.
I don't have data loss issues, EVER.
And I'm just a low-level tech guy. If Amazon can't get it right and I can, something is wrong.
Re: (Score:1)
I am not rightly able to comprehend how someone whose primary data source is people entering data via their website could have it elsewhere. Sure, they can have copies elsewhere, but those would be the backups.
Re:I am not rightly able to comprehend... (Score:5, Informative)
From a look at the linked article, it seems that one of the issues is data generated by these web sites. Such as user statistics, or user uploaded content, etc. That naturally lives primarily on the live web server and is also data that you don't want to lose. Also as other commenters mentioned as well the EC2 service is not a cloud-storage server, it's a web hosting service, and web hosts tend to indeed generate their own data.
This data of course needs to be actively backed up, and one would expect a web host to include that in its service. That's one of the reasons to pay for such a service, instead of doing it yourself.
Besides relying on their backups it's of course a good idea to regularly take backups yourself. But even if you do this daily, it means you may lose up to a day's worth of data. And that's (partly) what happened here. It's similar to someone who takes a photo on a digital camera, and subsequently loses that camera and the photo with it. You don't say "they shouldn't use a camera as primary data repository". It isn't. It's a temporary repository, and when the data is generated it's the one and only repository, simply pending copying to backup media.
Re: (Score:2)
Re: (Score:2)
I guess I'm still stuck in Commodore 64 World, or something..
Cassette tapes? I'm so very sorry.
Lost data? (Score:3)
Clouds are ephemeral (Score:2)
Who knew?
Re:Clouds are ephemeral (Score:5, Informative)
Cloud applications hosted on Amazon survived this incident without issue, as expected. Only the regular old hosted applications had problems with the outage. They were never "the cloud" to begin with, so I'm not sure why the term even comes up in this discussion.
The cloud represents a black box that hides the underlying network topology so that there are no single points of failure. Cloud applications are tolerant of failure because they are spread through different datacenters across multiple points in the world. A catastrophe at one or more datacenters will have no noticeable effect on the availability of a cloud application because it continues to run in many more.
Amazon offers a few cloud applications: S3 comes to mind. But Amazon's EC2/EBS hosting service is a plain old hosting service like any other. The EC2 topology is not hidden away from you. You have to make active decisions about where you want your EC2 instance to live. That goes against the idea of the cloud. What Amazon does offer in EC2 is the tools necessary for you to build a cloud application, but not everything hosted on EC2 is a cloud application by default.
Re: (Score:1)
When you put it like that, it's hard to just bash things.
Stupid facts.
Re: (Score:2)
You're right, by default EC2 isn't a cloud solution, but Amazon doesn't help alleviate that confusion (from their website for EC2):
Amazon Elastic Compute Cloud (Amazon EC2) is a web service that provides resizable compute capacity in the cloud. It is designed to make web-scale computing easier for developers.
It doesn't help when the name of the service (EC2) has the word "Cloud" in it.
Re: (Score:2)
Re: (Score:2)
That's exactly what it is.
It's confusing a lot of people because something sold as a cloud application (i.e. SaaS) may or may not be designed with HA in mind, and the vendor likely won't tell you. If it is, and the underlying infrastructure is sound, you're probably OK. Hopefully.
If it's not, it's not much different to an application running on some server in a co-lo somewhere, the only real difference is that you don't lease the server directly and you're not responsible for any backups.
Then you've got vir
Re: (Score:2)
In networking, the cloud has always represented an abstract network whose implementation details are unknown – it just magically works, thanks to the hard efforts of third parties. It is only lately that some marketing types want to exploit the term to make it mean something else.
What is S3? (Score:5, Informative)
EC2 is not meant to be used for data storage, that is what S3 is designed for. You store data and backups on S3, and use EC2 to serve high bandwidth websites to the masses.
Re: (Score:1)
And that's exactly how EBS is supposed to be backed up - it saves snapshots all the time to S3. Small and cheap incremental backups stored to a 99.999999999% durable [amazon.com] storage area. But apparently, Amazon messed up the backed up copies as well - instead of producing an outdated, but valid snapshot, they replied to affected customers with:
Re: (Score:1)
The quote is about the snapshots they took before they started recovery efforts. If you took snapshots of your EBS volumes regularly, these were not affected at all...
Re: (Score:2)
EC2 is not meant to be used for data storage, that is what S3 is designed for. You store data and backups on S3, and use EC2 to serve high bandwidth websites to the masses.
I don't think this is a fair criticism of people who lost data.
S3 isn't designed as an online datastore for live applications. Sure, I can put any content in there that I want, but it can't be up-to-the-millisecond.
AWS said to consider EBS volumes to be like hard disks, with a similar failure rate to hard disks. I forget the expected failure rate that they posted, but I think it was roughly between 1:100 and 1:1000 EBS volumes should be expected to fail each year. So go ahead and make your usual solutions w
Re: (Score:2)
I for one am glad to be connected, and obviously so are many others. Don't use services that aren't good for you - there are some cloud based services that are great, and some that aren't. It's pretty clear that in the future, things will be more
Re: (Score:2)
There's something simplistically technocratic about assuming that what is now is better than what has been.
Buy X! It's newer, thus better, than Y!
Because the economy's like a religion, set up so that people lose their jobs and their homes if you don't needlessly produce and consume things of no value.
Re: (Score:2)
Yes, because building your own datacentre, or paying hosting fees to a five-nines-plus facility, costs nothing. Air conditioning, batteries, generators, fire suppression, multiple redundant network connections: that stuff's all free. A mainframe solves it all!!
Look, a quality DC costs millions to build or tens of thousands to rent space in. Servers and mainframes cost money to manage, support and spare out. If you're starved for capital, why wouldn't you use EC2+EBS+S3 for a few bucks a month, rather t
Did this save Wikileaks? (Score:2, Funny)
Availability zones (Score:1)
Re: (Score:1)
Availability zones are just one part of the picture for a good software design in data centers. While they are isolated from each other, they may still be in the same location (or within the same general area), and could more easily be hit with a network partition, floods, tornadoes, etc. Instead, as most big businesses know (like Netflix, which didn't suffer from the outage), you need regional separation in addition to availability zones. Amazon does provide this functionality, it just costs more, and most
Re: (Score:2)
Availability Zones are distinct locations that are engineered to be insulated from failures in other Availability Zones and provide inexpensive, low latency network connectivity to other Availability Zones in the same Region. By launching instances in separate Availability Zones, you can protect your applications from failure of a single location.
Rather than using a different region, I think it is better to have multiple cloud providers...
Post mortem Amazon explanation (Score:5, Informative)
http://aws.amazon.com/message/65648/ [amazon.com]
any data loss can be bad to a Website operator. (Score:2)
any data loss can be bad to a Website operator.
any data loss is catastrophic, if it's your data. They claim "a small percentage" of data was lost... 1% is a small percentage... 10% is also a small percentage, but it's a huge amount of data.
Fortunately where I live and work there isn't really sufficient and reliable connectivity to "the cloud" to make it a worthwhile endeavor, so hopefully all the mistakes are learnt from before I have to worry about it.
Re: (Score:2)
any data loss is catastrophic, if it's your data.
No, it's only catastrophic if you're an idiot. Then again many website operators seem to be just that given how many need to use google cache to recover data after their web provider's server croaks.
Anyway, having your data in any single unreliable location is a recipe for disaster. And yes, with a 0.5-1% annual failure rate EBS is unreliable and no one claims otherwise. If you want reliable you use S3 and off-site backups.
Re: (Score:2)
No, it's only catastrophic if you're an idiot. Then again many website operators seem to be just that given how many need to use google cache to recover data after their web provider's server croaks.
Anyway, having your data in any single unreliable location is a recipe for disaster. And yes, with a 0.5-1% annual failure rate EBS is unreliable and no one claims otherwise. If you want reliable you use S3 and off-site backups.
Please explain to me how I can keep my data in S3 and/or offsite backups up-to-the-millisecond.
I'll wait.
Re: (Score:2)
And who is forcing you to use EC2 and EBS? Is there a gun to your head? Why in god's name are you using an infrastructure that is clearly not compatible with your needs?
Re: (Score:2)
And who is forcing you to use EC2 and EBS? Is there a gun to your head? Why in god's name are you using an infrastructure that is clearly not compatible with your needs?
Was that supposed to be an answer to my question? Is every application that fails to store all of its data immediately in S3 "clearly not compatible" with EC2?
Re: (Score:2)
If losing the intermittent data is catastrophic then yes they're not compatible. Find a different solution.
That said, database replication is a very old problem and solutions exist to that. Likewise, some applications simply don't suffer too much from losing a bit of data so the cost of that is low. Other applications have no data to lose since they're simply acting as data serving platforms. And that's all for web applications which aren't quite what EC2 was made for, after all it's called "elastic compute
Re: (Score:2)
That said, database replication is a very old problem and solutions exist to that.
Thank you for at least attempting to answer my question.
While your answer does not involve the use of S3, it is exactly the answer that should have worked, but did not in the case of yesterday's outage. Replicate your database to a different availability zone. Great, except EBS failed region-wide, so your slave database just died, too.
Not that I'm really even all that upset about the outage. My application was down for a few hours and degraded for a few more hours until I could replay the transactions that
Amazon is pretty up-front about expected data loss (Score:1)
Unless you pay extra, they say you can expect to lose data stored in S3 on a regular basis. There's nothing wrong with that per se, but it's something you need to plan for.
S3:
http://aws.amazon.com/s3/ [amazon.com]
EBS:
Clarification (Score:3)
The durability you quote for S3 (99.99%) is for the reduced redundancy option. The standard storage lists 99.999999999% durability.
Store a backup yourself (Score:3)
This is not the first time I've heard about a big hosting centre losing data, even though that supposedly never happens because they keep backups, etc.
If it's at all manageable, keep one copy safe at your own place in addition to the replication at the hosting centre. You can set up a cheap box at the office with a couple of terabytes of disk space and suck down the data periodically with something like rsync and rdiff-backup. It's not a whole lot of work and can make the difference between a big problem and a total disaster.
It would help if hosting centres actually told you exactly how they store and back up your data and what they do in case of emergency, instead of throwing meaningless phrases like "99.999% uptime!" and "fully redundant storage backbone!" at you. A fully redundant storage backbone is nothing if it's built from some big-arse proprietary SAN where the whole array goes down when the main controller goes down. Which of course it does, because it's a flaky embedded thing with 2k of memory that has to be programmed in assembler and C with dangling memory pointers all over the place.
Re: (Score:2)
Very good advice. One issue is a *lot* of their users are commercial companies that viewed this as a way not to sweat the details at all. For many of those, if they have to sweat backup and all that, they might as well do the hosting themselves because the cost delta for them is not particularly large.
Re: (Score:1)
Re: (Score:2)
Be careful with your quoting from the middle of a sentence. When you quoted, "Amazon EBS snapshot can expect an annual failure rate (AFR) of between 0.1% – 0.5%, where failure refers to a complete loss of the volume," that gave the impression the EBS snapshots had an AFR of 0.1% – 0.5%. Actually, EBS snapshots are stored on S3, so they have a durability rate of 99.999999999% each year.
It's EBS volumes, themselves, that have 0.1% – 0.5% annual failure rates. I'm sure you already knew this,