Chaos Monkey Released Into the Wild 76
Quince alPillan writes "Netflix revealed today that they've released Chaos Monkey, an open source Amazon Web Service testing tool that will randomly turn off instances in Auto Scaling Groups. 'We have found that the best defense against major unexpected failures is to fail often. By frequently causing failures, we force our services to be built in a way that is more resilient. We are excited to make a long-awaited announcement today that will help others who embrace this approach. ...source code for the founding member of the Simian Army, Chaos Monkey, is available to the community.'"
Re:Into the wild? (Score:5, Insightful)
Sound idea, sure. But not a substitute for good engineering. You see this issue come up again and again with these cloud services. The pressure from sales and marketing to move quickly and monetize the idea (and support lots of subscribers quickly) is not conducive to building a solid infrastructure. Netflix's approach is actually the exact opposite of Amazon's. Amazon's system is highly engineered and designed to resist failures that take down Amazon.com for it's customers. That is their number one goal. Amazon.com has not been down for a long time. AWS is an offshoot of that effort to resell their extra cycles but it's not nearly as engineered at the Amazon.com application built on top, which redirects around the globe and does lots of other things. It seems that AWS always has some new service coming out, but think about this: all those services were probably made by Amazon 3 years ago and they are just now releasing them to you..
Netflix, on the other hand, seems to be just hacking together a site, if this is really what they primarily used to QA their application. What you're doing with this random failure thing is just statistically creating errors and finding bugs in failure handling code statistically. This means there's _up to_ an infinite number of bugs that will *not* be found with this method because they are unlikely or the tester is unlucky.
It certainly has to do with the math of it, but it also has to do with the human culture that arises when working like this. See, with this brute force iterative programming, you are building a nest of patches. So what you are going to end up with is going to be more complicated and less functional than if you do the hard work. And that's the issue. Thinking about stuff in terms of thousands or millions of nodes is "too hard" so the aforementioned cloud providers keep coming up with "creative solutions" like this. (I remember reading about Facebook hacking mysql a few years back and shaking my head as well..) But, like "creative accounting", it may not be illegal but it may get you into trouble. You're never going to be absolutely sure the application will stay up and available. Ok, fine, so it Netflix goes down no ones going to die, but still...there's millions of dollars and subscriber goodwill at stake and that's not nothing.
Anyway, don't think that I'm railing against creative testing, but they shouldn't think they are so clever as the release seems to imply they think they are ;)
Re:Into the wild? (Score:4, Insightful)