
Chaos Monkey Released Into the Wild

Quince alPillan writes "Netflix revealed today that they've released Chaos Monkey, an open source Amazon Web Service testing tool that will randomly turn off instances in Auto Scaling Groups. 'We have found that the best defense against major unexpected failures is to fail often. By frequently causing failures, we force our services to be built in a way that is more resilient. We are excited to make a long-awaited announcement today that will help others who embrace this approach. ...source code for the founding member of the Simian Army, Chaos Monkey, is available to the community.'"
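As a rough illustration of the behavior the summary describes (randomly turning off instances in Auto Scaling Groups), here is a minimal in-memory sketch. It is not Netflix's code and makes no AWS calls; the function names `pick_victims` and `terminate`, the `probability` parameter, and the group and instance names are all hypothetical, chosen only for this example.

```python
import random


def pick_victims(groups, probability=1.0, rng=None):
    """Pick at most one instance per group to turn off.

    groups: dict mapping a group name to a list of instance ids
            (hypothetical data, standing in for Auto Scaling Groups).
    probability: chance that a given group loses an instance this run.
    rng: optional random.Random, so a run can be made repeatable.
    """
    if rng is None:
        rng = random.Random()
    victims = {}
    for name, instances in groups.items():
        # Skip empty groups; otherwise roll the dice for this group.
        if instances and rng.random() < probability:
            victims[name] = rng.choice(instances)
    return victims


def terminate(groups, victims):
    """Simulate termination: return new groups without the chosen instances."""
    return {
        name: [i for i in instances if i != victims.get(name)]
        for name, instances in groups.items()
    }


if __name__ == "__main__":
    groups = {"web": ["i-1", "i-2", "i-3"], "api": ["i-4", "i-5"]}
    victims = pick_victims(groups, probability=0.5)
    print("terminating:", victims)
    print("remaining:", terminate(groups, victims))
```

In a real deployment the `terminate` step would call the cloud provider's API, and the point of the exercise is to watch whether the service keeps working after each instance disappears.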

  • Re:Into the wild? (Score:5, Insightful)

    by inKubus ( 199753 ) on Tuesday July 31, 2012 @03:14AM (#40825983) Homepage Journal

    Sound idea, sure. But not a substitute for good engineering. You see this issue come up again and again with these cloud services. The pressure from sales and marketing to move quickly and monetize the idea (and support lots of subscribers quickly) is not conducive to building a solid infrastructure. Netflix's approach is actually the exact opposite of Amazon's. Amazon's system is highly engineered and designed to resist failures that take down Amazon.com for its customers. That is their number one goal. Amazon.com has not been down for a long time. AWS is an offshoot of that effort, meant to resell their extra cycles, but it's not nearly as engineered as the Amazon.com application built on top, which redirects around the globe and does lots of other things. It seems that AWS always has some new service coming out, but think about this: all those services were probably made by Amazon 3 years ago and they are just now releasing them to you.

    Netflix, on the other hand, seems to be just hacking together a site, if this is really what they primarily use to QA their application. What you're doing with this random failure thing is just creating errors at random and finding bugs in the failure-handling code statistically. This means there's _up to_ an infinite number of bugs that will *not* be found with this method because they are unlikely or the tester is unlucky.

    It certainly has to do with the math of it, but it also has to do with the human culture that arises when working like this. See, with this brute force iterative programming, you are building a nest of patches. So what you are going to end up with is going to be more complicated and less functional than if you do the hard work. And that's the issue. Thinking about stuff in terms of thousands or millions of nodes is "too hard" so the aforementioned cloud providers keep coming up with "creative solutions" like this. (I remember reading about Facebook hacking MySQL a few years back and shaking my head as well.) But, like "creative accounting", it may not be illegal but it may get you into trouble. You're never going to be absolutely sure the application will stay up and available. Ok, fine, so if Netflix goes down no one's going to die, but still... there's millions of dollars and subscriber goodwill at stake and that's not nothing.

    Anyway, don't think that I'm railing against creative testing, but they shouldn't think they are as clever as the release seems to imply ;)

  • Re:Into the wild? (Score:4, Insightful)

    by eyrieowl ( 881195 ) on Tuesday July 31, 2012 @09:11AM (#40827557)
    There are a lot of things that can go wrong in failover scenarios. Unless and until they are tested in real world situations, you can't be certain the system works. I happen to know of many systems which had failover processes which were "tested", and sounded fine on paper, but when it came to the real world, they had failed to account for this or that unexpected condition, which ended up leading to far more downtime than was expected. If Chaos Monkey is their ONLY way of arriving at a resilient service, then sure, they have a deeper issue. But if they've spent time trying to design a solid system and then they're using Chaos Monkey to make sure it's as bullet-proof as they think it is, then it's good, solid engineering for the real world. I am reminded of the book "Inviting Disaster", on technology failures. All the systems described in the book which failed were well engineered systems. But due to a series of events working in concert, disaster happened. Any one link in the chain of failures wouldn't be enough; and it is not possible to fully engineer that out of your system; and certainly not possible to test for that in controlled testing environments. But if you can start causing failures in the real world (which is a luxury you have with systems that don't actually keep people alive), you have the opportunity to eliminate those sorts of weaknesses from the system. That's what I think is the value of something like this.
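The objection raised earlier in the thread, that random failure injection only finds bugs "statistically" and will miss unlikely ones, can be put in numbers. The sketch below assumes each injected failure is an independent trial that triggers a given bug with a fixed probability p, so the chance of never seeing it in n trials is (1 - p)^n. That independence assumption is a simplification made for this example, and the function names are hypothetical; real outages often involve correlated failures.

```python
import math


def miss_probability(p, trials):
    """Chance that a bug triggered with probability p per trial is never
    seen in `trials` independent trials (assumed model: (1 - p) ** trials)."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must be strictly between 0 and 1")
    if trials < 0:
        raise ValueError("trials must be non-negative")
    return (1.0 - p) ** trials


def trials_needed(p, confidence):
    """Smallest number of independent trials for which the bug shows up
    at least once with probability >= confidence."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must be strictly between 0 and 1")
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be strictly between 0 and 1")
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - p))


if __name__ == "__main__":
    for p in (0.1, 0.01, 0.001):
        print(p, trials_needed(p, 0.95), miss_probability(p, 100))
```

Under these assumptions, a bug with a 1% chance of being triggered per injected failure needs roughly 300 injected failures to be seen with 95% confidence, which supports both sides of the thread: rare failure modes do stay hidden for a long time, and running failures continuously in production is what makes the number of trials large enough to matter.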
