Virtualization IT

Chaos Monkey Released Into the Wild (76 comments)

Quince alPillan writes "Netflix revealed today that they've released Chaos Monkey, an open source Amazon Web Services testing tool that will randomly turn off instances in Auto Scaling Groups. 'We have found that the best defense against major unexpected failures is to fail often. By frequently causing failures, we force our services to be built in a way that is more resilient. We are excited to make a long-awaited announcement today that will help others who embrace this approach. ...source code for the founding member of the Simian Army, Chaos Monkey, is available to the community.'"
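
As a rough illustration of what the summary describes (randomly turning off instances in Auto Scaling Groups), here is a minimal Python sketch. It is a simulation only: it makes no AWS calls and is not Chaos Monkey's actual code. The function names `pick_victims` and `terminate`, the `probability` parameter, and the group and instance ids are all hypothetical, chosen just for this example.

```python
import random


def pick_victims(groups, probability, rng):
    """Pick at most one instance per group (hypothetical helper).

    groups: dict mapping a group name to a list of instance ids.
    probability: chance, per group, that one of its instances is picked.
    rng: a random.Random instance, passed in so runs can be reproduced.
    """
    victims = {}
    for name, instances in groups.items():
        # Skip empty groups; otherwise roll the dice for this group.
        if instances and rng.random() < probability:
            victims[name] = rng.choice(instances)
    return victims


def terminate(groups, victims):
    """Return a new dict of groups with the chosen instances removed.

    Simulation only: a real tool would ask the cloud provider to
    terminate the instance instead of editing a dict.
    """
    return {
        name: [i for i in instances if i != victims.get(name)]
        for name, instances in groups.items()
    }


if __name__ == "__main__":
    # Made-up groups and instance ids, for illustration only.
    fleet = {"web": ["i-web-1", "i-web-2", "i-web-3"], "api": ["i-api-1", "i-api-2"]}
    chosen = pick_victims(fleet, probability=0.5, rng=random.Random())
    print("terminating:", chosen)
    print("remaining:", terminate(fleet, chosen))
```

The point of the exercise is what happens afterwards: a resilient service should keep serving traffic while the group replaces the missing instance.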
This discussion has been archived. No new comments can be posted.

  • Into the wild? (Score:5, Informative)

    by dubl-u ( 51156 ) * on Monday July 30, 2012 @08:38PM (#40824589)

    And by "into the wild", they mean they're now letting it run on other people's sites.

    • by jcoy42 ( 412359 ) on Monday July 30, 2012 @10:57PM (#40825273) Homepage Journal

      This is why we don't let you write headlines.

    • by Anonymous Coward

      I think the concept is good: if the goal is to withstand failures of an unforeseen nature, then test with random failures and observe how the software reacts.

      In practice, it will likely get you fired in a production environment, but I still think the idea behind it is sound.

      • Re:Into the wild? (Score:5, Insightful)

        by inKubus ( 199753 ) on Tuesday July 31, 2012 @02:14AM (#40825983) Homepage Journal

        Sound idea, sure. But not a substitute for good engineering. You see this issue come up again and again with these cloud services. The pressure from sales and marketing to move quickly and monetize the idea (and support lots of subscribers quickly) is not conducive to building a solid infrastructure. Netflix's approach is actually the exact opposite of Amazon's. Amazon's system is highly engineered and designed to resist failures that take down Amazon.com for its customers. That is their number one goal. Amazon.com has not been down for a long time. AWS is an offshoot of that effort to resell their extra cycles but it's not nearly as engineered as the Amazon.com application built on top, which redirects around the globe and does lots of other things. It seems that AWS always has some new service coming out, but think about this: all those services were probably made by Amazon 3 years ago and they are just now releasing them to you.

        Netflix, on the other hand, seems to be just hacking together a site, if this is really what they primarily used to QA their application. What you're doing with this random failure thing is just statistically creating errors and finding bugs in failure handling code statistically. This means there's _up to_ an infinite number of bugs that will *not* be found with this method because they are unlikely or the tester is unlucky.

        It certainly has to do with the math of it, but it also has to do with the human culture that arises when working like this. See, with this brute force iterative programming, you are building a nest of patches. So what you are going to end up with is going to be more complicated and less functional than if you do the hard work. And that's the issue. Thinking about stuff in terms of thousands or millions of nodes is "too hard" so the aforementioned cloud providers keep coming up with "creative solutions" like this. (I remember reading about Facebook hacking MySQL a few years back and shaking my head as well..) But, like "creative accounting", it may not be illegal but it may get you into trouble. You're never going to be absolutely sure the application will stay up and available. Ok, fine, so if Netflix goes down no one's going to die, but still...there's millions of dollars and subscriber goodwill at stake and that's not nothing.

        Anyway, don't think that I'm railing against creative testing, but they shouldn't think they are so clever as the release seems to imply they think they are ;)

        • by dave420 ( 699308 )
          That's a lot of guesswork... I don't see many links backing your positions up.
          • That's a lot of guesswork... I don't see many links backing your positions up.

            His positions are superficial and emotional, that's all.

        • Re:Into the wild? (Score:4, Insightful)

          by eyrieowl ( 881195 ) on Tuesday July 31, 2012 @08:11AM (#40827557)
          There are a lot of things that can go wrong in failover scenarios. Unless and until they are tested in real world situations, you can't be certain the system works. I happen to know of many systems which had failover processes which were "tested", and sounded fine on paper, but when it came to the real world, they had failed to account for this or that unexpected condition which ended up leading to far more downtime than was expected. If Chaos Monkey is their ONLY way of arriving at a resilient service, then sure, they have a deeper issue. But if they've spent time trying to design a solid system and then they're using Chaos Monkey to make sure it's as bullet-proof as they think it is, then it's good, solid engineering for the real world. I am reminded of the book "Inviting Disaster", on technology failures. All the systems described in the book which failed were well engineered systems. But due to a series of events working in concert, disaster happened. Any one link in the chain of failures wouldn't be enough; and it is not possible to fully engineer that out of your system; and certainly not possible to test for that in controlled testing environments. But if you can start causing failures in the real world (which is a luxury you have with systems that don't actually keep people alive), you have the opportunity to eliminate those sorts of weaknesses from the system. That's what I think is the value to something like this.
        • by Anonymous Coward

          The superiority of Amazon's engineering culture over Netflix must be the reason why during the last major EC2 outage, Netflix managed to stay up and operational...

          • by inKubus ( 199753 )

            To clarify what I specifically wrote in my post, Amazon.com (Amazon's application, where they make the money), has not been down in a long time. The Virginia EC2 outage only affected the excess capacity they resell to AWS customers. I'm not singling out Netflix and I'm not saying that this is a bad or horrible or un-useful tool. I appreciate all the stuff Netflix is open-sourcing.

        • by Anonymous Coward

          I've been using Netflix for years now - and I've only had trouble with the streaming service once, maybe twice in that time. Hulu often freaks out, Vudu the same - so I think the proof is in the pudding. Lastly, this is just an additional testing piece, and frankly it is very cool. Nowhere did they say it was a substitute for good engineering.

        • If Netflix is hacking together a site, why is their HD streaming more reliably pleasing than any other online service, including places like Comcast, which presumably has 100x the engineers on hand? Maybe they are good at teh hacking?

        • by rwa2 ( 4391 ) *

          Meh, what's the point of good engineering if you never test it? I've heard of quite a few wonderfully expensive and over-engineered UPS and RAID deployments that failed completely because they never bothered to actually test the procedures. The last company I worked at would often have regular "emergency power off" events where they'd do a complete shutdown of the entire datacenter triggered by various environmental factors. And you know what? More times than not they'd still find a system that someho

        • That this post was modded 5 is a sad testament to slashdot.

          Sound idea, sure. But not a substitute for good engineering.

          That argument only makes sense if it were the case that Netflix is using it in lieu of good engineering. But, it isn't, so...

          Also, this is a false dichotomy. Chaos Monkey is in great part a form of fault injection, which itself is part of good engineering.

          You see this issue come up again and again with these cloud services.

          Like amazon EC2?

          The pressure from sales and marketing to move quickly and monetize the idea (and support lots of subscribers quickly) is not conducive to building a solid infrastructure. Netflix's approach is actually the exact opposite of Amazon's.

          You know this for a fact, or is it pure speculation?

          Amazon's system is highly engineered and designed to resist failures that take down Amazon.com for its customers. That is their number one goal. Amazon.com has not been down for a long time. AWS is an offshoot of that effort to resell their extra cycles but it's not nearly as engineered as the Amazon.com application built on top, which redirects around the globe and does lots of other things. It seems that AWS always has some new service coming out, but think about this: all those services were probably made by Amazon 3 years ago and they are just now releasing them to you.

          Great non sequitur.

          Netflix, on the other hand, seems to be just hacking together a site, if this is really what they primarily used to QA their application.

          Seems? Seems? First you state in very certain terms that Netflix is doing the exact opposite to Amazon. A

    • Re:Into the wild? (Score:4, Informative)

      by arkhan_jg ( 618674 ) on Tuesday July 31, 2012 @03:40AM (#40826283)

      Seems to be more that they've just published the source code on GitHub [github.com] under the Apache licence.

      So you can run your own Chaos Monkey on your own Amazon cloud systems, or modify it to run on your private cloud, or whatever.

  • Very Erlang-y (Score:3, Informative)

    by Anonymous Coward on Monday July 30, 2012 @09:00PM (#40824703)

    We have found that the best defense against major unexpected failures is to fail often. By frequently causing failures, we force our services to be built in a way that is more resilient.

    Sounds like what has been common in Erlang for decades. [wikibooks.org]

    Off topic: when I watch the /. homepage, I am logged in. As soon as I click on a story, I become an Anonymous Coward. Did anybody else experience this bug too?

    • I don't have such a problem.

    • Off topic: when I watch the /. homepage, I am logged in. As soon as I click on a story, I become an Anonymous Coward. Did anybody else experience this bug too?

      Some would see this as a super power... of course they're already trolls, but mighty trolls with super powers.
      Seriously, I've seen something like this in FF with some privacy plugins, but it's been a while.

    • You're probably disabling subdomain cookies. For instance right now we're not on slashdot.org, we're on it.slashdot.org.

  • by valentinas ( 2692229 ) on Monday July 30, 2012 @09:05PM (#40824727)
    I thought this was about monkeys...
    • by Anonymous Coward

      I'm just wondering what makes Chaos Monkey different than Timetwister, and how much mana it costs.

      • by Anonymous Coward

        One black, one red, one green, one blue and one white mana + X, where X is random. Throw any in-play instant across the room: if it lands face-down, eat a banana. If it lands face-up, the instant is played as normal.

    • You would get approximately the same result if you let an actual monkey loose in your server room though.
  • What an apt name for our new help desk tech!
  • The Truth (Score:2, Informative)

    by paleo2002 ( 1079697 )
    War, famine, violence, addiction, pollution . . . truly, WE are the Chaos Monkeys!
  • Now we see the beginning of the Army of the 12 Monkeys. We're doomed ...
  • by CODiNE ( 27417 ) on Monday July 30, 2012 @10:48PM (#40825223) Homepage

    MonkeyLives [folklore.org]

    • Don't know if you noticed this:

      We kept our system flags in an area of very low memory reserved for the system globals, starting at address 256 ($100 in hexadecimal)

      100 bucks for an address?

      Cool story, though.

  • I love this thing (Score:4, Interesting)

    by ghostdoc ( 1235612 ) on Monday July 30, 2012 @11:41PM (#40825463)

    Not only for the idea that a serious company lets a masturbating-and-throwing-poo grinning idiot loose in their sensitive vitals, but also because it draws so many parallels with other resilient systems.

    Allergies cured by parasitical worms? Chaos Monkey Effect - you need something attacking your defences for your system to stay healthy

    Ecosystem that relies on bushfires to clear old vegetation? Chaos Monkey Effect

    Something almost Zen about not only turning an attacker's violence against them, but deliberately introducing new attackers so your system is strengthened by them.

    Well done chaps, carry on.

    • I was going to attack your attack of bushfires... until I re-read your allergy attack sentence and realized you have it right. Good work, fellow ecology nerd.
  • Java, meh (Score:5, Funny)

    by codepunk ( 167897 ) on Monday July 30, 2012 @11:58PM (#40825537)

    Leave it to some java developers to write 100k lines of code to do a shutdown -h now.

  • at first I was thinking the article was about this chaos monkey.
    http://www.wtop.com/681/2859976/Rock-throwing-chimp-plans-complex-attacks-on-visitors [wtop.com]

  • by Anonymous Coward

    Congrats team, give yourselves a slap in the face!

  • Except they need to randomly turn off the network connection in their test environment. It's amazing how many mobile apps assume you'll always have a solid connection and never be in an elevator, or walking between tall buildings, or the basement of a convention center, or any other place with a spotty or overloaded signal.

  • I didn't tell anyone about the chaos monkey.... Oh. It's just some program. Carry on then.
  • The media war is getting serious. Chaos Monkeys? How about you get Stars back?
  • by retroworks ( 652802 ) on Tuesday July 31, 2012 @01:50AM (#40825895) Homepage Journal

    "We have found that the best defense against major unexpected failures is to fail often."

    In other words, you'll never be disappointed if you expect total incompetence. I've already achieved this same thing on my own with my Netflix account, by completely and utterly lowering my expectations.

  • Script kiddies are released on the internet to improve security by exploiting unchecked buffers and unsanitized inputs...
    Security of information at all time high.

    Errrr...

  • Jeff Atwood has a blog post, Working with the Chaos Monkey [codinghorror.com].
  • than it actually was.

    I was picturing a wild, multicolored, gene-spliced ball of fur tearing around, shoving badgers in lion's ears 'n shit.
  • Chaos Monkey start up get working Chaos Monkey is a hoot Chaos Monkey's preferred new target is instance of a group... (the rest is left as an exercise for Coulton fans)
