Facebook IT

Making Facebook Self Healing

New submitter djeps writes "I used to achieve some degree of automated problem resolution with Nagios Event Handler scripts and RabbitMQ, but Facebook has done it on a far larger scale than I ever did in my sysadmin days. Quoting: 'When your infrastructure is the size of Facebook's, there are always broken servers and pieces of software that have gone down or are generally misbehaving. In most cases, our systems are engineered such that these issues cause little or no impact to people using the site. But sometimes small outages can become bigger outages, causing errors or poor performance on the site. If a piece of broken software or hardware does impact the site, then it's important that we fix it or replace it as quickly as possible. ... We had to find an automated way to handle these sorts of issues so that the human engineers could focus on solving and preventing the larger, more complex outages. So, I started writing scripts when I had time to automate the fixes for various types of broken servers and pieces of software.'"
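
For readers who have not built this kind of thing, the general pattern in the summary (try a scripted fix for a known problem, hand anything else to a human) can be sketched roughly as follows. This is only an illustration and not Facebook's actual system: the `Alert` type, the `handle_alerts` function, and the problem labels are all hypothetical names made up for the example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Alert:
    host: str
    problem: str  # hypothetical label, e.g. "service_down"


# A remediation takes a host name and returns True if the fix worked.
Remediation = Callable[[str], bool]


def handle_alerts(alerts: List[Alert],
                  remediations: Dict[str, Remediation]) -> List[Alert]:
    """Try a scripted fix for each alert; return the ones that need a human."""
    escalated = []
    for alert in alerts:
        fix = remediations.get(alert.problem)
        if fix is None:
            # No known automated fix for this kind of problem.
            escalated.append(alert)
            continue
        try:
            fixed = fix(alert.host)
        except Exception:
            # A crashing fix script counts as a failed fix.
            fixed = False
        if not fixed:
            escalated.append(alert)
    return escalated
```

Whatever `handle_alerts` returns is the queue the human engineers would look at, which is the division of labor the quoted post describes.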

  • by Psychotria ( 953670 ) on Saturday September 17, 2011 @11:26PM (#37432060)

    We had to find an automated way to handle these sorts of issues so that the human engineers could focus on solving and preventing the larger, more complex outages.

    This seems backwards to me. Surely the "larger, more complex outages" are caused by an accumulation of, or interaction between, the smaller, less complex problems/situations. If all of the smaller problems are well understood and dealt with, then those more complex problems should not arise. I think it's dangerous to assume that, because the smaller problems can be transiently resolved by a script with minimal human intervention, they need less exploration. Sure, scripts to handle the less complex issues are great, but this should not shift the human engineers to "focus on solving and preventing complex outages"; solving those often (always?) means solving the less complex issues.

  • Comment removed (Score:5, Insightful)

    by account_deleted ( 4530225 ) on Saturday September 17, 2011 @11:39PM (#37432086)
    Comment removed based on user account deletion
  • by Ethanol-fueled ( 1125189 ) on Saturday September 17, 2011 @11:42PM (#37432094) Homepage Journal
    The FBAR system will be ineffective against the outages caused by their users leaving in droves for the next big thing.

    I blindly clicked the TFA link without checking that it was a Facebook link. Once I was at the page, I was halfway through it when a box popped up telling me "Please log in to continue." I closed the box and nothing happened. If I was thinking about joining Facebook, I sure wouldn't now after seeing that shithead pop-up. Fuck Facebook - you guys get on your knees and suck my dick, you beg for my information, and only then I might just give you my real age.

    And what's up with Facebook's IPO? Do their investors have a bunch of invisible Disney Dollars stashed away in Uncle Scrooge's money vault?
  • by hardtofindanick ( 1105361 ) on Sunday September 18, 2011 @03:17AM (#37432542)
    It seems to me like you are creating hypothetical scenarios of total failure. Most of the practical failure scenarios can be handled gracefully when you have Facebook's resources under your command. After all, they are not sending men to Mars. Distributed database problems have been studied for more than 30 years and are now well understood. There is pretty much nothing technologically interesting about Facebook (and Twitter for that matter).

    The sad part is someone [linkedin.com] writes his ramblings and puts a flow chart or two and it becomes a story on /.
