Making Facebook Self-Healing
New submitter djeps writes "I used to achieve some degree of automated problem resolution with Nagios Event Handler scripts and RabbitMQ, but Facebook has done it on a far larger scale than my old days of sysadmin. Quoting: 'When your infrastructure is the size of Facebook's, there are always broken servers and pieces of software that have gone down or are generally misbehaving. In most cases, our systems are engineered such that these issues cause little or no impact to people using the site. But sometimes small outages can become bigger outages, causing errors or poor performance on the site. If a piece of broken software or hardware does impact the site, then it's important that we fix it or replace it as quickly as possible. ... We had to find an automated way to handle these sorts of issues so that the human engineers could focus on solving and preventing the larger, more complex outages. So, I started writing scripts when I had time to automate the fixes for various types of broken servers and pieces of software.'"
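For readers who haven't used them: a Nagios event handler is simply a script that Nagios invokes when a check changes state. Below is a minimal sketch of the pattern the submitter mentions, in Python; the service name, script path, and restart command are placeholders, and none of this is Facebook's tooling.

    #!/usr/bin/env python3
    # Hypothetical Nagios event handler: restart a daemon once a failure is
    # confirmed. Nagios would invoke it with something like:
    #   command_line  /usr/local/bin/restart_httpd.py $SERVICESTATE$ $SERVICESTATETYPE$ $SERVICEATTEMPT$
    # The service ("httpd") and restart command are placeholders.
    import subprocess
    import sys

    def main():
        state, state_type, attempt = sys.argv[1], sys.argv[2], int(sys.argv[3])
        # Act only once the problem is confirmed (HARD state) or on the final
        # retry of a SOFT state; otherwise let Nagios keep re-checking.
        if state == "CRITICAL" and (state_type == "HARD" or attempt >= 3):
            subprocess.call(["/sbin/service", "httpd", "restart"])

    if __name__ == "__main__":
        main()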
Suggested change... (Score:2, Offtopic)
I know Slashdot has a tradition of being a "free-for-all, run through a blender," but I don't think there has ever been an AC first post that was anything but either:
- so lame that you wonder how a person survives such a terminal lack of personality or creativity, or
- something there was no real reason couldn't have been posted under a login.
Re: (Score:2)
s/cosmonaut/confidant/
Maybe you have confused Zuckerberg with Guy Laliberté or Mark Shuttleworth. Or perhaps with Richard Branson, who builds space tourism vehicles.
The song, however, has nothing to do with space travelers.
Complexity arising from simplicity (Score:3, Insightful)
We had to find an automated way to handle these sorts of issues so that the human engineers could focus on solving and preventing the larger, more complex outages.
This seems backwards to me. Surely the "larger, more complex outages" are caused by an accumulation of, or interaction between, the smaller, less complex problems. If all of the smaller problems are well understood and dealt with, those more complex problems should not arise. I think it's dangerous to assume that because the smaller problems can be transiently resolved by a script with minimal human intervention, the more complex problems need less exploration. Sure, scripts to handle the less complex issues are great, but they shouldn't shift the human engineers' focus entirely to "solving and preventing complex outages"; solving those often (always?) means solving the less complex issues.
Comment removed (Score:5, Insightful)
Re: (Score:2)
Larger outages in an infrastructure like Facebook's are only rarely an accumulation of smaller issues. Think about it: what's a more likely scenario for a major site-wide issue, thousands of web servers whose hard drives die simultaneously, or a flapping route caused by a configuration issue on a router?
Sometimes. But suppose, for example, you have a fail-over setup so that if one machine falls over, its work units or clients are automatically transferred to another machine. You're very proud of yourself until you get a damaged work unit or client that is capable of crashing whatever machine processes it; it then gets passed around to every server, causing a cascade failure until, 30 seconds later, all of your servers have crashed.
And sometimes you do get simultaneous "independent" failures.
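The cascade described above is the classic "poison pill" problem. Here is a rough sketch of one mitigation, quarantining a work unit after it has been implicated in a couple of crashes; every name is invented, and no claim is made that any real fail-over system works this way.

    # Hypothetical poison-pill quarantine during fail-over. Purely
    # illustrative: the point is to cap retries so one bad work unit
    # can't crash every server in turn.
    from collections import defaultdict

    MAX_CRASHES = 2                 # crashes a unit may be implicated in
    crash_counts = defaultdict(int)
    quarantined = set()

    def reassign_after_crash(failed_units, healthy_servers, assign):
        """Redistribute work from a crashed server, skipping suspect units.

        assign(unit, server) is a stand-in for the real scheduler call."""
        for unit in failed_units:
            crash_counts[unit] += 1
            if crash_counts[unit] > MAX_CRASHES:
                quarantined.add(unit)   # park it for a human to look at
                continue
            # spread the rest across the survivors instead of piling
            # everything onto a single fail-over target
            target = healthy_servers[hash(unit) % len(healthy_servers)]
            assign(unit, target)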
Re:Complexity arising from simplicity (Score:4, Insightful)
The sad part is that someone [linkedin.com] writes up his ramblings, adds a flow chart or two, and it becomes a story on Slashdot.
Re: (Score:1)
Re:Complexity arising from simplicity (Score:4, Informative)
TFA specifically uses an example of a failed hard drive to describe the workflow. You can see that a failed hard drive is something small, easily diagnosable, and -- in the greater scheme of things -- easily fixable.
Now, if you recall what happened with AWS in April, they had a low-bandwidth management network that all of a sudden had all primary EBS API traffic shunted onto it. This was caused by a human flipping a network switch when they shouldn't have. Something like this doesn't happen all the time, has few, if any, diagnosable features, isn't well-defined enough to have a proper workflow attached to it, and needs human engineers to correct it. This is an example of a complex, large-scale problem.
Read the article; it's actually quite interesting.
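The flow TFA describes for these small failures boils down to: detect, try a known remediation, and escalate to a human ticket if nothing applies or the fix fails. A rough sketch of that shape follows; every function here is an illustrative stub, not FBAR's code.

    # Hypothetical detect -> auto-remediate -> escalate flow. All of these
    # functions are made-up stubs, not anyone's production code.

    def restart_daemon(host):
        print("restarting daemon on %s" % host)            # would hit a host agent

    def drain_failed_disk(host):
        print("pulling bad disk out of service on %s" % host)

    def file_ticket(host, alarm, reason):
        print("ticket for %s: %s (%s)" % (host, alarm, reason))

    REMEDIATIONS = {
        "daemon_down": restart_daemon,
        "disk_failed": drain_failed_disk,
    }

    def handle_alarm(alarm, host):
        fix = REMEDIATIONS.get(alarm)
        if fix is None:
            # nothing automated matches: a human gets it immediately
            file_ticket(host, alarm, "no automated remediation defined")
            return
        try:
            fix(host)
        except Exception as exc:
            # the automation itself failed, so stop and escalate rather
            # than retrying forever
            file_ticket(host, alarm, str(exc))

    handle_alarm("disk_failed", "web1234")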
Re: (Score:2)
Now, if you recall what happened with AWS in April, they had a low-bandwidth management network that all of a sudden had all primary EBS API traffic shunted onto it. This was caused by a human flipping a network switch when they shouldn't have. Something like this doesn't happen all the time, has few, if any, diagnosable features, isn't well-defined enough to have a proper workflow attached to it, and needs human engineers to correct it. This is an example of a complex, large-scale problem.
I wonder when this army of automated-problem-fixing engines will encounter a corner case its masters never considered and how it will react.
I give the ops guys at Facebook a lot of credit for managing such a gigantic workload with just a (relatively) few very smart people. Amazon also has a lot of smart people who have been working on EBS (in one form or another) since before Facebook was founded. These systems just interact in unpredictable ways when they get out of their comfort zone.
Systems so complica
NOOOOOO!! (Score:5, Funny)
Re: (Score:2)
Re: (Score:2)
Re: (Score:2)
Fire and acid, my friend.
Re: (Score:1)
Assisted Suicide (Score:2)
I was thinking more in terms of "assisted suicide".
Re: (Score:2)
I thought of renaming it to Palliabook, but then look what I found at Wikipedia:
"Palliative care (from Latin palliare, to cloak) is a specialized area ..."
I guess Cloakbook would also be correct.
Re: (Score:2)
Re: (Score:2)
How are we supposed to kill it if it's self-healing? Now it will never die!
Wait until Microsoft buys it, then give it another year or two...
Re: (Score:1)
Maybe Facebook is really the Skynet that we learned about in the Terminator movies. I fear the day that it becomes self-aware.
Re: (Score:1)
How are we supposed to kill it if it's self-healing? Now it will never die!
Make sure the halon system is not computer controlled.
Kill the internet connections to all sites at the same time, so they can't send out an SOS.
Then kill the power.
Re: (Score:2)
"Today, the FBAR service is developed and maintained by two full time engineers, but according to the most recent metrics, itâ(TM)s doing the work of approximately 200 full time system administrators"
Which doesn't really tell anyone anything. Who expresses the amount of work done in terms of the number of full-time workers? In case anyone had failed to get the message, the above shows that such a metric isn't very useful. Perhaps the message here is that 2 really effective people can do the work of 200 not-so-effective ones.
Routing around the faulty components (Score:1)
We had to find an automated way to handle these sorts of issues so that the human engineers could focus on solving and preventing the larger, more complex outages.
Given how glitchy Facebook was in the past, I can't help but be reminded of this comic [smbc-comics.com].
Re: (Score:2)
And amusingly enough, the SMBC site is down now, so I can't reach your link.
Re: (Score:1)
Amazingly, SMBC has been down for hours now. I've never seen any site go down for so long.
Feature request. (Score:1)
Could they do the world a favour and write scripts to make it self-terminate instead?
Every generation wants to re-invent the wheel. (Score:2)
I was rolling out Big Brother Network Monitor a decade ago. It was well capable of doing this.
Today, I'd use an RDB that stored the output of Perl DBI cronjobs running on each machine, plus another job that checked the DB and made sure everything that ought to be running had reported in successfully and recently. Anything that hadn't would trigger an email to someone to look into it.
Easy to develop, implement, extend, and maintain.
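A sketch of the checker half of that scheme, assuming the per-host cronjobs each upsert one heartbeat row: sqlite3 stands in for whatever RDB you'd actually use, and the schema, timeout, and file name are all invented.

    # Hypothetical "has everything reported in recently?" job. Each box's
    # cronjob is assumed to upsert a row into the heartbeats table; this
    # job flags anything stale. Schema and threshold are invented.
    import sqlite3
    import time

    STALE_AFTER = 15 * 60   # seconds of silence before someone gets email

    def find_stale_hosts(db_path="monitor.db"):
        conn = sqlite3.connect(db_path)
        conn.execute("""CREATE TABLE IF NOT EXISTS heartbeats
                        (hostname TEXT PRIMARY KEY, ip TEXT,
                         reported_at REAL, status TEXT)""")
        cutoff = time.time() - STALE_AFTER
        rows = conn.execute(
            "SELECT hostname, reported_at FROM heartbeats WHERE reported_at < ?",
            (cutoff,)).fetchall()
        conn.close()
        return rows

    if __name__ == "__main__":
        for host, last_seen in find_stale_hosts():
            # in the scheme described above this would be an email instead
            print("ALERT: %s last reported at %s" % (host, time.ctime(last_seen)))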
No, I don't want to connect to FB just to read the article. Post it somewhere else if you want people to read it.
Re: (Score:1)
At which time the process will message: "Mission accomplished."
Re: (Score:1)
You'd re-invent Nagios, but worse?
Re: (Score:2)
Objections noted, but I'm unconvinced any are show-stoppers.
Writing into a shared database via cronjobs on different boxes has a few implications: ...
- the credentials do share write access to the database - if not per user account, then per permission. You usually don't give each host its own table to log into
Why? I would give each host its own table, or perhaps a small block of machines one table. This is hardly going to be a vast blob of data going back and forth here. Besides, it doesn't all have to go into one db, nor one db on one machine. Hell, it could be a db on each machine with exports scp'd to a central log server (or ten).
If a single box is misbehaving (e.g. its hostname "got lost"), you'll end up searching for that box forever - unless you also started logging the hostname and IP address, both retrieved via the SQL connection and not as cronjob output.
That makes no sense to me. No, I've never worked anywhere that had 30k hosts online, but simple documentation practices would cover that.
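For what it's worth, the reporter side of such a scheme can sidestep the "hostname got lost" worry by stamping each row with the hostname and address the box sees locally. Another invented sketch, matching the checker sketch above:

    # Hypothetical per-host reporter cronjob, matching the checker sketch
    # above. It records the hostname and IP it sees locally, so a box that
    # misbehaves is still identifiable from the row itself.
    import socket
    import sqlite3
    import time

    def report_in(db_path="monitor.db", status="ok"):
        hostname = socket.gethostname()
        try:
            ip = socket.gethostbyname(hostname)
        except socket.gaierror:
            ip = "unknown"              # record *something* even if DNS is broken
        conn = sqlite3.connect(db_path)
        conn.execute("""CREATE TABLE IF NOT EXISTS heartbeats
                        (hostname TEXT PRIMARY KEY, ip TEXT,
                         reported_at REAL, status TEXT)""")
        conn.execute("INSERT OR REPLACE INTO heartbeats VALUES (?, ?, ?, ?)",
                     (hostname, ip, time.time(), status))
        conn.commit()
        conn.close()

    if __name__ == "__main__":
        report_in()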
Sounds like a good place to work (Score:4, Interesting)
Damn, if I weren't so averse to soul-crushing rejection, I'd apply.
This guy was insightful and informative, so I believe what is quoted above.
And I'm surprised: I figured Facebook would be either more bureaucratic (like MS) or kinda dickishly autocratic (like Zuckerberg is rumoured to be).
Re: (Score:2)
If the site is often broken and randomly changing, this would probably be why. You do want people experimenting and finding fixes, but if you don't have any coordination going on that's just as bad.
Re: (Score:2)
But HEY! At least our employees feel like they are empowered and important, and we still get to have a foosball table in the conference room, right? I truly cannot take a company like that seriously.
Re: (Score:2)
And the T-shirts and the shoes interfere with the job exactly how? Suits (or just dress shirts) and wingtips do NOT increase efficiency one iota.
Re: (Score:2)
Re: (Score:1)
Re: (Score:2)
I'm sure Facebook, Google, and other companies where you're as likely to see a skateboard as a suit are crying into their corporate beers over whether you take them seriously. As for investment bankers, I do know someone who is pierced and tattooed and works for a Wall Street trading firm.
Of course, if we're going by dress, you really have to consider the position. The casual appearance you describe is the hallmark of the programmer... who in their right mind would hire a programmer in a suit? That'd be
Re: (Score:2)
I've seen what happens when a startup gets big, and I don't have good things to say about it.
Lack of bureaucracy is often code for the lunatics taking over and running the asylum... Think: no standards, no processes, no training for new hires (and there are, of course, lots of them), and nobody in charge of or enforcing anything. That kind of havoc is grea
Sounds like the definition of Facebook to me (Score:1)
I mean, when was the last time something on Facebook actually worked?
WOW (Score:1)
Auto-ticketed errors. I am amazed. If you did not detect sarcasm, please enter a problem ticket. You don't think that shit's automated, do you?
Upstart? (Score:1)
So this is basically a script that restarts dead daemons, right?
What's the difference between this and Upstart?
http://upstart.ubuntu.com/faq.html [ubuntu.com]
Re: (Score:1)
Re: (Score:3)
Did you even read the article? It talks about things like broken hard drives.
Google and Facebook can fail more freely (Score:2)
Part of the reason Facebook and Google can "self-heal" is that failures are mostly not noticeable by end users. If a Facebook or Google machine fails, then unless you are getting a 404 or a service failure message, there is little to no way for you to know that the web page you have been served is wrong, partial, or out of date. This failure ambiguity provides a lot of leeway in the methods and speed required to fix a failure.
For most other services where there is a definite correct and incorrect output - li
They do it very differently (Score:3)
From the sounds of this article, Facebook and Google go about this VERY differently.
The Facebook way, it seems, is that every node in the infrastructure is possibly important. So they write and maintain all these healing scripts to deal with problems like broken processes or failed hard drives.
Google goes about the same problem in a very different way. Google's system is architected such that no node is important. Everything is massively parallel and redundant - such that you could take and destroy any server and the system would keep running.
Self Healing, my foot (Score:1)
Re: (Score:2)
"Facebook devotes over 100 physical servers to every 35,000 users. That is incredibly inefficient"
Absolutely yes! If they only managed to serve 350 users per server, that, that would be a neat thing.