Cloudflare Says It's Automated Empathy To Avoid Fixing Flaky Hardware Too Often (theregister.com) 19
The Register: Cloudflare has revealed a little about how it maintains the millions of boxes it operates around the world -- including the concept of an "error budget" that enacts "empathy embedded in automation." In a Tuesday post titled "Autonomous hardware diagnostics and recovery at scale," the internet-taming biz explains that it built fault-tolerant infrastructure that can continue operating with "little to no impact" on its services. But as explained by infrastructure engineering tech lead Jet Mariscal and systems engineers Aakash Shah and Yilin Xiong, when servers did break, the Data Center Operations team relied on manual processes to identify dead boxes. And those processes could take "hours for a single server alone, and [could] easily consume an engineer's entire day."
Which does not work at hyperscale. Worse, dead servers would sometimes remain powered on, costing Cloudflare money without producing anything of value. Enter Phoenix -- a tool Cloudflare created to detect broken servers and automatically initiate workflows to get them fixed. Phoenix makes a "discovery run" every thirty minutes, during which it probes up to two datacenters known to house broken boxen. That pace of discovery means Phoenix can find dead machines across Cloudflare's network in no more than three days. If it spots machines already listed for repairs, it "takes care of ensuring that the Recovery phase is executed immediately."
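The cadence figures in the summary can be checked with a little arithmetic. The Python sketch below is illustrative only: the function and constant names are made up here, and it assumes every run visits two distinct datacenters, which the summary does not state. Under that assumption, a three-day ceiling corresponds to at most 288 datacenters with broken machines.

```python
from math import ceil

# Figures taken from the summary; everything else here is a hypothetical illustration.
RUN_INTERVAL_MINUTES = 30   # one discovery run every thirty minutes
DATACENTERS_PER_RUN = 2     # each run probes up to two datacenters


def runs_per_day(interval_minutes: int = RUN_INTERVAL_MINUTES) -> int:
    """Number of discovery runs that fit into 24 hours."""
    return (24 * 60) // interval_minutes


def max_datacenters_covered(days: int,
                            interval_minutes: int = RUN_INTERVAL_MINUTES,
                            per_run: int = DATACENTERS_PER_RUN) -> int:
    """Upper bound on datacenters probed in `days`, assuming no repeats."""
    return runs_per_day(interval_minutes) * per_run * days


def days_for_full_sweep(datacenter_count: int,
                        interval_minutes: int = RUN_INTERVAL_MINUTES,
                        per_run: int = DATACENTERS_PER_RUN) -> int:
    """Whole days needed to probe `datacenter_count` datacenters once each."""
    per_day = runs_per_day(interval_minutes) * per_run
    return ceil(datacenter_count / per_day)


print(runs_per_day())               # runs in one day
print(max_datacenters_covered(3))   # datacenters reachable in three days
print(days_for_full_sweep(288))     # days to sweep 288 datacenters
```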
sooo... (Score:2)
They finally built Nagios.
Re: (Score:2)
but with "artificial intelligence" and "deep tech", no doubt.
Re: (Score:3, Interesting)
read the article, better yet go to the source: https://blog.cloudflare.com/au... [cloudflare.com] The summary, as per /. standards, is shitty at best. They could have said, Harry Potter waves his wand and finds issues with servers and it would have been closer to reality than what you would get from the summary. They built a self-diagnosing system, saving the engineer a lot of time.
Re: (Score:2)
I've read it, and it's... fine. It's not out of the ordinary for large scale datacenters. I'm a little weirded out that they are talking as if they recently sorted this out, I would have figured they had something like this going for a long time.
Re: sooo... (Score:2)
It is NOT a big deal or shift to monitor services on servers to see if the server is capable of actually doing work. You have been able to do it with OSS tools for decades. Big Brother, anyone? Doesn't only ping, although it does that too ofc to see if the machine is even responding before doing other tests. If they weren't already doing this, they are grossly incompetent, period.
Hopefully the real story is that they came up with a better way to do it than what they were doing before. That seems likeliest.
Re: (Score:2)
There may be some things they had to be innovative about, but it didn't show in the writeup is all I'm saying. Perhaps in the process of 'blogifying' it they had to remove the novel stuff, but at the level of detail and what is described in the text itself, it's nothing that should be novel to folks dealing with at least thousands of servers.
Which again, might be fine to highlight, as a lot of those places treat the most mundane crap as "secret sauce" and it's nice to discuss a bit more openly, it's just o
Re: (Score:2)
and when there is a network issue / lag what can (Score:2)
and when there is a network issue / lag what can trip up?
also when say jay working the backhoe cuts some fiber lines and then boxes go into Recovery phases when they don't need to, or get into a Recovery loop due to links from one DC to another DC having issues.
Re: (Score:3)
also when say jay working the backhoe cuts some fiber lines
Since many cannot comment with real world experience at scale deployments, Hypothetically, fault probes are generally path aware when the path itself is not redundant. Hypothetically, a three letter data center would have multiple data lines that enter the building from hypothetically different directions with hypothetically independent routing into the area of the building, though, sometimes Telco will hypothetically lie to you about route paths. Which is why, hypothetically, an alarm would also be fault p
Re: (Score:3)
It's just over 20 years ago now, but I was involved in a case when that went wrong.
The main computers were around 9 miles away from the main offices (which were about to move) and there were two paths between the two. It was in the contract that the two paths were not permitted to use the same cables, unfortunately the two fiber lines were in the same trench. The one the backhoe cut through. That trench was even several miles away from the obvious route between the two sites.
The organisation also had two
Empathy? (Score:2)
Didn't that get abandoned six years ago?
That's called a watchdog (Score:2)
And it's as old as automated stuff that needs to be watched.
Anything new other than a new term? (Score:5, Informative)
My field of specialty is computer fault tolerance, and I've never heard of "empathy" used for fault tolerance. In fact, at least based on the linked articles, it's not even clear what "empathy" means. However, what the article describes about algorithms for probing systems and determining when to repair and when to give up sounds quite conventional. The only innovation that I can see is the invention of the term "empathy."
Re: (Score:2)
So they set a devops flag to reboot and reimage a server and mark it dead if that fails?
Maybe the script is empathy.py?
Maybe a summer intern was given the task and eight weeks?
I mean, good for them for doing it right?
"broken boxen"? (Score:2)
What year is this?
Re: (Score:2)
I don't know, but I really like those shaky-cam hyperactive skateboard videos.
Yawn (Score:2)
The latest buzzword (Score:2)
Who decided that machine asking machine "Are you there?" is empathy?
Some CloudFlare managers must be suffering buzzword withdrawal.