1 In 3 Data Center Servers Is a Zombie
dcblogs writes with these snippets from a ComputerWorld story about a study that says nearly a third of all data-center servers are comatose ("using energy but delivering no useful information"). What's remarkable is that this percentage hasn't changed since 2008, when a separate study showed the same thing. ... A server is considered comatose if it hasn't done anything for at least six months. The high number of such servers "is a massive indictment of how data centers are managed and operated," said Jonathan Koomey, a research fellow at Stanford University, who has done data center energy research for the U.S. Environmental Protection Agency. "It's not a technical issue as much as a management issue."
Money (Score:5, Insightful)
Re: (Score:1, Insightful)
Money (or lack of it) IS a management issue....
But how hard is it to automate a process that says, in effect, "if no data is going in or out of this server, shut it down"? I suspect that there is a more nefarious purpose here and I propose a corollary to Hanlon's (Heinlein's) Razor:
This is the 21st Century - "You have attributed conditions to villainy that simply result from villainy". Incompetence is for the proletariat - we're the NSA. You're toast.
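The automated check the parent describes could be sketched as a periodic job that samples interface byte counters and flags the box when almost no traffic moved between samples. This is a hypothetical sketch, not anyone's actual tooling; the interface name, threshold, and what "shut it down" should mean are all assumptions:

```python
# Hypothetical idle-server detector: sample rx/tx byte counters
# periodically and flag the host as idle when almost nothing moved
# between samples. Threshold and interface name are illustrative.

def traffic_delta(prev, cur):
    """Bytes moved (rx + tx) between two counter samples."""
    return (cur["rx"] - prev["rx"]) + (cur["tx"] - prev["tx"])

def is_idle(prev, cur, threshold_bytes=1_000_000):
    """True if fewer than `threshold_bytes` moved since the last sample.
    Counter wraps (negative deltas) are treated as not-idle."""
    delta = traffic_delta(prev, cur)
    return 0 <= delta < threshold_bytes

def read_counters(iface="eth0"):
    """Read {'rx': bytes, 'tx': bytes} for one interface from
    /proc/net/dev (Linux-specific field positions)."""
    with open("/proc/net/dev") as f:
        for line in f:
            name, _, rest = line.partition(":")
            if name.strip() == iface:
                fields = rest.split()
                return {"rx": int(fields[0]), "tx": int(fields[8])}
    raise LookupError(f"interface {iface!r} not found")
```

In practice you would persist the previous sample, run this from cron, and only alert (rather than power off) until a human confirms the box really is unowned — which is exactly where the failover-server objections below come in.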
Re:Money (Score:5, Insightful)
Why should the data center even care?
Most of them are essentially charging rent ... as long as the customer keeps paying, WTF do they care if you actually use them for anything?
This isn't incompetence on behalf of the data centers. Maybe it's incompetence on behalf of companies that have machines and have lost track of what they're for.
Re: (Score:3)
Depends. If you can virtualise, then you can over-provision. I'd love having multiple people pay rent for the same system.
Re: Money (Score:1)
You can't access your server?
Oh, we shut it down because it didn't receive any connections for a week.
Re: (Score:3)
Money (or lack of it) IS a management issue....
But how hard is it to automate a process that says, in effect, "if no data is going in or out of this server, shut it down"? I suspect that there is a more nefarious purpose here and I propose a corollary to Hanlon's (Heinlein's) Razor:
This is the 21st Century - "You have attributed conditions to villainy that simply result from villainy". Incompetence is for the proletariat - we're the NSA. You're toast.
If a customer is paying for it to be there and kept turned on, *maybe* that customer has some use for the server. Oh, I don't know, maybe it's a hot spare in case another server in another data center goes down? So you turn it off, their other server goes down, their service can't fail over, and now your customer has a problem.
Re: (Score:2)
Where I work, electricity is 0,25 EUR/kWh and a specialist in IT or law costs 1.000,- EUR/d or more.
Assume a server we're planning to shut down is rather old (they usually are), so it will probably fail on its own within 3 years, if not much sooner. It's not doing much anymore, so it's sitting idle, drawing only idle loads. Assuming the idle load of an old server is 100W, how much specialist's time can we allocate to shutting it down?
100W * 8760 h/y * 3y = 2.628 kWh. This will cost us about 657,- EUR or
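The arithmetic works out as follows (a sketch using 3 years, consistent with the failure assumption above, and the prices the parent quotes):

```python
# Back-of-the-envelope: energy cost of leaving an old server idling,
# expressed in specialist-days, using the parent's assumed figures.
idle_watts = 100
hours_per_year = 8760
years = 3                      # parent assumes the box dies within ~3 years
price_eur_per_kwh = 0.25

kwh = idle_watts * hours_per_year * years / 1000
cost_eur = kwh * price_eur_per_kwh

specialist_eur_per_day = 1000
days_of_specialist = cost_eur / specialist_eur_per_day

print(f"{kwh:.0f} kWh -> {cost_eur:.0f} EUR -> {days_of_specialist:.2f} specialist-days")
# -> 2628 kWh -> 657 EUR -> 0.66 specialist-days
```

So at these prices, shutting the box down only pays for itself if it takes well under a day of specialist time.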
Re: (Score:2)
You failed to account for the system admin time needed to keep the server patched and secure. Also, you assume that everyone is renting rack space and that it's infinite in supply.
These constraints mean that, in my experience, when a box is no longer doing anything useful it gets issued a shutdown command to save the power. At that point, if it really is required and a user somewhere starts shouting, I can power it back up in a couple of minutes.
Then generally six to 12 months later it gets removed from the rack becau
Re: (Score:2)
You failed to account for the system admin time needed to keep the server patched and secure. Also, you assume that everyone is renting rack space and that it's infinite in supply.
Either that, or it's running Linux, with zero licensing costs to stir management into action.
Re: (Score:3)
At this point, for almost all companies, good-quality colo space is effectively infinite. Most of the time a company isn't even using a meaningful fraction of its colo's space, so it could double or triple instantly without hassle, much less grow an extra 33%. And even if its colo doesn't have extra space, others directly connected to it do. So consider space infinite once you are willing to rent.
That being said, I have trouble believing the 1/3rd-of-servers figure from the article. That's not my experience at all.
Re: (Score:2)
Some people do not manage to remove servers over long periods.
You install three identical servers: one running the public-facing web server, one running the database server, connected by a separate, private network. The third one is available for the new version of the software to be installed, and then activated. Once the software is upgraded on all three, you keep it running as a hot standby. If reliable service to clients is not worth more than the cost o
Re: (Score:2)
It's a cheap energy problem. If energy were more expensive it would be worth it to take them offline.
Re:Zombies or fail over? (Score:5, Informative)
A failover server is not considered useless. They didn't just monitor server output and decide after a period of time that a server wasn't doing anything. You can infer this by reading the "paper": they switched these servers off after identifying them. Switching off failover servers would normally raise alarms, and then you get thrown out ;-) So you can safely assume that they mean unused servers.
Re: (Score:1)
A fail over server is not considered useless.
You'd better not move into management. Backups are expensive. Expensive is bad (unless it refers to something on management's expense account). Let's get with the program. As I mentioned earlier, money is management. And management needs the money. You don't.
Re:Zombies or fail over? (Score:5, Informative)
I've been in IT Management for 15+ years, and I can assure you it is a good thing you are not in management. I would lose my job in a heartbeat if a production server decided to take a dump and I had shut off all our fail-over servers.
It's not just a matter of what those fail-over servers cost. It's the question "Can we afford (financially) NOT to have fail-over servers?" If you stand to lose more from a production server failure than the cost of running a fail-over for a year, then you will not EVER wish to be caught without one.
Re: (Score:3)
I've been in IT Management for 15+ years, and I can assure you it is a good thing you are not in management. I would lose my job in a heartbeat if a production server decided to take a dump and I had shut off all our fail-over servers.
It's not just a matter of what those fail-over servers cost. It's the question "Can we afford (financially) NOT to have fail-over servers?" If you stand to lose more from a production server failure than the cost of running a fail-over for a year, then you will not EVER wish to be caught without one.
How is it a failover server if no data has traveled into or out of the machine in six months? Wouldn't you want to keep a failover server up to date (data and software updates) so you don't notice the failover? What good is a failover server if you have to load six months of data from tape? The machine could be off until you need it in that case.
Re: (Score:3)
Wrong; you don't understand how it's usually done these days.
It only needs the ability to access a SAN where replicated information from the primary server exists.
You will not see any data movement to the machine.
Re:Zombies or fail over? (Score:4, Informative)
Yes, but these researchers were ignoring traffic below a certain threshold.
Re: (Score:2)
No, in the big banks, it's the disk servers that do the mirroring themselves, not the application servers. Except for software updates and configuration changes, the application servers just sit idle at the backup site.
Re: (Score:2)
You're a particularly special kind of "stupid", aren't you?
The disk servers are mirroring to the backup disk servers, obviously. And I used the term "disk server" because there are several vendors and brands of products available that do the same job.
Re: (Score:2)
I can see some machines snoozing for long periods of time, but not 1/3 of a place:
1: Hypervisor-level failover on VMWare or Hyper-V. Generally there are hypervisor updates, such as the recent SSL holes which required an update on ESXi, and other security items on Hyper-V [1]. However, these can sit for a good while untouched, ready to handle a vMotion punt at a moment's notice.
2: Failure on an active/passive configuration at the DB level. With something like Oracle RAC that costs a lot for licensing
Re: (Score:2)
Grr, quick addendum on #2: I can see a firm just tossing the OS and application on a machine and walking off, but in general that isn't a good practice.
Re: (Score:2)
I've been in IT Management for 15+ years, and I can assure you it is a good thing you are not in management. I would lose my job in a heartbeat if a production server decided to take a dump and I had shut off all our fail-over servers.
It's not just a matter of what those fail-over servers cost. It's the question "Can we afford (financially) NOT to have fail-over servers?" If you stand to lose more from a production server failure than the cost of running a fail-over for a year, then you will not EVER wish to be caught without one.
I'm with you in general but it can be incredibly difficult to get an estimate from business intelligence on how much you actually stand to lose per hour of downtime.
Re: Zombies or fail over? (Score:2)
Well, if it is broken, fix it. In your case the system was intended to be a redundant server; however, it did not provide real redundancy.
Chaos Monkey by Netflix (Score:3)
I was under the impression that a fail-over server that does not occasionally handle traffic in periodic tests could not be trusted to handle traffic in a true failure situation. Netflix routinely conducts tests of its failover infrastructure, shutting down large blocks of its leased Amazon capacity [arstechnica.com] to make sure the rest of its capacity can keep up.
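The core of this kind of failure injection can be sketched as a loop step that terminates one randomly chosen instance and then lets you verify the rest recover. This is a hypothetical sketch, not Netflix's actual Chaos Monkey; the `terminate` hook stands in for whatever your platform provides:

```python
import random

def chaos_step(instances, terminate, rng=random):
    """Pick one running instance at random and terminate it.

    `instances` is a list of instance IDs; `terminate` is a callable
    supplied by the caller (an assumption here) that kills one.
    Returns the victim so the caller can verify failover worked.
    """
    if not instances:
        return None
    victim = rng.choice(instances)
    terminate(victim)
    return victim

# Example: record which instance would have been killed.
killed = []
victim = chaos_step(["web-1", "web-2", "web-3"], killed.append,
                    rng=random.Random(42))
assert victim in ("web-1", "web-2", "web-3") and killed == [victim]
```

The point of running this routinely, rather than only in a crisis, is exactly the parent's: a standby that never takes real traffic can't be trusted to take it during a real failure.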
Re: (Score:2)
take your competency and get out of this discussion.
Re: (Score:1)
No, you get to keep the rack/rack units/cage space once you have acquired it, as long as you pay your bill.
Re: (Score:2)
So you leave zombies on the wire, staking your claim. Then when you need the space, you swap it. Otherwise it's an endless wait for power, cooling, CAB, governance, and all sorts of fail.
No, you get to keep the rack/rack units/cage space once you have acquired it, as long as you pay your bill.
In a co-location environment, yes. In a standard business environment, the GP's response is true.
Of course, the article's definition of "useful" might not be a sysadmin's definition of "useful". Redundant machines, backup machines, extra capacity machines, dev machines, test machines, support machines, etc. all might be considered non-useful to customers, sales department, HR, or the CEO.
Yes, it's called redundancy (Score:1)
We need enough servers for peak load, not average load.
Re: (Score:2)
True, but in that case these machines do something at some point during the year. In a modern data center you would be able to shut down the servers not used for a longer period and restart them automatically when the load rises. A hardware server start may take ten minutes (if there is not much to synchronize), but as you should know your load profile and use load-estimation techniques, you can start the servers in advance. Especially in the context of replication of JVM and .Net components, this should be pretty ea
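A minimal sketch of the "start servers ahead of projected load" idea from the comment above. All numbers, and the assumption that you can project load far enough ahead to cover the ~10-minute boot time, are illustrative:

```python
def servers_needed(projected_load, capacity_per_server):
    """Smallest server count that covers the projected load
    (ceiling division; always keep at least one server up)."""
    return max(1, -(-projected_load // capacity_per_server))

def plan_power_on(current_online, projected_load, capacity_per_server=100):
    """How many extra servers to boot now so that capacity is ready
    when the projected load arrives (projection horizon must exceed
    the hardware boot time, e.g. ten minutes)."""
    needed = servers_needed(projected_load, capacity_per_server)
    return max(0, needed - current_online)

# With 3 servers online and 450 units of load projected, boot 2 more.
assert plan_power_on(current_online=3, projected_load=450) == 2
```

The same calculation run in reverse (shutting down the surplus once load falls) is what would turn these periodically-busy machines from "comatose" into merely parked.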
Re:Yes, it's called redundancy (Score:5, Informative)
In a modern data center you would be able to shut down the servers not used for a longer period and restart them automatically when the load rises.
Many businesses that rely on servers (i.e. all of them) will be running hot standby systems - ones that can automatically take load if there's a hardware failure or software problem.
One major (world-ranked) international company I consulted at was legally required to have 100% failover capacity - so it was inevitable that they would automatically have 50% of their production servers performing no functions - except for the twice a year when they were "flipped" just to make sure that each set of servers worked as expected.
Although the source paper does specify physical "zombie" servers, if you need failover VMs, the same basis is applied there, too.
Re: (Score:2)
One major (world-ranked) international company I consulted at was legally required to have 100% failover capacity - so it was inevitable that they would automatically have 50% of their production servers performing no functions - except for the twice a year when they were "flipped" just to make sure that each set of servers worked as expected.
Why flip them twice a year and not, say, weekly?
Re: (Score:1)
Because doing it right involves a full fail-over test including transferring loads or test loads, DNS auto-reconfiguration, and possibly even paying extra to bring up extra capacity elsewhere. You need to make sure it happens right when it's needed. Extra paperwork, overtime, it's all in there.
If the system is architected well, shouldn't all of those steps be automated... including monitoring and failover success/failure?
Re: (Score:2)
I can imagine that this wouldn't be perfectly smooth. It may be automated, but it may not be completely bumpless, and I don't think a company would be happy if users saw a "scheduled maintenance" sign for 15 minutes (or however long it takes) every week.
Re: (Score:1)
I fear you may be right, and that's exactly why they don't do it more often... but I think that also underscores my point a bit. Shouldn't they work to get it to the point where users won't be impacted?
Netflix does this pretty aggressively [pagerduty.com] and users don't seem to notice. Though I realize for most companies I am being very idealistic.
Re: (Score:2)
If the system is architected well, shouldn't all of those steps be automated... including monitoring and failover success/failure?
In a perfect world, with perfect systems documentation you'd be right. Unfortunately few of us have the pleasure of working in such an environment :)
Re: (Score:3)
A hardware server start may take ten minutes - if it actually comes up successfully. If you are starting a cluster in an emergency outage, you never know how many servers, power supplies and network switches kicked the bucket since you last used them. Plus, your DNS, NFS, db and other dependencies have to be unaffected by the outage and handle the added load of hundreds of servers starting at the same time. If you do a staggered restart of 100 servers in groups of 10, that's an hour and 40 minutes of outage
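The outage arithmetic in the comment above checks out; a quick sketch with the same numbers:

```python
def staggered_restart_minutes(servers, group_size, minutes_per_group):
    """Total wall-clock time to restart servers in sequential groups
    (ceiling division, since a partial group still takes a full slot)."""
    groups = -(-servers // group_size)
    return groups * minutes_per_group

# 100 servers, restarted 10 at a time, ~10 minutes per group:
total = staggered_restart_minutes(servers=100, group_size=10,
                                  minutes_per_group=10)
print(f"{total // 60}h {total % 60}min")  # -> 1h 40min
```

And that assumes every group comes up cleanly on the first try, which, as the parent notes, is exactly what you can't count on with hardware that has sat cold.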
Re: Yes, it's called redundancy (Score:2)
You are absolutely right. If these servers provide failover, then they must be present. The server start/stop thing only applies to load management, e.g. for web shops. As I stated earlier (maybe it was in another post), failover servers, like all redundancy-related infrastructure, are not useless. They serve a purpose. Therefore they cannot be stopped without getting into trouble.
Re: (Score:2)
Some servers (IBMs, HP ProLiants) have decent power management capabilities, so the boxes can stay on and be idle... but consume a relatively small amount of electricity and cooling. Add a SSD for local storage and swap (start the OS or hypervisor and let the SAN take it from there), and even the energy usage of spinning disks can be minimized.
However, with the many ways and layers to do HA, might as well do active/active if possible. On the VMWare side of the house, DRS comes to mind, and it also support
Re: (Score:2)
In our case, about 20% of our servers are outdated and not kept well maintained: they used to host some important service, but a replacement was built and that service was migrated, and nobody's 100% sure whether any other latent, less important services are still running on those machines. So they stay on, because everybody has more important things to do than find out what else is running on there, and, perhaps more importantly, nobody wants to be the guy who shuts down the server that's still runnin
Sounds about right. (Score:2, Insightful)
One in three people consumes energy and produces nothing interesting.
Re: Sounds about right. (Score:2, Funny)
Like this comment.
Crap, now it's 2 out of 3.
Re: (Score:2)
3 for 3.
One for all, and all for one!
Why is this article (in general) ruffling so many feathers? Because it is a thinly disguised Malthusian energy hit piece aimed squarely at the center of IT's most sacred golden calf: the cloud-server industry. The reason the assumptions made in this study confuse so many (as in: why are we even on this page? Isn't an overall one-third quiescent portion a sign of a properly engineered critical system?) is that it was not motivated by intelligent resource
Re: (Score:2)
"Decease" usually implies that the subject didn't survive.
Bad Title (Score:5, Informative)
Re: (Score:2)
made you click..
3 page "paper" not all that insightful. (Score:1)
Apparently, the researchers have never heard of business continuity planning. If your primary data center gets knocked offline because your company located it in a hurricane-prone area of the country to take advantage of state tax breaks and a cheaper labor force (happens all the time), then you're gonna need another site where you can switch over your data/voice traffic instantly when the inevitable hurricane hits. That means maintaining a certain amount of redundant equipment at the failover site that
Re: (Score:2)
Moreover, the power draw of an idle server can be a third or less of its draw under normal load.
Besides failover, there are "swing" servers to which virtual machines or services are migrated while upgrades are done elsewhere. There are "staging" servers that become busy while new software is being rolled out but might otherwise be idle for months.
The twats who wrote this paper obviously aren't in the business.
Obviously (Score:5, Insightful)
Those are the servers hosting Slashdot's new "share" button. No one's ever clicked on it.
They are not consuming 30% of power (Score:4, Insightful)
Modern systems are good at reducing power consumption when idle. It's quite reasonable to have 30% of capacity as spares, reserve for unexpected load, capacity for new apps and so on. They probably consume 3% of the power and nobody is motivated enough to look for more savings. Keeping things completely off is problematic, because you never know how much of the hardware and software will come up in time to handle an emergency unless you run and test it all the time.
There is certainly room for further environmental/financial improvement, but the 30% figure is sensationalized.
Re: (Score:3)
Maybe. But on the other hand, even active servers spend a lot of their time idle (the paper says server utilization "rarely exceeds 6%"), and I bet a lot of these "comatose" servers are actually long-forgotten old hardware, or machines that nobody can be bothered to decommission -- it's possible that on average they're older than active servers and thus eating a lot more power.
What are their metrics for being a zombie? (Score:2)
How do they judge whether or not a server is contributing useful information? I have two personal VPSs out there that do almost nothing on the public internet. They mostly act as a place where I can store data as a form of backup, but also as a place I can access when I need to test programs, get a really fast download, etc. Most of the time, though, these VPSs just act as central nodes in my private VPN. So, by their definition, are my servers among the 1/3 "zombie" servers? I pay the rent, so to speak, so I'm
"massive indictment" (Score:2)
... of purple prose.
The mere existence of servers on standby is not a problem, let alone a "massive" one.
Pretty close to... (Score:2)
Bad terminology (Score:5, Insightful)
Unfortunate confusion of terminology. "Zombie computers" is a term also used to mean machines taken over by botnets.
Re: (Score:1)
Yes, that's exactly what I thought when I read the title. And I sense it was on purpose. Why, otherwise, use "comatose" everywhere else but the title?
So, an average 1.33 safety factor? (Score:3)
Re: (Score:2)
Try making stuff that goes on ships
Try making stuff that goes on aircraft...
Aircraft and spaceships have weight restrictions. Weight (and often volume) on a ship are not important. So you can literally have 2 of every system in many cases.
A story (Score:2)
Back when I was a sysadmin for a government department I had been assigned a couple of chassis of HP blades that were bought in one of the famous fiscal year end splurges. For the most part I had no use for them and I didn't even install Linux on them. I think I only ever used a couple of blades and I hated them. It was the first generation and they ran very hot and we had lots of issues with bad RAM. The other three chassis on the rack belonged to the VMWare team and were in heavy use.
Since I had no n
Isn't this just on demand processing? (Score:2)
Regulations and data retention (Score:2)
I know the industry I'm in, we have regulations which require 3+ years of data retention which "isn't providing anything useful" until it is. If we have a legal "issue" then that will extend until the legal issue goes away and the judge says we can destroy data. While we can use archive methods, sometimes the live system is really what is needed to retrieve data. It's better to just keep disks spinning than shut them down and hope they spin back up.
IT has a long tail where I work. Things are planning to
Clickbait (Score:1)
Re: (Score:2)
Sometimes the hosting company doesn't keep track of it either. I was recently involved in a datacenter decommission (a 20-year-old facility), and at the end of the day a shit ton of hardware was still happily humming along; nobody claimed it, nor did anyone keep track of whose it was (most likely they once did, but moved asset-tracking platforms, which missed certain things).
About a decade ago another company I worked for had a similar thing where we expanded the datacenter and started keeping accurate track of new assets. Again
Causes of hoarding. (Score:2)
From personal experience, the bureaucracy of our org makes procurement of servers so difficult that section managers tend to hoard them when they get them.
I'm hoping virtualization will improve this situation, but something tells me it will only create different problems. The bureaucratic culture usually invents new ways to foul up new tools.
Re: (Score:2)
One way to handle that is to not own your infrastructure and just rent month to month from a vendor who provides a pool of servers. What you are likely facing is the problem of trying to keep administrative cost from going above X% by preventing IT administrative cost from going above Y%, by slowing down acquisitions... Better yet, just guarantee Y and save the labor.
Re: (Score:1)
For security reasons, the org in question wants mostly internal servers. But if they ran it kind of like a vendor, it might work, in that each section has to pay for any server instance it's using through the budgeting process. But the org in question would probably bungle that too.
Re: (Score:2)
The Department of Defense runs servers out of house. Lockheed Martin runs a cloud provider. Many of the country's banks handle it. There is no question you can buy better security than any company has internally.
As for running an internal cloud, that's pretty easy, and they could ask a vendor to run the financial side of it while keeping all the servers physically on their premises.
turn them into mail servers ... (Score:3)
Yahoo in Northern VA (Score:2)
There's a rumor that when Yahoo expanded its Lockport "chicken coop" data centers in upstate NY, it vacated at least two large data centers in Northern VA; because the lease isn't up for another two years, they have sat mostly empty ever since.
Yet Yahoo is still saving lots of money by doing this.
Sorry, but that sounds like ignorance (Score:2)
You do not spec for "average" usage; you spec for *max*. You also have to spec for how many machines (when we're talking about thousands, or tens of thousands of servers) are going to fail today, to be picked up by the "zombie" machines that are, in fact, hot spares.
And then there's the Big Events, like the shooting in Charleston, or when the SCOTUS announces about gay marriage or the ACA - how many of those "zombie" machines are going to go live to help carry the traffic load?
Before IPMI... (Score:1)
Re: (Score:2)