How the Leap Second Bug Led Facebook To Build DCIM Tools
miller60 writes "On July 1, 2012 the leap second time-handling bug caused many Linux servers to get stuck in a loop. Large data centers saw power usage spike, sometimes by megawatts. The resulting "server storm" prompted Facebook to develop new software for data center infrastructure management (DCIM) to manage its infrastructure, providing real-time data on everything from the servers to the generators. The incident also offered insights into the value of flexible power design in its server farmss, which kept the status updates flowing as the company nearly maxed out its power capacity."
DCIM (Score:5, Insightful)
Re:DCIM (Score:4, Insightful)
Re: (Score:1)
The point is, make up a different acronym than one which is used ubiquitously in almost every computer related field.
Yeah, that's pretty obvious when they stated, "I hate re-used acronyms." HOWEVER, DCIM referring to data center information management is NOT a new acronym/term/concept. It has been around since the dawn of data centers... which, arguably, predate the digital camera image standards. Thus, I would argue that associating the term DCIM with digital images confuses its initial usage related to data centers.
Re: (Score:2)
"Data Center Infrastructure Management (DCIM) is an emerging (2012) form of data center management which extends the more traditional systems and network management approaches to now include the physical and asset-level components. DCIM leverages the integration of information technology (IT) and facility management disciplines to centralize monitoring, management and intelligent capacity planning of a data center's critical systems. Essentially it provides a significantly more comprehensive view of ALL of the resources within the data center."
Data centers predate digital cameras. That particular business buzzword acronym, for that particular business buzzword phrase, does not. I envision some manager looking at a DCIM dashboard somewhere with gauge images and stuff. It just seems like pretty blatant namespace pollution, even in a different domain.
Re: (Score:3)
If domain controllers had DCIM, it'd be a trifecta!
Re: (Score:2)
they could just have called this dcm.
facebook does overlap in use of dcim and dcim. they got machines doing dcim analyzing which are controlled with dcim.. and this is a story about fb using dcim.
Re: (Score:1)
I always got so confused when I tried to get cash from Adobe Type Manager.
Re:DCIM (Score:4, Funny)
Re: (Score:1)
I hate re-used acronyms.
I assume you actually hate initialisms (unless, of course, you choose to pronounce "DCIM" as "dickim," in which case I can't help you)
Re: (Score:2)
Re: (Score:2)
Re: (Score:3)
26*26*26*26 = 456976
That's basically half a million four-letter combinations that companies are able to choose from, all nimbly-bimbly. Yet these assholes decide to use an existing term and mutilate the Wikipedia page that was around for four and a half years [wikipedia.org] because of their arrogance.
Re: (Score:1)
Re: (Score:2)
Re: (Score:1)
I hate re-used acronyms.
At work, I witnessed a heated argument about a "VMS". It took several minutes before they realized they weren't talking about the same thing!
Server farmss (Score:1)
Re: (Score:2)
They're still using Tolkien Ring networks.
Thankyouverymuch!
System QoS (Score:3, Insightful)
How often does the leap second bug recur? If it is known to occur, then why would such platforms be relied upon instead of patching it ahead of time?
It seems to me that developing new DCIM solutions is a bit of a stretch to solve the leap second issue. Or is that just an excuse to fund new DCIM solutions (in other words, a solution in search of a problem)?
Re: (Score:2)
I don't think the leap second bug recurring is the reason, but it revealed a gap in their management abilities. By actively monitoring everything, you can establish baselines, notice when something is out of the ordinary, and help pinpoint the exact cause. It's possible they thought they were patched up, and whoops... Now, with the new DCIM, they can more accurately tell when server XYZ in datafarm B is running at 100% CPU and drawing more power than necessary, and maybe even disconnect it from the network and s
Re: (Score:2)
How often does the leap second bug recur?
When there is a leap second. I think the last time this bug was covered on Slashdot, the article said it occurred on a sizeable number of servers in 2012, and on several in 2013.
If it is known to occur, then why would such platforms be relied upon instead of patching it ahead of time?
Are you seriously asking why not all systems are up to date? We are talking routers, mainframes, computers running legacy software, ...
You cannot just update everything. If you are a business, updates have to be tested to make sure the software still runs. And if it does not, or if your distribution is not releasing an update, you a
Re:What triggered the bug anyway? (Score:2)
When I heard last year that there may have been problems with the leap second, I checked the few Linux servers I take care of, and all seemed to be fine. They sync their time to NTP servers.
What was that problem, anyway? Or did it only affect some very busy servers? Or only in some very special circumstances? Last year's leap second wasn't anything really new either. There had been occasional leap seconds for many years. (But usually on Dec 31).
Re:What triggered the bug anyway? (Score:4, Interesting)
That was the one that caused Java processes to run away and use 100% CPU, wasn't it? From what I remember, it was only in a small subset of recent kernels, and older ones were fine.
Re:System QoS (Score:5, Informative)
That one? Once. I've seen plenty of different styles of leap second bug. Too many, in fact: leap seconds should be a relatively easy calculation, but we only get to test them once every 3 years or so, and in real time, because it's kinda hard to convince a global timekeeping system that a fake leap second is about to happen for testing. Still, I'd rather we fixed the software than do stupid things like get rid of UTC, as some idiots are proposing. But the one that causes a futex loop in Java processes (and the Opera web browser) happened just the once, and mostly only on RHEL6 and Debian ~wheezy kernels at the time.
The point of bugs is that they're not known to occur beforehand. This particular one was quite neat in that it wasn't the leap second code itself that was at fault, but the mechanism within the kernel that NTP used to inform it that a leap second was coming up. At least it didn't happen over the public holiday New Year period this time. I knew Monday was going to be a busy day in the datacentre when I saw my 3 laptops at home exhibit the problem on Sunday morning, though.
Anything can cause a kernel or userland software to suddenly enter a hard loop, burning through CPU cycles and thus power. And in a large homogeneous environment, that bug can be triggered in many locations all at the one exact moment in time. Another good example might be the RHEL6 bug that affected us around the same time last year: the old "uptime has reached a hundred and something days, let's overflow a counter and kernel PANIC now!" bug. We found out about that bug after patching all of our systems, found out that it only applied to the version of the patch we managed to apply, and had to start planning to bring the next patching cycle forward (but at least we knew about it). You'd think these were the kinds of bugs we learnt about in 1995 and would never be stupid enough to put back into the kernel, but it seems every generation must learn about them for itself instead of reading its Operating System text books.
The point of these bugs is that anything might cause a large fraction of your machines to start chewing through electricity. In an overprovisioned environment (VMs, power, thin storage, whatever), you want to know about them before you trip your fuses, run out of memory, or fill up all your disks.
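To make the "know about it before you trip your fuses" idea concrete, here is a minimal sketch of a baseline check on per-server power draw. Everything in it is an assumption for illustration only: the function name `find_power_anomalies`, the server names, the wattage readings, and the three-sigma threshold are hypothetical, and this is not a description of what Facebook's DCIM software actually does.

```python
from statistics import mean, stdev


def find_power_anomalies(baseline_watts, current_watts, threshold_sigmas=3.0):
    """Return the servers whose current draw is far above their own baseline.

    baseline_watts: dict mapping server name -> list of historical readings (W)
    current_watts:  dict mapping server name -> latest reading (W)
    A server is flagged when its latest reading exceeds
    mean(history) + threshold_sigmas * stdev(history).
    """
    anomalies = []
    for server, history in baseline_watts.items():
        # stdev needs at least two samples; skip servers with no fresh reading.
        if len(history) < 2 or server not in current_watts:
            continue
        limit = mean(history) + threshold_sigmas * stdev(history)
        if current_watts[server] > limit:
            anomalies.append(server)
    return sorted(anomalies)


# Hypothetical readings: web-002 has suddenly started spinning at 100% CPU.
baseline = {
    "web-001": [210, 205, 215, 208, 212],
    "web-002": [198, 202, 200, 199, 201],
}
current = {"web-001": 211, "web-002": 340}

print(find_power_anomalies(baseline, current))
```

In a homogeneous fleet the interesting signal is probably less any single server and more the fraction of servers flagged at the same moment, which is the "server storm" case described above.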
an adticle from Facebook, this time? (Score:2)
I don't get the point here? What is Facebook doing that's new for a datacentre?
Why does my camera... (Score:2)
... have Data Center Infrastructure Management? At least now I know what the name of that subfolder means. Is this another NSA thing, is the NSA or Facebook snarfing my photos right off the camera?
Leap Seconds Are Old News (Score:5, Informative)
Before 1972, "leaps" were fractions of a second; a UTC second (Coordinated Universal Time) did not have the same duration as a TAI second (TAI is the abbreviation, from the French, for International Atomic Time); and "leaps" occurred as often as four times a year. The current form of leap-seconds has been in effect since 1972. By then, software (mostly on mainframes) handled leap-seconds quite easily.
The reason for leap-seconds is that the earth's rotation is gradually slowing while many critical operations require precise time indicators. Thus, noon at Greenwich -- even average noon, which takes into account annual and semi-annual variations in the earth's rotation -- cannot be used. Instead, those critical operations use TAI. TAI is a uniform, never-varying time system while UTC is coordinated with noon at Greenwich. Since 1972, however, a UTC second has exactly the same duration as a TAI second; and a UTC clock ticks its seconds exactly at the same time as a TAI clock. If this continued indefinitely, noon on a UTC clock would gradually deviate from noon at Greenwich. Since 1972, if the deviation approaches a whole second, an extra second -- a leap-second -- is added to a UTC clock at the end of the last minute of either 30 June or 31 December.
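The bookkeeping described above can be sketched in a few lines. This is only an illustration under stated assumptions: the function name `tai_minus_utc` is hypothetical, and the table is deliberately partial, covering only the leap seconds inserted from 1999 through mid-2012 and starting from a TAI-UTC difference of 32 seconds on 1 January 1999.

```python
from datetime import date

# TAI - UTC, in seconds, as of 1 January 1999 (after the 31 December 1998 leap second).
OFFSET_AT_1999 = 32

# Days whose final UTC minute contained 61 seconds.
# Partial table for illustration: 1999 through mid-2012 only.
LEAP_SECOND_DAYS = [
    date(2005, 12, 31),
    date(2008, 12, 31),
    date(2012, 6, 30),
]


def tai_minus_utc(day):
    """Return TAI - UTC in whole seconds at 00:00 UTC on the given day."""
    if day < date(1999, 1, 1):
        raise ValueError("this partial table starts on 1 January 1999")
    # Each leap second that has already been inserted widens the gap by one.
    inserted = sum(1 for leap_day in LEAP_SECOND_DAYS if leap_day < day)
    return OFFSET_AT_1999 + inserted


print(tai_minus_utc(date(2005, 12, 31)))  # start of the day, before the leap second
print(tai_minus_utc(date(2006, 1, 1)))    # the day after
print(tai_minus_utc(date(2012, 7, 1)))    # the day after the 30 June 2012 leap second
```

The seconds themselves tick in step on both clocks; the only thing that changes at a leap second is this integer offset.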
All this became a problem in 2006. During the 7 years from 1 January 1999 until 1 January 2006, the slowing of the earth's rotation was so slight that there were no leap-seconds. Too many young software engineers and other technologists failed to learn about leap-seconds and thus ignored them (just as the Y2K issue was ignored until it was almost too late). A situation that was handled quite well in the 1970s, 1980s, and 1990s was no longer handled at all in new systems. But on 1 January 2006, there was indeed a leap-second. By then, many of those who were familiar with leap-seconds and how to handle them had retired (including me).
Re: (Score:3)
I understand some secure protocols need accurate global and difficult to forge time. But outside that? I mean so what if the time on a wall post is out by a second?
Re: (Score:2)
That's fine. Zuckerberg thinks it's a good thing to "break stuff" and uses that as a slogan.
facebook and digital cameras images? (Score:3)
The filesystem in a digital camera contains a DCIM (Digital Camera IMages) directory.
Can y'all stop re-using abbreviations, please.
Google did this with NTP "leap smear" (Score:1)
Time, technology and leaping seconds [blogspot.com]
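For anyone who doesn't want to click through: as I understand the linked post, the idea is to spread the extra second over a long window before the leap, so clients never see a 61-second minute. A rough sketch of a cosine-shaped smear follows. The function name `smear_offset` and the 20-hour window are assumptions made for illustration, not Google's actual configuration or code.

```python
import math


def smear_offset(seconds_into_window, window_seconds):
    """Fraction of the leap second already absorbed, from 0.0 up to 1.0.

    Uses a cosine ramp, (1 - cos(pi * t / w)) / 2, so the adjustment
    starts and ends gently instead of stepping by a whole second.
    """
    # Clamp so times before or after the window map to 0.0 or 1.0.
    t = min(max(seconds_into_window, 0.0), window_seconds)
    return (1.0 - math.cos(math.pi * t / window_seconds)) / 2.0


WINDOW = 20 * 3600  # hypothetical 20-hour smear window, in seconds

for hours in (0, 5, 10, 15, 20):
    offset = smear_offset(hours * 3600, WINDOW)
    print(f"{hours:2d} h into the window: {offset:.3f} s absorbed")
```

Servers that follow the smeared time never need the kernel's leap second handling at all, which is why this sidesteps the bug in the story.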