Facebook Bug

How the Leap Second Bug Led Facebook To Build DCIM Tools

miller60 writes "On July 1, 2012 the leap second time-handling bug caused many Linux servers to get stuck in a loop. Large data centers saw power usage spike, sometimes by megawatts. The resulting "server storm" prompted Facebook to develop new software for data center infrastructure management (DCIM), providing real-time data on everything from the servers to the generators. The incident also offered insights into the value of flexible power design in its server farms, which kept the status updates flowing as the company nearly maxed out its power capacity."
This discussion has been archived. No new comments can be posted.

  • Re:System QoS (Score:5, Informative)

    by tconnors ( 91126 ) on Tuesday August 06, 2013 @11:11PM (#44493807) Homepage Journal

    How often does the leap second bug recur?

    That one? Once. Seen plenty of different style leap second bugs (too many - leap seconds should be a relatively easy calculation, but we only get to test them once every 3 years or so, and in real time because it's kinda hard to convince a global time keeping system that a fake leap second is about to happen for testing. Still, I'd rather we fixed the software than do stupid things like get rid of UTC like some idiots are proposing), but one that causes a futex loop in java processes (and the opera web browser) just the once, and mostly only on RHEL6 and debian ~wheezy kernels at the time.

    If it is known to occur, then why would such platforms be relied upon instead of patching it ahead of time?

    The point of bugs is that they're not known to occur beforehand. This particular one was quite neat in that it wasn't the leap second code itself that was at fault, but the mechanism ntp used within the kernel to inform the kernel that a leap second was coming up. At least it didn't happen over the public holiday New Year period this time. I knew Monday was going to be a busy day in the datacentre when I saw my 3 laptops at home exhibit the problem on Sunday morning though.

    It seems to me that developing new DCIM solutions is a bit of a stretch to solve the leap second issue. Or is that just an excuse to fund new DCIM solutions (in other words, a solution in search of a problem)?

    Anything can cause a kernel or userland software to suddenly enter a hard loop burning through CPU cycles and thus power. And in a large homogeneous environment, that bug can be triggered in many locations all at the one exact moment in time. Another good example might be the RHEL6 bug that affected us around the same time last year - the old "uptime has reached a hundred and something days, let's overflow a counter and kernel PANIC now!" bug. We found out about that bug after patching all of our systems, found out that it only applied to the version of the patch we managed to apply, and had to start planning to bring the next patching cycle forward (but at least we knew about it). You'd think these were the kinds of bugs that we learnt about in 1995 and that we were never stupid enough to put back into the kernel, but it seems every generation must learn about it for themselves instead of reading their Operating System text books.

    The point of these bugs is that anything might cause a large fraction of your machines to start chewing through electricity. In an overprovisioned environment (VMs, power, thin storage, whatever), you want to know about them before you trip your fuses, run out of memory, or fill up all your disks.
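
    As a rough illustration of "knowing before you trip your fuses", here is a minimal sketch of a headroom check over per-rack power readings. Everything in it is hypothetical: the function name, the rack labels, the 90% warning threshold and the capacity figure are assumptions made for the example, not anything Facebook's DCIM tooling is known to do.

```python
def power_headroom(readings_watts, capacity_watts, warn_fraction=0.9):
    """Summarise total draw against provisioned capacity.

    readings_watts: dict mapping a (hypothetical) rack label to its current draw in watts.
    capacity_watts: provisioned capacity for the whole group, in watts.
    warn_fraction:  assumed alert threshold as a fraction of capacity.
    """
    if capacity_watts <= 0:
        raise ValueError("capacity_watts must be positive")
    total = sum(readings_watts.values())
    return {
        "total_watts": total,
        "utilization": total / capacity_watts,
        # Alert once the combined draw reaches the warning fraction of capacity.
        "alert": total >= warn_fraction * capacity_watts,
    }


if __name__ == "__main__":
    # Two racks that suddenly start spinning at full CPU (made-up numbers).
    print(power_headroom({"rack1": 4000, "rack2": 5500}, capacity_watts=10000))
```

    The point of summing across racks rather than alerting per machine is the one made above: in a homogeneous fleet the same bug fires everywhere at once, so it is the aggregate that runs into the provisioned limit.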

  • by DERoss ( 1919496 ) on Tuesday August 06, 2013 @11:29PM (#44493901)

    Before 1972, "leaps" were fractions of a second; a UTC second (Coordinated Universal Time) did not have the same duration as a TAI second (TAI being the French acronym for International Atomic Time); and "leaps" occurred as often as four times a year. The current form of leap-seconds has been in effect since 1972. By then, software (mostly on mainframes) handled leap-seconds quite easily.

    The reason for leap-seconds is that the earth's rotation is gradually slowing while many critical operations require precise time indicators. Thus, noon at Greenwich -- even average noon, which takes into account annual and semi-annual variations in the earth's rotation -- cannot be used. Instead, those critical operations use TAI. TAI is a uniform, never-varying time system while UTC is coordinated with noon at Greenwich. Since 1972, however, a UTC second has exactly the same duration as a TAI second; and a UTC clock ticks its seconds exactly at the same time as a TAI clock. If this continued indefinitely, noon on a UTC clock would gradually deviate from noon at Greenwich. Since 1972, if the deviation approaches a whole second, an extra second -- a leap-second -- is added to a UTC clock at the end of the last minute of either 30 June or 31 December.
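
    The bookkeeping described above amounts to a lookup table: each leap-second widens the whole-second gap between TAI and UTC by one. A minimal sketch follows, assuming a hand-entered table that covers only 1 January 1999 through the date of this thread (TAI-UTC of 32 s, 33 s, 34 s and 35 s); earlier and later entries are deliberately left out, and the function name is made up for the example.

```python
from bisect import bisect_right
from datetime import datetime, timezone

UTC = timezone.utc

# UTC instants from which each TAI-UTC offset applies (partial table, see lead-in).
_EFFECTIVE = [
    datetime(1999, 1, 1, tzinfo=UTC),
    datetime(2006, 1, 1, tzinfo=UTC),
    datetime(2009, 1, 1, tzinfo=UTC),
    datetime(2012, 7, 1, tzinfo=UTC),
]
_OFFSETS = [32, 33, 34, 35]  # whole seconds, one more after each leap-second


def tai_minus_utc(when):
    """Return TAI-UTC in seconds for a timezone-aware datetime covered by the table."""
    if when.tzinfo is None:
        raise ValueError("need a timezone-aware datetime")
    i = bisect_right(_EFFECTIVE, when) - 1  # last entry at or before `when`
    if i < 0:
        raise ValueError("date is before the first entry in this partial table")
    return _OFFSETS[i]


if __name__ == "__main__":
    before = datetime(2012, 6, 30, 23, 59, 59, tzinfo=UTC)
    after = datetime(2012, 7, 1, 0, 0, 0, tzinfo=UTC)
    print(tai_minus_utc(before), tai_minus_utc(after))
```

    Note that the inserted second itself (23:59:60) cannot be expressed as a Python datetime at all, which is a small example of how easily software ends up ignoring it.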

    All this became a problem in 2006. During the 7 years from 1 January 1999 until 1 January 2006, the slowing of the earth's rotation was so slight that there were no leap-seconds. Too many young software engineers and other technologists failed to learn about leap-seconds and thus ignored them (just as the Y2K issue was ignored until it was almost too late). A situation that was handled quite well in the 1970s, 1980s, and 1990s was no longer handled at all in new systems. But on 1 January 2006, there was indeed a leap-second. By then, many of those who were familiar with leap-seconds and how to handle them had retired (including me).
