How the Leap Second Bug Led Facebook To Build DCIM Tools

Posted by Soulskill
from the spacetime-creates-the-best-bugs dept.
miller60 writes "On July 1, 2012 the leap second time-handling bug caused many Linux servers to get stuck in a loop. Large data centers saw power usage spike, sometimes by megawatts. The resulting "server storm" prompted Facebook to develop new software for data center infrastructure management (DCIM), providing real-time data on everything from the servers to the generators. The incident also offered insights into the value of flexible power design in its server farms, which kept the status updates flowing as the company nearly maxed out its power capacity."

  • DCIM (Score:5, Insightful)

    by AK Marc (707885) on Tuesday August 06, 2013 @06:46PM (#44492089)
    My digital camera already has DCIM tools (as does the computer I plug it into). I hate re-used acronyms.
    • by Anonymous Coward

      I always got so confused when I tried to get cash from Adobe Type Manager.

    • by Anonymous Coward

      I hate re-used acronyms.

      I assume you actually hate initialisms (unless, of course, you choose to pronounce "DCIM" as "dickim," in which case I can't help you)

    • by Culture20 (968837)
      Not me, I used to use a KVM to switch between my two KVM hosts.
    • by mewsenews (251487)

      26*26*26*26 = 456976

      That's basically half a million four-letter combinations that companies are able to choose from, all nimbly-bimbly. Yet these assholes decide to use an existing term and mutilate the Wikipedia page that was around for four and a half years [wikipedia.org] because of their arrogance.

      • by oodaloop (1229816)
        Yeah, it's hard to imagine why they didn't use ZZXQ, KKKZ, or OOOO. Those are perfectly good acronyms!
        • by AK Marc (707885)
          Sounds like airport codes for new airports. All the "good" ones are taken, so the small and newer ones get letters that don't match the human name for it. George Bush Intercontinental Airport: IAH (likely a holdover from Intercontinental/International Airport of Houston, which was never its name)
    • by GuB-42 (2483988)

      I hate re-used acronyms.

      At work, I witnessed a heated argument about a "VMS". It took several minutes before they realized they weren't talking about the same thing!

  • by Anonymous Coward
    Managed by Gollumses.
  • System QoS (Score:3, Insightful)

    by atom1c (2868995) on Tuesday August 06, 2013 @06:47PM (#44492103)

    How often does the leap second bug recur? If it is known to occur, then why would such platforms be relied upon instead of patching it ahead of time?

    It seems to me that developing new DCIM solutions is a bit of a stretch to solve the leap second issue. Or is that just an excuse to fund new DCIM solutions (in other words, a solution in search of a problem)?

    • I don't think the leap second bug recurring is the reason, but it revealed a gap in their management abilities. By actively monitoring everything, you can have baselines, notice when something is out of the ordinary, and help pinpoint the exact cause. It's possible they thought they were patched up, and whoops... Now with the new DCIM they can more accurately tell when server XYZ in datafarm B is running at 100% CPU and drawing more power than necessary, and maybe even disconnect it from the network and s

    • How often does the leap second bug recur?

      When there is a leap second. I think the last time this bug was covered on Slashdot, the article said it occurred on a sizeable number of servers in 2012, and on several in 2013.

      If it is known to occur, then why would such platforms be relied upon instead of patching it ahead of time?

      Are you seriously asking why not all systems are up to date? We are talking routers, mainframes, computers running legacy software, ...
      You cannot just update everything. If you are a business, updates have to be tested to make sure the software still runs. And if it does not, or if your distribution is not releasing an update, you a

    • When I heard last year that there may have been problems with the leap second, I checked the few Linux servers I take care of, and all seemed to be fine. They sync their time to NTP servers.

      What was that problem, anyway? Or did it only affect some very busy servers? Or only in some very special circumstances? Last year's leap second wasn't anything really new either. There had been occasional leap seconds for many years. (But usually on Dec 31).

    • Re:System QoS (Score:5, Informative)

      by tconnors (91126) on Tuesday August 06, 2013 @11:11PM (#44493807) Homepage Journal

      How often does the leap second bug recur?

      That one? Once. Seen plenty of different style leap second bugs (too many - leap seconds should be a relatively easy calculation, but we only get to test them once every 3 years or so, and in real time, because it's kinda hard to convince a global time keeping system that a fake leap second is about to happen for testing. Still, I'd rather we fixed the software than do stupid things like get rid of UTC like some idiots are proposing), but one that causes a futex loop in Java processes (and the Opera web browser)? Just the once, and mostly only on RHEL6 and Debian ~wheezy kernels at the time.

      If it is known to occur, then why would such platforms be relied upon instead of patching it ahead of time?

      The point of bugs is that they're not known to occur beforehand. This particular one was quite neat in that it wasn't the leap second code itself that was at fault, but the mechanism ntp used within the kernel to inform the kernel that a leap second was coming up. At least it didn't happen over the public holiday New Year period this time. I knew Monday was going to be a busy day in the datacentre when I saw my 3 laptops at home exhibit the problem on Sunday morning though.

      It seems to me that developing new DCIM solutions is a bit of a stretch to solve the leap second issue. Or is that just an excuse to fund new DCIM solutions (in other words, a solution in search of a problem)?

      Anything can cause a kernel or userland software to suddenly enter a hard loop burning through CPU cycles and thus power. And in a large homogeneous environment, that bug can be triggered in many locations all at the one exact moment in time. Another good example might be the RHEL6 bug that affected us around the same time last year - the old "uptime has reached a hundred and something days, let's overflow a counter and kernel PANIC now!" bug. We found out about that bug after patching all of our systems, found out that it only applied to the version of the patch we managed to apply, and had to start planning to bring the next patching cycle forward (but at least we knew about it). You'd think these were the kinds of bugs that we learnt about in 1995 and would never be stupid enough to put back into the kernel, but it seems every generation must learn about them for themselves instead of reading their Operating System textbooks.

      The point of these bugs is that anything might cause a large fraction of your machines to start chewing through electricity. In an overprovisioned environment (VMs, power, thin storage, whatever), you want to know about them before you trip your fuses, run out of memory, or fill up all your disks.
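A minimal sketch of the kind of check described above, assuming a hypothetical feed of facility-wide power readings in kilowatts. The function names, the 3-sigma threshold, and the sample numbers are illustrative assumptions, not anything Facebook or the commenter describes:

```python
from statistics import mean, stdev

def is_anomalous(history_kw, current_kw, k=3.0):
    """Return True if the current reading sits more than k sample
    standard deviations above the mean of the baseline readings."""
    baseline_mean = mean(history_kw)
    baseline_spread = stdev(history_kw)
    return current_kw > baseline_mean + k * baseline_spread

def headroom_fraction(current_kw, capacity_kw):
    """Fraction of provisioned power capacity still unused."""
    return 1.0 - current_kw / capacity_kw

# Hypothetical baseline: total draw hovering around 100 kW.
history = [100, 102, 98, 101, 99]
print(is_anomalous(history, 120))   # a sudden jump well above baseline
print(is_anomalous(history, 103))   # within normal variation
print(headroom_fraction(75, 100))   # share of capacity left
```

The point of keeping a baseline is that the alarm fires on a change in behaviour (many machines spinning at once) rather than only when the absolute limit is already close.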

  • I don't get the point here? What is Facebook doing that's new for a datacentre?

  • ... have Data Center Infrastructure Management? At least now I know what the name of that subfolder means. Is this another NSA thing, is the NSA or Facebook snarfing my photos right off the camera?

  • by DERoss (1919496) on Tuesday August 06, 2013 @11:29PM (#44493901)

    Before 1972, "leaps" were fractions of a second; a UTC second (Coordinated Universal Time) did not have the same duration as a TAI second (the French acronym for International Atomic Time); and "leaps" occurred as often as four times a year. The current form of leap-seconds has been in effect since 1972. By then, software (mostly on mainframes) handled leap-seconds quite easily.

    The reason for leap-seconds is that the earth's rotation is gradually slowing while many critical operations require precise time indicators. Thus, noon at Greenwich -- even average noon, which takes into account annual and semi-annual variations in the earth's rotation -- cannot be used. Instead, those critical operations use TAI. TAI is a uniform, never-varying time system while UTC is coordinated with noon at Greenwich. Since 1972, however, a UTC second has exactly the same duration as a TAI second; and a UTC clock ticks its seconds exactly at the same time as a TAI clock. If this continued indefinitely, noon on a UTC clock would gradually deviate from noon at Greenwich. Since 1972, if the deviation approaches a whole second, an extra second -- a leap-second -- is added to a UTC clock at the end of the last minute of either 30 June or 31 December.

    All this became a problem in 2006. During the 7 years from 1 January 1999 until 1 January 2006, the slowing of the earth's rotation was so slight that there were no leap-seconds. Too many young software engineers and other technologists failed to learn about leap-seconds and thus ignored them (just as the Y2K issue was ignored until it was almost too late). A situation that was handled quite well in the 1970s, 1980s, and 1990s was no longer handled at all in new systems. But on 1 January 2006, there was indeed a leap-second. By then, many of those who were familiar with leap-seconds and how to handle them had retired (including me).
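The relationship between TAI and UTC described above can be sketched as a lookup table of whole-second offsets. The table below is deliberately partial (only the entries from 1999 through 2012, which show the seven-year gap mentioned in the comment), and the function name is hypothetical; a real system would load the full, current list from an authoritative source such as the IERS bulletins.

```python
from bisect import bisect_right
from datetime import date

# (UTC date on which the offset takes effect, TAI - UTC in whole seconds).
# Partial table for illustration only: it starts at 1999-01-01 and
# contains no leap seconds announced after the 2012 one.
LEAP_TABLE = [
    (date(1999, 1, 1), 32),
    (date(2006, 1, 1), 33),
    (date(2009, 1, 1), 34),
    (date(2012, 7, 1), 35),
]

def tai_minus_utc(day: date) -> int:
    """Return TAI - UTC in seconds for a UTC date covered by the table."""
    effective_dates = [d for d, _ in LEAP_TABLE]
    idx = bisect_right(effective_dates, day) - 1
    if idx < 0:
        raise ValueError("date precedes this partial table")
    return LEAP_TABLE[idx][1]

print(tai_minus_utc(date(2005, 6, 1)))   # inside the gap with no leap seconds
print(tai_minus_utc(date(2012, 7, 1)))   # right after the 2012 leap second
```

Each entry raises the offset by exactly one second, which is the whole of what a leap second does: the seconds themselves stay the same length and only the UTC label is held back.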

    • by delt0r (999393)
      What I don't get is what really breaks with a one-second error or jump. I often just do the ntpdate thing and my clocks are shifted a lot more than a second. As long as I am not compiling, I haven't had any issues.

      I understand some secure protocols need accurate global and difficult to forge time. But outside that? I mean so what if the time on a wall post is out by a second?
    • by mattack2 (1165421)

      Too many young software engineers and other technologists failed to learn about leap-seconds and thus ignored them

      That's fine. Zuckerberg thinks it's a good thing to "break stuff" and uses that as a slogan.

  • by MadMaverick9 (1470565) on Wednesday August 07, 2013 @07:09AM (#44495629)

    The filesystem in a digital camera contains a DCIM (Digital Camera IMages) directory.

    Can y'all stop re-using abbreviations, please.

  • by Anonymous Coward

    Time, technology and leaping seconds [blogspot.com]

    The solution we came up with came to be known as the "leap smear." We modified our internal NTP servers to gradually add a couple of milliseconds to every update, varying over a time window before the moment when the leap second actually happens. This meant that when it became time to add an extra second at midnight, our clocks had already taken this into account, by skewing the time over the course of the day. All of our servers were then able to continue as normal with
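The idea in the quoted passage can be sketched as a function that maps the time remaining until the leap second to how much of the extra second has already been absorbed. This uses a plain linear ramp over a 24-hour window as a simplifying assumption; the quoted text does not give the actual formula, and the function and parameter names here are hypothetical.

```python
def smear_offset(seconds_until_leap: float, window: float = 86400.0) -> float:
    """Return how much of the one-second leap (in seconds) has been
    absorbed so far, ramping linearly from 0 to 1 across the window."""
    if seconds_until_leap >= window:
        return 0.0          # smear has not started yet
    if seconds_until_leap <= 0:
        return 1.0          # the whole second has been absorbed
    return 1.0 - seconds_until_leap / window

# Halfway through the window, half of the extra second is accounted for.
print(smear_offset(43200.0))
```

Because the adjustment is spread over the whole window, no client ever sees a repeated or 61st second; the cost is that clocks following the smear disagree with unsmeared UTC by up to a second while the window is open.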
