How the Leap Second Bug Led Facebook To Build DCIM Tools
miller60 writes "On July 1, 2012, the leap second time-handling bug caused many Linux servers to get stuck in a loop. Large data centers saw power usage spike, sometimes by megawatts. The resulting "server storm" prompted Facebook to develop new data center infrastructure management (DCIM) software, providing real-time data on everything from the servers to the generators. The incident also offered insights into the value of flexible power design in its server farms, which kept the status updates flowing as the company nearly maxed out its power capacity."
Re:What triggered the bug anyway? (Score:4, Interesting)
That was the one that caused Java processes to run away and use 100% CPU, wasn't it? From what I remember, it was only in a small subset of recent kernels, and older ones were fine.