How the Leap Second Bug Led Facebook To Build DCIM Tools
miller60 writes "On July 1, 2012, the leap second time-handling bug caused many Linux servers to get stuck in a loop. Large data centers saw power usage spike, sometimes by megawatts. The resulting 'server storm' prompted Facebook to develop new data center infrastructure management (DCIM) software, providing real-time data on everything from the servers to the generators. The incident also offered insights into the value of flexible power design in its server farms, which kept the status updates flowing as the company nearly maxed out its power capacity."
DCIM (Score:5, Insightful)
System QoS (Score:3, Insightful)
How often does the leap second bug recur? If it is known to occur, then why would such platforms be relied upon instead of being patched ahead of time?
It seems to me that developing new DCIM solutions is a bit of a stretch as a fix for the leap second issue. Or is that just an excuse to fund new DCIM solutions (in other words, a solution in search of a problem)?
Re:DCIM (Score:4, Insightful)