Leap Second Bug Causes Crashes
An anonymous reader writes in with a Wired story about the problems caused by the leap second last night. "Reddit, Mozilla, and possibly many other web outfits experienced brief technical problems on Saturday evening, when software underpinning their online operations choked on the “leap second” that was added to the world’s atomic clocks. On Saturday, at midnight Greenwich Mean Time, as June turned into July, the Earth’s official time keepers held their clocks back by a single second in order to keep them in sync with the planet’s daily rotation, and according to reports from across the web, some of the net’s fundamental software platforms — including the Linux operating system and the Java application platform — were unable to cope with the extra second."
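For readers wondering why one extra second can trip software at all: the inserted second is conventionally written 23:59:60 UTC, and many time libraries have no slot for it. A minimal sketch using Python's standard `datetime` module (chosen here only as an illustration; it is not one of the platforms named in the story):

```python
from datetime import datetime, timezone

# The inserted second is conventionally written 23:59:60 UTC.
try:
    datetime(2012, 6, 30, 23, 59, 60, tzinfo=timezone.utc)
    leap_representable = True
except ValueError:
    # datetime only accepts seconds in the range 0..59
    leap_representable = False

before = datetime(2012, 6, 30, 23, 59, 59, tzinfo=timezone.utc)
after = datetime(2012, 7, 1, 0, 0, 0, tzinfo=timezone.utc)

# Leap-second-unaware arithmetic reports one second between these two
# instants, even though the leap second made the real gap two seconds.
naive_gap = (after - before).total_seconds()
print(leap_representable, naive_gap)
```

Code that assumes every minute has exactly 60 seconds has to absorb the extra one somewhere, typically by repeating or stepping a timestamp.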
What about Windows and Mac? (Score:5, Interesting)
Re:Why now? (Score:3, Interesting)
We will keep having these kinds of issues for as long as some people fail to understand that time of day is an arbitrary number whose main utility lies in it being composed of predictable periods and divided into homogeneous units. It should have no relation whatsoever to whatever time the sun happens to rise or set at any particular location, and above all it should not be changed to accommodate fluctuations in the rotation of a rock circling an arbitrary star. Abominations like leap seconds or daylight savings make the whole system less useful by merely existing.
But personally I wouldn't be surprised if people off the equator were to get summer minutes composed of 120 seconds during daytime (or even better, a scale!) to ensure the sun rises and sets at the same time year round. Or, hey, why not simply make the seconds longer? Or a combination of both, plus we can define pi to be 3 to make things simpler.
I always thought leap seconds were stupid (Score:3, Interesting)
Why not bundle them and apply them every 10 or 20 years?
And apparently I'm not alone:
http://en.wikipedia.org/wiki/Leap_second#Proposal_to_abolish_leap_seconds [wikipedia.org]
Hogwash. Astronomers can find coping mechanisms; it's either that or these ridiculous levels of stress for system admins.
Re:Linux kernel unable to cope? I think not. (Score:5, Interesting)
I run Arch Linux with kernel 3.4.4 and it went haywire. My machine was very heavily loaded at the time, and when the leap second happened, mysqld, firefox, and ksoftirqd processes started consuming 100% CPU. The load average was well over 10 and the machine was grinding along. It didn't actually fail but it was loaded down.
Even restarting the processes didn't fix it. The high load would go away once I stopped the processes, but as soon as I started them again the load would come right back. I had Firefox open on a blank page not doing anything and it was slammed at 100% CPU, with a couple of ksoftirqd tasks slammed at 100% CPU each too.
I had to reboot the machine to get it back to normal.
I have Ubuntu and Debian servers that for whatever reason did not add the leap second so they were fine. Their time was a second off today though (at least until ntp slowly corrected it or I manually intervened).
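As a rough back-of-the-envelope for "slowly corrected": assuming the clock is slewed (gradually adjusted) rather than stepped, and assuming a slew cap of 500 parts per million, which is the commonly cited limit for NTP-disciplined clocks, a one-second offset takes a while to absorb. Both the slewing assumption and the 500 ppm figure are mine; what a given ntpd actually does depends on its configuration.

```python
# Assumed values, not taken from the servers described above
offset_seconds = 1.0   # clock is one second off after the missed leap second
slew_ppm = 500         # assumed maximum slew rate in parts per million

# At 500 ppm the clock gains or loses 500 microseconds per real second,
# so removing the offset takes offset * 1e6 / ppm seconds.
seconds_to_correct = offset_seconds * 1_000_000 / slew_ppm
minutes_to_correct = seconds_to_correct / 60

print(seconds_to_correct, round(minutes_to_correct, 1))
```

Under those assumptions the correction takes on the order of half an hour, which fits with the clocks still being visibly off for a while afterwards.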
Only Linux affected? (Score:5, Interesting)
Google on how they fixed that.. (Score:4, Interesting)
Google official blog: "Time, technology and leaping seconds" (Sept. 2011)
http://googleblog.blogspot.in/2011/09/time-technology-and-leaping-seconds.html [blogspot.in]
I wonder if the leap second has anything to do with the labs Chubby paper / site currently being offline..
Re:All of my servers were fine (Score:4, Interesting)
Our problem was with a third-party monitoring solution: its daemon process brought every single one of our servers to a near halt by consuming all available CPU cycles at the stroke of midnight GMT.
The OS itself was fine.
This monitoring software is common enough that it likely was behind a lot of the issues seen around the 'net.
Re:I always thought leap seconds were stupid (Score:4, Interesting)
Re:What about Windows and Mac? (Score:4, Interesting)
As far as I can tell, all current operating systems handled it fine. It's applications that have problems, mainly server-type apps that actually use the clock for important things.
Linux being heavily affected is just a side-effect of most servers running Linux (although apparently some older versions don't handle leap seconds so cleanly - maybe that has something to do with it?).
Yes, at least one of the problems appears to be a Linux kernel problem [lkml.org]. However, as that thread indicates, the consequence of this isn't a kernel crash; it causes futexes [kernel.org] to repeatedly time out (or, at least, causes futexes with timeouts to repeatedly time out). I'm guessing, perhaps incorrectly, that this might mean that code waiting for a futex gets a kernel wakeup due to a timeout, checks whether the condition being waited for has happened, discovers that it hasn't, sleeps in the futex again, gets a kernel wakeup due to a timeout, checks whether the condition being waited for has happened, discovers that it hasn't, sleeps in the futex again, lathers, rinses, repeats, so it makes no progress and chews up tons of CPU.
If so, then:
so Linux being heavily affected might also be a side-effect of, well, some versions of the Linux kernel having a bug that's triggered by leap seconds.
However, unless an application happens to use futexes in a fashion that trips over the bug, they won't be affected. It might be server applications that are most likely to do so, meaning that you might not see it on, say, a desktop or handheld Linux machine, or even on some servers.
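To make the guessed-at loop concrete, here is a minimal user-space sketch of the wait/check/sleep pattern, using Python's `threading.Condition` as a stand-in for a futex with a timeout. It is not the kernel code path: the "broken" case is simulated by passing a zero timeout so that every wait returns immediately, and `wait_for_flag`, `state`, and the timing values are all hypothetical names and numbers chosen for the illustration.

```python
import threading
import time

def wait_for_flag(cond, state, timeout, run_for):
    """Loop on a timed wait until the flag is set or run_for seconds pass.

    Returns how many times the wait timed out without the flag being set.
    """
    timeouts = 0
    deadline = time.monotonic() + run_for
    with cond:
        while not state["ready"] and time.monotonic() < deadline:
            # wait() returns False when the timeout expires without a notify
            if not cond.wait(timeout):
                timeouts += 1
    return timeouts

cond = threading.Condition()
state = {"ready": False}  # nobody ever sets this, so every wakeup is a timeout

# Healthy timer: each wait really sleeps for about 50 ms
healthy = wait_for_flag(cond, state, timeout=0.05, run_for=0.3)

# Stand-in for the buggy case: the timeout fires immediately, so the loop spins
broken = wait_for_flag(cond, state, timeout=0.0, run_for=0.3)

print(healthy, broken)
```

In the healthy run the loop wakes only a handful of times; in the simulated broken run it wakes as fast as the CPU allows, which would look like the 100% CPU symptom people are reporting. Restarting the process would not help in that scenario, since the new process would enter the same loop.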
Re: (Score:4, Interesting)
The hard system lock bug due to a leap second was patched in 2.6.29, so either you've got some weird related bug, or something is very wrong.
Well, the weird related bug would arguably count as something being wrong. Apparently there is a bug in the handling of the insertion of positive leap seconds that could cause weird behavior with [lkml.org] futexes [kernel.org], and that bug appears not to have been fixed until at least July 1, 2012 (I'm guessing John Stultz has worked up a patch [lkml.org]).