Azure Failure Was a Leap Year Glitch
judgecorp writes "Microsoft's Windows Azure cloud service was down much of yesterday, and the cause was a leap year bug as the service failed to handle the 29th day of February. Faults propagated making this a severe outage for many customers, including the UK Government's recently launched G-cloud service."
Who could have foreseen a leap year coming? (Score:5, Funny)
Seriously, if my American high school education taught me nothing else, it was that those things only come along like every 100 years or something.
Re:TCO TIC (Score:5, Funny)
Obviously you didn't inform yourself with the very helpful and informative "Get The Facts" materials Microsoft provided us with a few years ago. If you had you would know how much higher the TCO of Linux on the server is even after a massive outage.
Re: (Score:2)
Re:Who could have foreseen a leap year coming? (Score:5, Funny)
In all fairness, Microsoft never figured anyone would still be using this service by the time a leap year rolled around.
Re: (Score:3)
Re:Who could have foreseen a leap year coming? (Score:5, Funny)
In all fairness, Microsoft never figured anyone would still be using this service by the time a leap year rolled around.
I work on the Azure team and I can confirm this.
Re:Who could have foreseen a leap year coming? (Score:5, Funny)
Microsoft has solved the problem and applied a patch to their systems.
The new patch is anticipated to keep the service up and stable for at least 4 years.
-
Re:Who could have foreseen a leap year coming? (Score:4, Interesting)
In all fairness, Microsoft never figured anyone would still be using this service by the time a leap year rolled around.
Ah, that explains why Zunes went dark on New-Years 2009... [pcworld.com]
Think about this. You're a software dev, and you use a MS C++ compiler. They wrote their standard libs, including the "time.h" / <ctime> code... you use their time libraries.
Now two things:
0. MS employs some real nut-jobs that can't even use the standard time functions and instead write their own for each project...
or
1. MS doesn't even trust their own compiler / libraries to do the right thing?
It scares me to think that MS makes operating systems... IMHO, they should get back to BASICs.
Re: (Score:2)
I thought that was Halley's comet?
Re:Who could have foreseen a leap year coming? (Score:5, Funny)
Save us, Captain Obvious! *swoons* :P
Re:Who could have foreseen a leap year coming? (Score:5, Insightful)
Actually, every hundredth year is when a leap year doesn't come along (unless it's divisible by 400, then it does).
Right; and I wonder how many computer failures will happen on the first of March, 2100, due to part of the software thinking it's the 29th of February, causing random problems while talking to other software that knows the correct date.
We all know it's gonna happen ...
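For reference, the rule being discussed fits in one line. Here is a minimal sketch comparing the full Gregorian rule with the "divisible by 4" shortcut; the function names are made up for illustration, and the shortcut is just an assumed example of the kind of code that would misfire in 2100.

```python
import calendar
from datetime import date, timedelta

def is_leap(year: int) -> bool:
    """Gregorian rule: every 4th year, except centuries, unless divisible by 400."""
    return year % 4 == 0 and (year % 100 != 0 or year % 400 == 0)

def naive_is_leap(year: int) -> bool:
    """The common shortcut that only checks divisibility by 4 (hypothetical bug)."""
    return year % 4 == 0

# Find the first year from 2012 onward where the two rules disagree.
first_disagreement = next(
    y for y in range(2012, 2500) if is_leap(y) != naive_is_leap(y)
)

# The standard library already knows there is no Feb 29 in 2100.
day_after_feb_28_2100 = date(2100, 2, 28) + timedelta(days=1)

print(first_disagreement)      # 2100
print(day_after_feb_28_2100)   # 2100-03-01
```

Code using the shortcut is right for every year from 1901 through 2099, which is why nobody notices it until then.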
Same Story / Different Day (Score:5, Funny)
Re:Same Story / Different Day (Score:5, Insightful)
Re:Same Story / Different Day (Score:4, Funny)
Re:Same Story / Different Day (Score:5, Interesting)
What is with MS and their apparent inability to cope with leap years?
Re: (Score:3)
I would like to know the same thing. This seems to be systemic. Is there something inherently confusing or flawed in the way Microsoft approaches elapsed-time calculations? Is it due to the internal representation of time that they use, or is there some reason that developers on their platform are doing calculations using date strings, or is it something else?
Re: (Score:2)
Re: (Score:3)
Re:Same Story / Different Day (Score:5, Insightful)
I had a similar thought about code reuse, but an entirely different conclusion: I thought that they weren't re-using good code since the same problem has cropped up at least two times. That sounds more like a case of re-rolling things that definitely shouldn't be re-rolled (date/time handling) to me.
In either event, they're not using particularly good practices. Either they are constantly reinventing the wheel and apparently in error-prone ways, or they are re-using code but paying no attention to keeping that external code up to date.
The only other thing I can think of is that Azure is somehow so drastically different than anything else they have ever done that they had to do the code again from scratch -- which is probably a problem all by itself.
Re: (Score:3)
I would like to know the same thing. This seems to be systemic. Is there something inherently confusing or flawed in the way Microsoft approaches elapsed-time calculations? Is it due to the internal representation of time that they use, or is there some reason that developers on their platform are doing calculations using date strings, or is it something else?
I wonder if it is related to the fact that MS systems apparently store the system time in local time instead of UTC?
At least, that's my understanding of how they store it, could be wrong.
Re:Same Story / Different Day (Score:5, Interesting)
What is with MS and their apparent inability to cope with leap years?
I would like to know the same thing. This seems to be systemic.
Yeah; it's systemic. Or at least it used to be a few years back, and I wouldn't be surprised if they haven't fixed the basic problem yet. The problem is fairly simple: Windows' internal clock is in local time.
To a programmer with experience writing date/time code, I've found that this is all you need to tell them. Any software whose internal clock is in local time will be buggy, and it will never be completely fixed. Attempts to fix bugs will merely introduce bugs elsewhere in the chains of date/time handling. The sensible solution is to adopt a "universal time" internally, and convert at the last stage when you present the date/time to a human user. Yes, you theoretically can work with local time internally, but (teams of) humans can't actually make this work in practice. The best they can do is make it work in the "normal" cases. Bug fixes then tend to just move the time bugs around to different places in the code. But it can be very difficult to get management to accept this and agree to UT-only internally.
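A rough sketch of the "universal time inside, local time only at the edge" approach described above. The fixed UTC-5 offset and the helper names are assumptions for illustration; a real program would use a proper time zone database rather than a hard-coded offset.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical fixed offset standing in for a user's local zone (UTC-5).
USER_ZONE = timezone(timedelta(hours=-5), "UTC-5")

def add_hours(stamp_utc: datetime, hours: int) -> datetime:
    """Do arithmetic on the UTC timestamp, never on the local rendering."""
    return stamp_utc + timedelta(hours=hours)

def render_for_user(stamp_utc: datetime, zone: timezone = USER_ZONE) -> str:
    """Convert to local time only at the last step, for display."""
    return stamp_utc.astimezone(zone).strftime("%Y-%m-%d %H:%M")

stored = datetime(2012, 2, 29, 2, 30, tzinfo=timezone.utc)  # kept in UTC
later = add_hours(stored, 24)

print(render_for_user(stored))  # 2012-02-28 21:30
print(render_for_user(later))   # 2012-02-29 21:30
```

Everything that is stored, compared, or added to stays in UTC; the local rendering is a string that never flows back into the calculations.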
Java also used to specify local time internally (and may still do so, but I haven't used it in years). I worked on a number of projects where, after repeated date/time disasters at every switch to/from DST and every Feb 29, Java was abandoned and everything was rewritten in a language (usually C++) whose libraries supported a UT timestamp and didn't have all those time bugs.
Does anyone know if MS Windows has introduced a UT internal time yet? If not, then we can reliably predict that such bugs will continue to plague their users.
Re: (Score:3)
To the best of my knowledge, Java's System.currentTimeMillis() has always been UTC milliseconds since the UNIX epoch.
That doesn't stop someone messing up and having the computer set up badly, e.g. correct apparent local time but wrong timezone, causing the "UTC" time to be wrong.
Rgds
Damon
Re:Same Story / Different Day (Score:4, Funny)
According to Microsoft all time started on Jan 1, 0001.
http://msdn.microsoft.com/en-us/library/system.datetime.ticks.aspx [microsoft.com]
No fricking wonder the "system idle process" uses 19% of a cpu. The OS is counting to a billion every second.
Ooops. But they still do lots of things in 32-bit land, too.
http://msdn.microsoft.com/en-us/library/system.datetime.aspx [microsoft.com]
Re: (Score:3)
Their calendar ending is extremely significant! To anyone running a Mayan computer. Those things are all going to crash hard this year! It's going to be like Y2K all over! :)
Re: (Score:3)
Re: (Score:2)
I'm wondering if the bug started in some early version of ms-dos and then propagated to all their other projects ever since through code re-use...
Re:Same Story / Different Day (Score:5, Interesting)
Re:Same Story / Different Day (Score:5, Funny)
Well, we knew it was a Microsoft product so we knew they bought it from someone.
Re:Same Story / Different Day (Score:4, Informative)
Yeah, it was a really stupid bug, especially when you consider the OS provides a very useful set of APIs for dealing with it: basically convert a SYSTEMTIME (day/month/year/hh/mm/ss) into a FILETIME (64-bit unsigned int similar to time_t), do your math (the compiler will handle the 64-bit computations for you) and convert it back. Two OS calls.
If you're having to do leap year calculations or even any sort of date calculations, stop. The OS or library will probably already have a set of functions for doing date calculations without you having to do it manually. Given how easy they are to screw up, far better to leave it to someone else.
Hell, given Windows worked fine, I don't even want to know what Azure is doing - the fundamental OS and runtimes all handle leap year date calculations with aplomb. Heck, that might be some of the oldest code in the kernel these days because it was written a long time ago, works well and has been thoroughly debugged through the decades.
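The same "convert to a scalar, do the math, convert back" pattern, sketched in Python rather than with the Win32 calls mentioned above. The helper names are invented for this example, and seconds since the Unix epoch stand in for FILETIME; it is an analogy, not the Windows API.

```python
from datetime import datetime, timezone

def to_scalar(year, month, day, hour=0, minute=0, second=0) -> int:
    """Broken-down UTC fields -> whole seconds since the Unix epoch."""
    stamp = datetime(year, month, day, hour, minute, second, tzinfo=timezone.utc)
    return int(stamp.timestamp())

def from_scalar(seconds: int) -> datetime:
    """Whole seconds since the Unix epoch -> broken-down UTC fields."""
    return datetime.fromtimestamp(seconds, tz=timezone.utc)

SECONDS_PER_DAY = 24 * 60 * 60

# One day after Feb 28: the library decides whether Feb 29 exists.
after_2012 = from_scalar(to_scalar(2012, 2, 28) + SECONDS_PER_DAY)
after_2013 = from_scalar(to_scalar(2013, 2, 28) + SECONDS_PER_DAY)

print(after_2012.date())  # 2012-02-29
print(after_2013.date())  # 2013-03-01
```

No leap year logic appears anywhere in the calling code, which is the whole point.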
Re:Same Story / Different Day (Score:5, Interesting)
Re: (Score:2)
"If you're a Zune Pass subscriber, you may need to sync your device with your PC to refresh the rights to the subscription content you have downloaded to your device.
That's the scariest sentence in that entire article.
Re: (Score:3)
Great... now I'm going to have nightmares
28 days (Score:5, Funny)
Re: (Score:2)
http://www.usnews.com/news/articles/2012/02/29/february-29-all-work-and-no-pay
Re: (Score:2)
In 2012 there are 261 weekdays. In 2011, there were 260. But in 2010 and 2009 there were also 261 weekdays. So not really.
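Those counts are easy to check. A small sketch that counts Monday-to-Friday days per calendar year (holidays ignored; the function name is made up here):

```python
from datetime import date, timedelta

def weekdays_in_year(year: int) -> int:
    """Count Monday-to-Friday days in the given calendar year."""
    day = date(year, 1, 1)
    count = 0
    while day.year == year:
        if day.weekday() < 5:  # 0..4 are Monday..Friday
            count += 1
        day += timedelta(days=1)
    return count

for year in (2009, 2010, 2011, 2012):
    print(year, weekdays_in_year(year))
# 2009 261
# 2010 261
# 2011 260
# 2012 261
```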
Re: (Score:2)
yeah... because salary workers never work on weekends.
Funny, as an hourly person, I never work weekends... or more than 40 a week.
Re: (Score:2)
Article makes an assumption that is usually not true--that salaried workers get paid on a month-related schedule, which always causes all sorts of problems, not just leap-year related. I'm salaried and get paid every two weeks; leap year means squat as far as getting paid is concerned.
Re: (Score:2)
When I was on salary (back in ancient times) the deal was that I got $10,000/year (I said ancient times) paid out in 12 equal installments. Yes, the pay periods varied in length. So what?
Re:28 days (Score:5, Funny)
What is it with Microsoft and Leap Year? (Score:3, Informative)
Re:What is it with Microsoft and Leap Year? (Score:5, Informative)
Now, I'm not necessarily a Microsoft apologist, but I have to point out that it wasn't so long ago that other things near and dear to us geeks were experiencing similar problems.
I was trying to run some ant scripts yesterday that interact with an FTP server to delete some files. Those damned files wouldn't get deleted. They weren't even returned from a listing command. As it turns out, I was using a particularly old version of Apache Commons-Net library (this jar file was from 2005) which had a leap-year bug. It simply would not show me files with modification dates of 2/29. I was looking at the FTP server configuration, logging in with other clients, moving and renaming files, and all about ready to break out Wireshark... and then it occurred to me that it was leap day. Hoo-fucking-ray. "touch"ed the file, and sure enough, it was suddenly available. Those are a few hours of my life I'll never get back.
Re: (Score:3, Funny)
Now, I'm not necessarily a Microsoft apologist, but I have to point out that it wasn't so long ago that other things near and dear to us geeks were experiencing similar problems.
I was trying to run some ant scripts yesterday that interact with an FTP server to delete some files. Those damned files wouldn't get deleted. They weren't even returned from a listing command. As it turns out, I was using a particularly old version of Apache Commons-Net library (this jar file was from 2005) which had a leap-year bug. It simply would not show me files with modification dates of 2/29. I was looking at the FTP server configuration, logging in with other clients, moving and renaming files, and all about ready to break out Wireshark... and then it occurred to me that it was leap day. Hoo-fucking-ray. "touch"ed the file, and sure enough, it was suddenly available. Those are a few hours of my life I'll never get back.
Your post is not anti-Microsoft, so you must be a shill.
Re: (Score:2)
> Your post is not anti-Microsoft, so you must be a shill.
It's anti-Java and therefore anti-Oracle. That makes it ok.
Re: (Score:3)
As it turns out, I was using a particularly old version of Apache Commons-Net library (this jar file was from 2005) which had a leap-year bug. It simply would not show me files with modification dates of 2/29.
If this is the bug you're talking about [apache.org], it appears a bug report was filed, discussed, and a temporary workaround was offered (perhaps more than one). Although free software has bugs just like proprietary software, the way they are reported and handled is night and day.
Re: (Score:3)
If this is the bug you're talking about [apache.org], it appears a bug report was filed, discussed, and a temporary workaround was offered (perhaps more than one). Although free software has bugs just like proprietary software, the way they are reported and handled is night and day.
That appears to be the very same bug, yes. And I'm not disputing the handling of the bug, but merely pointing out that even with "many eyes", this bug existed not so very long ago. Open Source is not immune to the same kinds of problems, though I grant you I probably should have been checking for the latest compatible libraries.
Re: (Score:3)
Open Source is not immune to the same kinds of problems, though I grant you I probably should have been checking for the latest compatible libraries.
Whenever I run into a bug like this where something just seems to go *poof*, that's the first thing I check. Learned my lesson a few years ago with an old version of a library and a cranky sysadmin who would just not believe it was the source of the problem. I wrote not one, but *three* workarounds, none of which he was willing to use, then he just updated the library with the latest version and everything worked.
Re: (Score:2)
Do I have a job for you!
To the cloud (Score:2)
In a new press conference.. (Score:5, Funny)
Microsoft has told the press that they don't expect the Azure cloud service to fail again for years. In an unrelated schedule change, a down-for-maintenance slot was scheduled 4 years in advance.
Re: (Score:2)
It took the Zune slot.
office in the cloud (Score:5, Funny)
It's sold as Office 365 not Office 366
Re: (Score:2)
This behaviour is by design.
Prepared for future (Score:4, Informative)
Re:Prepared for future (Score:5, Informative)
Re: (Score:3)
UNIX Epoch FTW (Score:2)
Re: (Score:2)
Not necessarily. It probably wouldn't have been a complete failure, but even software using epoch time internally can have problems.
Remember, you still have to put it into Regular Time to display to users. And take input in from users - if someone schedules a task for 2011-02-29, it should fail, but if it's scheduled for 2012-02-29 it should be allowed. And maybe there's business logic to figure out whether it's a weekend and such, which could easily be thrown off during the UNIX->Gregorian conversion...
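The scheduling case above can be sketched by letting the date library reject days that don't exist. The function names are hypothetical, and returning None on bad input is just one possible policy.

```python
from datetime import date
from typing import Optional

def parse_schedule_date(text: str) -> Optional[date]:
    """Return the date for a YYYY-MM-DD string, or None if no such day exists."""
    try:
        return date.fromisoformat(text)
    except ValueError:
        return None

def is_weekend(day: date) -> bool:
    """Saturday and Sunday are weekday() values 5 and 6."""
    return day.weekday() >= 5

for text in ("2011-02-29", "2012-02-29"):
    parsed = parse_schedule_date(text)
    if parsed is None:
        print(text, "rejected: not a real calendar day")
    else:
        print(text, "accepted, weekend:", is_weekend(parsed))
# 2011-02-29 rejected: not a real calendar day
# 2012-02-29 accepted, weekend: False
```

The weekend check runs on the parsed date object, not on a hand-rolled day count, so it can't drift from the conversion.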
Everything MS does as "me too" sucks. (Score:5, Insightful)
It seems that all of MS's copied products (Hotmail, Azure, Zune) are done with a "me too" attitude of just having something so that they don't get left behind. They don't really try to make these "me too" products into industry leaders. But here's the catch. I know plenty of IT people who will always choose MS's offering because, as I was told, "you don't get in trouble for choosing MS". And that knowledge seems to be built into MS's offerings.
Re: (Score:3)
You 'don't get fired for buying IBM.'
Then 'You don't get fired for buying MS'
Next will be Google, probably.
Never Apple because they are a high end system in a niche market. Yea, people willing to pay 1200+ for a computer is a high end niche market.
Re: (Score:3)
Never Apple because they are a high end system in a niche market.
Yeah, I'm pretty sure joining that race to the bottom is the way to go.
Re: (Score:3)
Um, no. Hotmail was acquired by Microsoft in 1997. Hotmail originally ran on FreeBSD and Solaris. Eventually Microsoft ported it to Windows, but they had reliability problems for a long time. I'm surprised you got this so badly wrong given your impressively low Slashdot ID. When corrected, the example you gave exactly counters the argument you were trying to make. I guess that the true origins of Hotmail are not something that Microsoft hides, but probably not something they mention much either. In fact, i
If MS starts making excuses (Score:2)
If MS starts making excuses for this mini fiasco, they will only manage to make Amazon, Google and Apple look like fucking geniuses
Only Happens Every 4 Years (Score:5, Funny)
It's not Microsoft's fault; they're a publicly traded company so they can't think about multi-year events. They're prohibited from considering anything that is beyond the next fiscal quarter.
Re: (Score:2)
They pay no dividend for that extra day. That they don't know about. And didn't learn about last leap year.
Inexcusable (Score:2)
Simply inexcusable.
Re: (Score:3)
True
Apparently MS only hires people with no concept of future.
See Zune event, see this. Absolute disregard of date concepts (and testing)
Now, Linux development worries about this THOROUGHLY (I mean, kernel and main libs, of course a sw developer can get this wrong on Linux as well)
No counter wrap-around fsck-ups, date exceptions, etc (people are watching this)
Remember the Windows bug where it would crash after 15 days or so?
Re: (Score:2)
It was 49.7 days: http://news.cnet.com/Windows-may-crash-after-49.7-days/2100-1040_3-222391.html [cnet.com]
And still inexcusable.
Re: (Score:3)
It was 49.7 days: http://news.cnet.com/Windows-may-crash-after-49.7-days/2100-1040_3-222391.html [cnet.com]
And still inexcusable.
I remember that one - it wasn't a crash in the usual sense, where something stops working completely. It was far more insidious than that. Everything still looked as if it was working; the cursor moved when you moved the mouse, icons would highlight if single-clicked, but double-click would refuse to play...
IIRC, the 49.7 days is 2^32 milliseconds
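A quick check of the arithmetic: a 32-bit counter of milliseconds wraps after roughly 49.7 days, whereas 2^16 seconds is only about 18 hours.

```python
MS_PER_DAY = 24 * 60 * 60 * 1000

# Days until an unsigned 32-bit millisecond counter wraps around.
wrap_32bit_ms_days = 2**32 / MS_PER_DAY

# For comparison: 2^16 seconds, expressed in hours.
seconds_2_16_hours = 2**16 / 3600

print(round(wrap_32bit_ms_days, 1))  # 49.7
print(round(seconds_2_16_hours, 1))  # 18.2
```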
Single Point of Failure (Score:5, Insightful)
Re:Single Point of Failure (Score:5, Insightful)
That's a flaw in the idea of a monoculture; true redundancy has different software implementing the same basic standards...
Like how the Internet is built from routers made by different vendors: Cisco, Juniper, software-based Linux/BSD devices, etc. When new DoS vulnerabilities are found in one vendor's kit it doesn't take down the whole Internet, because other vendors are immune.
Re: (Score:2)
Unless there is a bug in the "same basic standards" that the heterogeneous systems use...
Re: (Score:2)
Unless there is a bug in the "same basic standards" that the heterogeneous systems use...
Ah... like the WMF exploit... I was explaining to someone at the time that the problem with the exploit was that it was functioning exactly as designed and intended. >_<
Re: (Score:2)
That's why I'm about to launch my new "Ionosphere" product. It is a cloud that runs on top of clouds. So when Azure fails Amazon picks up the slack and so on. Trademarked/copyrighted/patent pending suckas.
-d
A leap year issue? Are you SERIOUS? (Score:5, Insightful)
Given how many DECADES leap year calculations have had to be done and how many years it's been since we fixed the Y2K issues (at great expense, I might add), it is absolutely UNACCEPTABLE for someone to blame a leap year calculation for down time.
The DIRECTOR of the service division at Microsoft should be FIRED for this failure.
Expect lawsuits from customers, Microsoft. Because this was a problem you KNEW about and should have written code to deal with.
What a pathetic excuse for planning and testing on Microsoft's part.
Re: (Score:3, Funny)
Shouldn't 'pathetic' be in uppercase?
Re: (Score:2)
> Expect lawsuits from customers...
I'm sure it's covered by the TOS. That's an area where they DON'T make mistakes.
Re: (Score:2)
Maybe everyone should get together and file a class action suit - the lawyers will get millions, the users will get 30 seconds of their life back.
Re: (Score:3)
Yep, as soon as they fire the guy at Apple who was responsible for the iPhone daylight saving bug in November 2010 [pcworld.com]. And again [pcworld.com] in March 2011.
Also Flickr [dpreview.com], with their own leap year bug. And Sony [dpreview.com].
Not saying this isn't a black eye for MS - and yes, testing for leap year should be thought out ahead of time - but in fairness they're not the only people to have ever been caught out by something like this.
Re: (Score:2)
Authentication issues perhaps?
Re: (Score:2)
The director has nothing to do with this.
Really? Perhaps you can explain precisely what the director in charge of this service is doing to earn their salary?
Re: (Score:3)
I've been programming for over 30 years.
Yes, I know damned well how widely date manipulation packages are used.
And I know what this little thing called "regression testing" is.
And I blame the MANAGEMENT for failing to ensure that the testing was done before approving rollout of the software. Why do you think management gets paid the big bucks? Because at the end of the day, THEY'RE responsible for ensuring that joe-schmoe-programmer and the QA team did their jobs before they sign off on rolling out
It wasn't just Microsoft... (Score:5, Interesting)
...they just had the most publicly catastrophic failure. I just noticed that all of the Google Chat messages I received yesterday were sent to me at various times on December 31, 1969.
And it also seems that I didn't even receive any of them until today, March 1, implying that they were incapable of even sending them yesterday.
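December 31, 1969 is typically what a zero (or missing) Unix timestamp looks like when displayed in a time zone west of UTC. Whether that is what happened here is an assumption, since nothing above says so, but the effect is easy to reproduce. The UTC-5 viewer offset is hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical viewer offset: UTC-5 (any offset west of UTC behaves the same).
VIEWER_ZONE = timezone(timedelta(hours=-5))

missing_timestamp = 0  # a zero or unset Unix timestamp

as_utc = datetime.fromtimestamp(missing_timestamp, tz=timezone.utc)
as_local = as_utc.astimezone(VIEWER_ZONE)

print(as_utc.isoformat())    # 1970-01-01T00:00:00+00:00
print(as_local.isoformat())  # 1969-12-31T19:00:00-05:00
```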
Arthur David Olson is my hero (Score:5, Informative)
Not a surprise... (Score:2)
All bets are off... (Score:2)
Bet it won't happen again next year!
Does anyone know... (Score:3)
WTF? (Score:2)
Attention Microsoft: (Score:5, Funny)
Re:Attention Microsoft: (Score:4, Funny)
The thought of an MS pacemaker EULA is pretty scary....
Some of the most common leap-year bugs (Score:5, Informative)
Some of the common leap year bugs that I've seen over the years:
1. A matrix with the number of days per month:
e.g. smallint dayspermonth[12]={31,28,31,30,31,30,31,31,30,31,30,31};
Indexing into the matrix for February (index 1) ignores leap years.
2. A matrix with 365 elements to represent a year's worth of something:
e.g. smallint hightemps[365];
This usually doesn't fail until Dec 31 of a leap year, when hightemps[mydate.dayofyear()-1] points to a non-existent element. Of course, if dayofyear is calculated using the matrix in the prior bug, it will fail invisibly since that will be incorrect as well.
3. Quick-n-dirty subtract-one-year math:
e.g. Convert date to char in YYYYDDMM format, convert char to int, subtract 10000, convert back to a char and then date.
Why people do this when dateadd(year,mydate,-1) is that easy, I have no clue. But it breaks horridly when you use it to determine "one year ago today" from Feb 29.
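The three bugs above, and one way around each, sketched in Python. The helper names are invented for this example, the integer trick uses YYYYMMDD ordering for readability, and falling back to Feb 28 for "one year before Feb 29" is an assumed policy rather than the only defensible one.

```python
import calendar
from datetime import date

# Bug 1: a fixed table that always says February has 28 days.
DAYS_PER_MONTH = [31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31]

def days_in_month(year: int, month: int) -> int:
    """Ask the library instead of indexing a fixed table."""
    return calendar.monthrange(year, month)[1]

def days_in_year(year: int) -> int:
    """Size per-day arrays from this rather than a hard-coded 365."""
    return 366 if calendar.isleap(year) else 365

def one_year_before(day: date) -> date:
    """Same month and day one year earlier; Feb 29 falls back to Feb 28."""
    try:
        return day.replace(year=day.year - 1)
    except ValueError:
        return day.replace(year=day.year - 1, day=28)

leap_day = date(2012, 2, 29)

print(DAYS_PER_MONTH[1], days_in_month(2012, 2))  # 28 29
print(date(2012, 12, 31).timetuple().tm_yday)     # 366, so index 365 is needed
print(int(leap_day.strftime("%Y%m%d")) - 10000)   # 20110229, not a real date
print(one_year_before(leap_day))                  # 2011-02-28
```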
Re: (Score:3)
... Convert date to char in YYYYDDMM format ...
A few years ago, one of the fixes I made to a front end security tool was to change all the YYYYDDMMs to YYYYMMDDs, to match up with all the other YYYYMMDDs elsewhere in the code.
It's astonishing to me that having gone through Y2k, some people are *still* failing miserably at handling dates. I mean, it's not like YYYYMMDD is an ISO Standard or anything, right?
Microsoft Never Has Been Good At Time (Score:5, Interesting)
Funnily enough, I used to work at IBM doing OS/2 tech support. OS/2 and Windows NT share a common heritage, so a lot of the behind-the-scenes problems I witnessed in OS/2 were (and sometimes still are) problems with Windows. I'm not sure if this is one of them, but I got a call once from a guy who was trying to use his OS/2 system to track satellites. The problem was, the OS/2 timer API specified that you could set milliseconds but it didn't seem to work. I tracked it down to a timing driver which tracked two separate interrupts. The first interrupt happened every few milliseconds and would update the clock millis when that happened. However, if the system was busy it was possible to not handle that interrupt. There was also a system periodic interrupt every 1 second. When that occurred, the system hard-reset the milli time and incremented the seconds. So you could set the millis, but the clock would become inaccurate 1 second later. Just one example of how time has been a thorn in my side for my entire career. I wrote an APAR up on it which was promptly closed "Working as Designed." Dunno if he ever got it fixed...
Typical Microsoft Spin (Score:2)
"Oh Hai, yeah we just carried the one! It's all fixed now. Nothing to worry about, back to work"
Time is tricky (Score:2)
As I keep telling my fellow software developers, time is one of those things in software that tend to go wrong. Few developers give it the attention it deserves. Between different formats, timezones, changing timezones (including DST), leap seconds, and limits on what can be represented, there is plenty of opportunity for errors. And contrary to what you might hope, using an existing library to handle time does not absolve you from having to think about it, nor does that library always get it right.
Gregorian calendar hacks Microsoft (Score:2)
Re: (Score:2)
Probably not. The folks who are still around from that era would probably be smart enough to have avoided this problem in the first place.
Re:Dumb people never learn (Score:5, Funny)
Re:Dumb people never learn (Score:4, Funny)
Hey! My MS4000 keyboard and MS mouse are working jut fine.
I see what you did there.
Re:What a shame (Score:5, Funny)
We still see this kind of XXXX coming up every leap year.
We're all adults (or close enough to it, anyway) here. I think we're all capable of seeing the word "shit" without our faces melting like that nazi who peeped in the ark.
My apologies to everyone who is now having their face melt off after reading that previous sentence.
Re: (Score:2)
Joel Spolsky has an article [joelonsoftware.com] about the 1900 Leap Year bug.