Bug Cloud Microsoft IT

Azure Failure Was a Leap Year Glitch 247

judgecorp writes "Microsoft's Windows Azure cloud service was down much of yesterday, and the cause was a leap year bug as the service failed to handle the 29th day of February. Faults propagated making this a severe outage for many customers, including the UK Government's recently launched G-cloud service."
This discussion has been archived. No new comments can be posted.

  • by elrous0 ( 869638 ) * on Thursday March 01, 2012 @10:30AM (#39208835)

    Seriously, if my American high school education taught me nothing else, it was that those things only come along like every 100 years or something.

  • by DownWithTheMan ( 797237 ) on Thursday March 01, 2012 @10:34AM (#39208927)
    Didn't this happen last leap year to the Zunes... oh yeah... [pcworld.com]
    • by g0bshiTe ( 596213 ) on Thursday March 01, 2012 @10:38AM (#39208985)
      You would think that they would have remembered, or some brilliant mind would have said "hey don't forget leap days", they should have asked the janitor. Those guys know everything.
    • by firex726 ( 1188453 ) on Thursday March 01, 2012 @10:41AM (#39209035)

      What is with MS and their apparent inability to cope with leap years?

      • I would like to know the same thing. This seems to be systemic. Is there something inherently confusing or flawed in the way Microsoft approaches elapsed-time calculations? Is it due to the internal representation of time that they use, or is there some reason that developers on their platform are doing calculations using date strings, or is it something else?

        • Either coincidence, or else they reused some old code from another department (As good programmers should - no point spending precious time reinventing the wheel) and thus inadvertently used the old bug too.
          • The only word to describe this is incompetence.
          • by Dhalka226 ( 559740 ) on Thursday March 01, 2012 @01:05PM (#39211473)

            I had a similar thought about code reuse, but an entirely different conclusion: I thought that they weren't re-using good code since the same problem has cropped up at least two times. That sounds more like a case of re-rolling things that definitely shouldn't be re-rolled (date/time handling) to me.

            In either event, they're not using particularly good practices. Either they are constantly reinventing the wheel and apparently in error-prone ways, or they are re-using code but paying no attention to keeping that external code up to date.

            The only other thing I can think of is that Azure is somehow so drastically different than anything else they have ever done that they had to do the code again from scratch -- which is probably a problem all by itself.

        • by Maow ( 620678 )

          I would like to know the same thing. This seems to be systemic. Is there something inherently confusing or flawed in the way Microsoft approaches elapsed-time calculations? Is it due to the internal representation of time that they use, or is there some reason that developers on their platform are doing calculations using date strings, or is it something else?

          I wonder if it is related to the fact that MS systems apparently store the system time in local time instead of UTC?

          At least, that's my understanding of how they store it, could be wrong.

        • by jc42 ( 318812 ) on Thursday March 01, 2012 @01:59PM (#39212359) Homepage Journal

          What is with MS and their apparent inability to cope with leap years?

          I would like to know the same thing. This seems to be systemic.

          Yeah; it's systemic. Or at least it used to be a few years back, and I wouldn't be surprised if they haven't fixed the basic problem yet. The problem is fairly simple: Windows' internal clock is in local time.

          To a programmer with experience writing date/time code, I've found that this is all you need to tell them. Any software whose internal clock is in local time will be buggy, and it will never be completely fixed. Attempts to fix bugs will merely introduce bugs elsewhere in the chains of date/time handling. The sensible solution is to adopt a "universal time" internally, and convert at the last stage when you present the date/time to a human user. Yes, you theoretically can work with local time internally, but (teams of) humans can't actually make this work in practice. The best they can do is make it work in the "normal" cases. Bug fixes then tend to just move the time bugs around to different places in the code. But it can be very difficult to get management to accept this and agree to UT-only internally.

          Java also used to specify local time internally (and may still do so, but I haven't used it in years). I worked on a number of projects where, after repeated date/time disasters at every switch to/from DST and every Feb 29, java was abandoned and everything was rewritten in a language (usually C++) whose libraries supported a UT timestamp and didn't have all those time bugs.

          Does anyone know if MS Windows has introduced a UT internal time yet? If not, then we can reliably predict that such bugs will continue to plague their users.
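For what it's worth, the "universal time internally, convert at the last stage" approach described above can be sketched in a few lines of Python. This is only an illustration, not how Windows or Azure handles time; `DISPLAY_TZ`, `add_hours` and `render` are hypothetical names, and the fixed UTC-5 offset is an assumption chosen for the example.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical display zone: a fixed UTC-5 offset, chosen only for illustration.
DISPLAY_TZ = timezone(timedelta(hours=-5))

def add_hours(utc_instant, hours):
    """Do the arithmetic on the UTC instant; no local-calendar rules involved."""
    return utc_instant + timedelta(hours=hours)

def render(utc_instant):
    """Convert to the display zone only at the very last step."""
    return utc_instant.astimezone(DISPLAY_TZ).strftime("%Y-%m-%d %H:%M")

stored = datetime(2012, 2, 29, 3, 30, tzinfo=timezone.utc)  # internal value, always UTC
later = add_hours(stored, 24)

print(render(stored))
print(render(later))
```

The stored values never change meaning when the display zone does; only `render` would need to know about local rules.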

          • by DamonHD ( 794830 )

            To the best of my knowledge, Java's System.currentTimeMillis() has always been UTC milliseconds since the UNIX epoch.

            That doesn't stop someone messing up and having the computer set up badly, e.g. correct apparent local time but wrong timezone, causing "UTC" time to be wrong.

            Rgds

            Damon

        • by ekimminau ( 775300 ) <eak@kimminau.org> on Thursday March 01, 2012 @02:06PM (#39212495) Homepage Journal

          According to Microsoft all time started on Jan 1, 0001.

          http://msdn.microsoft.com/en-us/library/system.datetime.ticks.aspx [microsoft.com]

          No fricking wonder the "system idle process" uses 19% of a cpu. The OS is counting to a billion every second.

          Ooops. But they still do lots of things in 32-bit land, too.
          http://msdn.microsoft.com/en-us/library/system.datetime.aspx [microsoft.com]

    • Yes and it will happen again. As far as I know MS never created a patch. They just told customers to wait out the bug. Looks like it will happen again this year.
      • I'm wondering if the bug started in some early version of ms-dos and then propagated to all their other projects ever since through code re-use...

        • by UnknowingFool ( 672806 ) on Thursday March 01, 2012 @11:27AM (#39209897)
          No it came from Freescale in a driver that Toshiba used. Not many know that the original Zune was a Toshiba Gigabeat with a new UI and outer shell.
          • by John Hasler ( 414242 ) on Thursday March 01, 2012 @11:36AM (#39210033) Homepage

            Well, we knew it was a Microsoft product so we knew they bought it from someone.

          • by tlhIngan ( 30335 ) <slashdot&worf,net> on Thursday March 01, 2012 @12:27PM (#39210859)

            No it came from Freescale in a driver that Toshiba used. Not many know that the original Zune was a Toshiba Gigabeat with a new UI and outer shell.

            Yeah, it was a really stupid bug, especially when you consider the OS provides a very useful set of APIs for dealing with it (basically convert a SYSTEMTIME (day/month/year/mm/hh/ss) into a FILETIME (64-bit unsigned int similar to time_t), do your math (the compiler will handle the 64-bit computations for you) and convert it back). Two OS calls.

            If you're having to do leap year calculations or even any sort of date calculations, stop. The OS or library will probably already have a set of functions for doing date calculations without you having to do it manually. Given how easy they are to screw up, far better to leave it to someone else.

            Hell, given Windows worked fine, I don't even want to know what Azure is doing - the fundamental OS and runtimes all handle leap year date calculations with aplomb. Heck, that might be some of the oldest code in the kernel these days because it was written a long time ago, works well and has been thoroughly debugged through the decades.
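The "convert to a linear value, do your math, convert back" pattern described above looks roughly like this in Python. To be clear, this is a stand-in using the standard library, not the Win32 SYSTEMTIME/FILETIME calls themselves; `add_days` is a hypothetical helper name.

```python
import calendar
import time

def add_days(year, month, day, days):
    """Calendar fields -> linear seconds -> arithmetic -> calendar fields."""
    # calendar.timegm treats the tuple as UTC and returns seconds since 1970.
    seconds = calendar.timegm((year, month, day, 0, 0, 0))
    seconds += days * 86400
    result = time.gmtime(seconds)
    return (result.tm_year, result.tm_mon, result.tm_mday)

print(add_days(2012, 2, 28, 1))  # leap year
print(add_days(2011, 2, 28, 1))  # common year
```

No leap-year test appears anywhere in the caller's code; the library conversion carries it.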

    • "If you're a Zune Pass subscriber, you may need to sync your device with your PC to refresh the rights to the subscription content you have downloaded to your device."

      That's the scariest sentence in that entire article.

    • Picture this: You're gearing up to create a killer playlist on your 30GB Zune for your annual New Year's bash.

      Great... now I'm going to have nightmares

  • 28 days (Score:5, Funny)

    by ichthus ( 72442 ) on Thursday March 01, 2012 @10:35AM (#39208937) Homepage
    Well, this is all because 28 days in February ought to be enough for everyone.
    • The people who pay employees by salary instead of hourly certainly think so.

      http://www.usnews.com/news/articles/2012/02/29/february-29-all-work-and-no-pay

      • by tuffy ( 10202 )

        In 2012 there are 261 weekdays. In 2011, there were 260. But in 2010 and 2009 there were also 261 weekdays. So not really.
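Those weekday counts are easy to check rather than take on faith. A small Python sketch (the function name `weekdays_in_year` is made up for this example):

```python
from datetime import date, timedelta

def weekdays_in_year(year):
    """Count Monday-to-Friday days in the given calendar year."""
    d = date(year, 1, 1)
    count = 0
    while d.year == year:
        if d.weekday() < 5:  # 0 = Monday ... 4 = Friday
            count += 1
        d += timedelta(days=1)
    return count

for y in (2009, 2010, 2011, 2012):
    print(y, weekdays_in_year(y))
```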

        • by geekoid ( 135745 )

          yeah... because salary workers never work on weekends.

          Funny, as an hourly person, I never work weekends..or more than 40 a week.

      • Article makes an assumption that is usually not true--that salaried workers get paid on a month-related schedule, which always causes all sorts of problems, not just leap-year related. I'm salaried and get paid every two weeks; leap year means squat as far as getting paid is concerned.

        • When I was on salary (back in ancient times) the deal was that I got $10,000/year (I said ancient times) paid out in 12 equal installments. Yes, the pay periods varied in length. So what?

    • Re:28 days (Score:5, Funny)

      by davidbrit2 ( 775091 ) on Thursday March 01, 2012 @11:16AM (#39209687) Homepage
      I always remember to put DEVICEHIGH=FEB.SYS into my config.sys every four years.
  • by madsci1016 ( 1111233 ) on Thursday March 01, 2012 @10:37AM (#39208969) Homepage
    Anyone remember trying to turn on their Zune 3.5 years ago? [slashdot.org] That didn't work so well either.
    • by Kozz ( 7764 ) on Thursday March 01, 2012 @11:14AM (#39209635)

      Now, I'm not necessarily a Microsoft apologist, but I have to point out that it wasn't so long ago that other things near and dear to us geeks were experiencing similar problems.

      I was trying to run some ant scripts yesterday that interact with an FTP server to delete some files. Those damned files wouldn't get deleted. They weren't even returned from a listing command. As it turns out, I was using a particularly old version of Apache Commons-Net library (this jar file was from 2005) which had a leap-year bug. It simply would not show me files with modification dates of 2/29. I was looking at the FTP server configuration, logging in with other clients, moving and renaming files, and all about ready to break out Wireshark... and then it occurred to me that it was leap day. Hoo-fucking-ray. "touch"ed the file, and sure enough, it was suddenly available. Those are a few hours of my life I'll never get back.

      • Re: (Score:3, Funny)

        by egamma ( 572162 )

        Now, I'm not necessarily a Microsoft apologist, but I have to point out that it wasn't so long ago that other things near and dear to us geeks were experiencing similar problems.

        I was trying to run some ant scripts yesterday that interact with an FTP server to delete some files. Those damned files wouldn't get deleted. They weren't even returned from a listing command. As it turns out, I was using a particularly old version of Apache Commons-Net library (this jar file was from 2005) which had a leap-year bug. It simply would not show me files with modification dates of 2/29. I was looking at the FTP server configuration, logging in with other clients, moving and renaming files, and all about ready to break out Wireshark... and then it occurred to me that it was leap day. Hoo-fucking-ray. "touch"ed the file, and sure enough, it was suddenly available. Those are a few hours of my life I'll never get back.

        Your post is not anti-Microsoft, so you must be a shill.

        • > Your post is not anti-Microsoft, so you must be a shill.

          It's anti-Java and therefore anti-Oracle. That makes it ok.

      • As it turns out, I was using a particularly old version of Apache Commons-Net library (this jar file was from 2005) which had a leap-year bug. It simply would not show me files with modification dates of 2/29.

        If this is the bug you're talking about [apache.org], it appears a bug report was filed, discussed, and a temporary workaround was offered (perhaps more than one). Although free software has bugs just like proprietary software, the way they are reported and handled is night and day.

        • by Kozz ( 7764 )

          If this is the bug you're talking about [apache.org], it appears a bug report was filed, discussed, and a temporary workaround was offered (perhaps more than one). Although free software has bugs just like proprietary software, the way they are reported and handled is night and day.

          That appears to be the very same bug, yes. And I'm not disputing the handling of the bug, but merely pointing out that even with "many eyes", this bug existed not so very long ago. Open Source is not immune to the same kinds of problems, though I grant you I probably should have been checking for the latest compatible libraries.

          • Open Source is not immune to the same kinds of problems, though I grant you I probably should have been checking for the latest compatible libraries.

            Whenever I run into a bug like this where something just seems to go *poof*, that's the first thing I check. Learned my lesson a few years ago with an old version of a library and a cranky sysadmin who would just not believe it was the source of the problem. I wrote not one, but *three* workarounds, none of which he was willing to use, then he just updated the library with the latest version and everything worked.

      • by Nikker ( 749551 )
        So you're saying a self-rolled script working off a binary from almost 10 years ago should apply to a multi-billion dollar a year company's server farm offering connections to the world's biggest corporations and governments?

        Do I have a job for you!
  • This is probably part of the reason why the cloud really hasn't taken off in the corporate sector, and it's no wonder why.
  • by Anonymous Coward on Thursday March 01, 2012 @10:37AM (#39208977)

    Microsoft has told the press that they don't expect the Azure cloud service to fail again for years. In an unrelated schedule change, a down-for-maintenance slot was scheduled 4 years in advance.

  • by Anonymous Coward on Thursday March 01, 2012 @10:39AM (#39208997)

    It's sold as Office 365 not Office 366

  • Prepared for future (Score:4, Informative)

    by gmuslera ( 3436 ) * on Thursday March 01, 2012 @10:39AM (#39209011) Homepage Journal
    If they can't handle an exception that has been around for 2k years, what about newer exceptions? Would be interesting to see what could happen next June 30.
  • Correct me if I'm wrong, but they could have avoided this by relying on the UNIX epoch. Same with Y2K. But beware Y2K38 you 32-bit users!
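For anyone wondering where the 2038 limit comes from, here is a quick Python check, assuming a signed 32-bit count of seconds since the UNIX epoch (`INT32_MAX` and `last_moment` are just names for this example):

```python
from datetime import datetime, timezone

INT32_MAX = 2**31 - 1  # largest value a signed 32-bit seconds counter can hold

# The last instant such a counter can represent, expressed in UTC.
last_moment = datetime.fromtimestamp(INT32_MAX, tz=timezone.utc)
print(last_moment.isoformat())
```

One second past that instant, a signed 32-bit counter overflows.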
    • Not necessarily. It probably wouldn't have been a complete failure, but even software using epoch time internally can have problems.

      Remember, you still have to put it into Regular Time to display to users. And take input in from users - if someone schedules a task for 2011-02-29, it should fail, but if it's scheduled for 2012-02-29 it should be allowed. And maybe there's business logic to figure out whether it's a weekend and such, which could easily be thrown off during the UNIX->Gregorian conversion...
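The scheduling case above can be sketched in Python by letting the standard library decide whether the day exists. `parse_schedule_date` and `is_weekend` are hypothetical helpers for illustration, not part of any scheduler's real API.

```python
from datetime import date

def parse_schedule_date(text):
    """Return a date for 'YYYY-MM-DD' input, or None if no such day exists."""
    try:
        return date.fromisoformat(text)
    except ValueError:
        return None

def is_weekend(d):
    """Saturday is weekday 5 and Sunday is weekday 6."""
    return d.weekday() >= 5

print(parse_schedule_date("2011-02-29"))  # rejected: 2011 has no Feb 29
print(parse_schedule_date("2012-02-29"))  # accepted
```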

  • by scorp1us ( 235526 ) on Thursday March 01, 2012 @10:48AM (#39209143) Journal

    It seems that all of MS's copied products - hotmail, Azure, Zune are all done with a "me too" attitude of just having something so that they don't get left behind. They don't really try to make these "me too" products as industry leaders. But here's the catch. I know plenty of IT people who will always choose MS's offering because, as I was told "you don't get in trouble for choosing MS". And that knowledge seems to be built into MS's offerings.

    • by geekoid ( 135745 )

      You 'don't get fired for buying IBM.'
      Then 'You don't get fired for buying MS'
      Next will be Google, probably.

      Never Apple because they are a high end system in a niche market. Yea, people willing to pay 1200+ for a computer is a high end niche market.

      • Never Apple because they are a high end system in a niche market.

        Yeah, I'm pretty sure joining that race to the bottom is the way to go.

  • If MS starts making excuses for this mini fiasco, they will only manage to make Amazon, Google and Apple look like fucking geniuses

  • by trongey ( 21550 ) on Thursday March 01, 2012 @10:53AM (#39209235) Homepage

    It's not Microsoft's fault; they're a publicly traded company so they can't think about multi-year events. They're prohibited from considering anything that is beyond the next fiscal quarter.

    • They pay no dividend for that extra day. That they don't know about. And didn't learn about last leap year.

  • Simply inexcusable.

    • by JamesP ( 688957 )

      True

      Apparently MS only hires people with no concept of future.

      See Zune event, see this. Absolute disregard of date concepts (and testing)

      Now, Linux development worries about this THOROUGHLY (I mean, kernel and main libs, of course a sw developer can get this wrong on Linux as well)

      No counter wrap-around fsck-ups, date exceptions, etc (people are watching this)

      Remember the Windows bug where it would crash after 15 days or so?

  • by Bicx ( 1042846 ) on Thursday March 01, 2012 @10:54AM (#39209251)
    This points out a serious flaw in the whole idea of cloud reliability by redundancy. You may have a million servers running across multiple countries, but if the distributed software for each virtual server has a bug, every server across the globe is affected. That's a single point of failure.
    • by Bert64 ( 520050 ) <.moc.eeznerif.todhsals. .ta. .treb.> on Thursday March 01, 2012 @11:06AM (#39209487) Homepage

      That's a flaw in the idea of a monoculture, true redundancy has different software implementing the same basic standards...
      Like how the Internet is built from routers made by different vendors, cisco, juniper, software based linux/bsd devices etc. When new DoS vulnerabilities are found in one vendor's kit it doesn't take down the whole internet, because other vendors are immune.

      • Unless there is a bug in the "same basic standards" that the heterogeneous systems use...

        • Unless there is a bug in the "same basic standards" that the heterogeneous systems use...

          Ah... like the WMF exploit... I was explaining to someone at the time that the problem with the exploit was that it was functioning exactly as designed and intended. >_

    • That's why I'm about to launch my new "Ionosphere" product. It is a cloud that runs on top of clouds. So when Azure fails Amazon picks up the slack and so on. Trademarked/copyrighted/patent pending suckas.

      -d

  • by msobkow ( 48369 ) on Thursday March 01, 2012 @10:58AM (#39209325) Homepage Journal

    Given how many DECADES leap year calculations have had to be done and how many years it's been since we fixed the Y2K issues (at great expense, I might add), it is absolutely UNACCEPTABLE for someone to blame a leap year calculation for down time.

    The DIRECTOR of the service division at Microsoft should be FIRED for this failure.

    Expect lawsuits from customers, Microsoft. Because this was a problem you KNEW about and should have written code to deal with.

    What a pathetic excuse for planning and testing on Microsoft's part.

    • Re: (Score:3, Funny)

      by Anonymous Coward

      Shouldn't 'pathetic' be in uppercase?

    • > Expect lawsuits from customers...

      I'm sure it's covered by the TOS. That's an area where they DON'T make mistakes.

    • Maybe everyone should get together and file a class action suit - the lawyers will get millions, the users will get 30 seconds of their life back.

    • Yep, as soon as they fire the guy at Apple who was responsible for the iPhone daylight saving bug in November 2010 [pcworld.com]. And again [pcworld.com] in March 2011.

      Also Flickr [dpreview.com], with their own leap year bug. And Sony [dpreview.com].

      Not saying this isn't a black eye for MS - and yes, testing for leap year should be thought out ahead of time - but in fairness they're not the only people to have ever been caught out by something like this.

  • by Anonymous Coward on Thursday March 01, 2012 @11:04AM (#39209447)

    ...they just had the most publicly catastrophic failure. I just noticed that all of the Google Chat messages I received yesterday were sent to me at various times on December 31, 1969.

    And it also seems that I didn't even receive any of them until today, March 1, implying that they were incapable of even sending them yesterday.
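A date of December 31, 1969 usually means a timestamp of zero (or a missing one) shown in a timezone west of Greenwich. Whether that is what happened here is a guess; the sketch below only shows the arithmetic, with a fixed UTC-5 offset as an assumed viewer timezone.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical viewer offset (UTC-5), chosen only for illustration.
viewer_tz = timezone(timedelta(hours=-5))

# UNIX timestamp 0 is 1970-01-01 00:00 UTC, which is still 1969 at UTC-5.
zero = datetime.fromtimestamp(0, tz=viewer_tz)
print(zero.isoformat())
```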

  • by Bruce Perens ( 3872 ) <bruce@perens.com> on Thursday March 01, 2012 @11:10AM (#39209561) Homepage Journal
    30 years ago, Arthur David Olson started engineering a solution to this problem that persists to this day, and which he supported personally for all but the last few months. The systems I have that run his software have never even burped through legislative changes of the calendar, leap-seconds, and the Century leap-year day, which is a separate cycle from the 4-year one.
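For reference, the century cycle mentioned above is part of the Gregorian leap-year rule, which fits in one line. This is a plain Python sketch of that rule, not code from Olson's software; `is_leap` is a name chosen for this example.

```python
import calendar

def is_leap(year):
    """Gregorian rule: every 4th year, except centuries, except every 400th."""
    return year % 4 == 0 and (year % 100 != 0 or year % 400 == 0)

for y in (1900, 2000, 2012, 2100):
    # Compare against the standard library's own implementation.
    print(y, is_leap(y), calendar.isleap(y))
```

In practice calling `calendar.isleap` directly is the safer habit; the hand-written version is only there to show the rule.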
  • The story yesterday said that they were having a problem with certificate validation. The routine they were using to validate certificate expiration must not have been able to handle the leap year. I wonder what non-standard API they were using to process the expiration date. That reminds me of another article [thedailywtf.com] that I read yesterday.
  • Bet it won't happen again next year!

  • by roc97007 ( 608802 ) on Thursday March 01, 2012 @11:19AM (#39209753) Journal

    ...how Windows for cars and Windows for Warships fared during the leap year? Forget Y2K -- the apocalypse comes every four years...

  • While we jab at MS for the Zune fiasco, in their defense, they didn't write the subroutine that caused the problem and the most that happened was some of their customers could not play music on their Zunes for a while. Not a mission critical situation. But what the hell kind of calendar system is MS using in their mission critical software that cannot deal with leap years, which come every 4 years?
  • by Howard Beale ( 92386 ) on Thursday March 01, 2012 @11:26AM (#39209873)
    The following are leap years: 2016 2020 2024 2028 2032 2036 2040 You have been warned. After that, I'll probably be dead, so I won't care (unless Microsoft starts making pacemakers, which may end it for me...).
  • by tillerman35 ( 763054 ) on Thursday March 01, 2012 @11:29AM (#39209943)

    Some of the common leap year bugs that I've seen over the years:

    1. A matrix with the number of days per month:
    e.g. smallint dayspermonth[12]={31,28,31,30,31,30,31,31,30,31,30,31};
    Indexing into the matrix for February (index 1) ignores leap years.

    2. A matrix with 365 elements to represent a year's worth of something:
    e.g. smallint hightemps[365];
    This usually doesn't fail until Dec 31 of a leap year, when hightemps[mydate.dayofyear()-1] points to a non-existent element.
    Of course, if dayofyear is calculated using the matrix in the prior bug, it will fail invisibly since that will be incorrect
    as well.

    3. Quick-n-dirty subtract one year math:
    e.g. Convert date to char in YYYYDDMM format, convert char to int, subtract 10000, convert back to a char and then date.
    Why people do this when dateadd(year,mydate,-1) is that easy, I have no clue. But it breaks horridly when
    you use it to determine "one year ago today" from Feb 29.
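The three bugs above each have a short library-based alternative. A minimal Python sketch follows; the helper names are hypothetical, and clamping Feb 29 to Feb 28 in `one_year_before` is an assumed policy (some applications may prefer Mar 1 instead).

```python
import calendar
from datetime import date

def days_in_month(year, month):
    # monthrange returns (weekday of the 1st, number of days in the month)
    return calendar.monthrange(year, month)[1]

def days_in_year(year):
    return 366 if calendar.isleap(year) else 365

def one_year_before(d):
    """Same month and day a year earlier, clamping Feb 29 to Feb 28."""
    day = min(d.day, days_in_month(d.year - 1, d.month))
    return d.replace(year=d.year - 1, day=day)

# Size the per-day table from the year instead of hard-coding 365.
high_temps = [None] * days_in_year(2012)
dec31 = date(2012, 12, 31)
high_temps[dec31.timetuple().tm_yday - 1] = 40  # last slot, still in range

print(days_in_month(2012, 2), days_in_month(2011, 2))
print(one_year_before(date(2012, 2, 29)))
```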

    • by tqk ( 413719 )

      ... Convert date to char in YYYYDDMM format ...

      A few years ago, one of the fixes I made to a front end security tool was to change all the YYYYDDMMs to YYYYMMDDs, to match up with all the other YYYYMMDDs elsewhere in the code.

      It's astonishing to me that having gone through Y2k, some people are *still* failing miserably at handling dates. I mean, it's not like YYYYMMDD is an ISO Standard or anything, right?

  • by Greyfox ( 87712 ) on Thursday March 01, 2012 @11:35AM (#39210027) Homepage Journal
    Dealing with time is hard, but it's been amusing to watch them experience problems solved by UNIX decades earlier. Daylight savings time was a constant problem for them in the early days, though they seem to have mostly got that ironed out. Every so often they seem to have a regression for a piece of new hardware. Maybe they'll eventually get it right.

    Funnily enough, I used to work at IBM doing OS/2 tech support. OS/2 and Windows NT share a common heritage, so a lot of the behind-the-scenes problems I witnessed in OS/2 were (And sometimes still are) problems with Windows. I'm not sure if this is one of them, but I got a call once from a guy who was trying to use his OS/2 system to track satellites. The problem was, the OS/2 timer API specified that you could set milliseconds but it didn't seem to work. I tracked it down to a timing driver which tracked two separate interrupts. The first interrupt happened every few milliseconds and would update the clock millis when that happened. However, if the system was busy it was possible to not handle that interrupt. There was also a system periodic interrupt every 1 second. When that occurred, the system hard-reset the milli time and incremented the seconds. So you could set the millis, but the clock would become inaccurate 1 second later. Just one example of how time has been a thorn in my side for my entire career. I wrote an APAR up on it which was promptly closed "Working as Designed." Dunno if he ever got it fixed...

  • "Oh Hai, yeah we just carried the one! It's all fixed now. Nothing to worry about, back to work"

  • As I keep telling my fellow software developers, time is one of those things in software that tend to go wrong. Few developers give it the attention it deserves. Between different formats, timezones, changing timezones (including DST), leap seconds, and limits on what can be represented, there is plenty of opportunity for errors. And contrary to what you might hope, using an existing library to handle time does not absolve you from having to think about it, nor does that library always get it right.

  • This just in, the Vatican is raided by the FBI in search of XVI century documents and the evil mastermind of today's attack: Pope Gregory. Joining the FBI are Seals Team Six and Chuck Norris, who brought an abacus. The team sprays Gregory's coffin with bullets. Obama announces the permanent change to a non-leap calendar. The world is safe. Cue in Aerosmith.
