The BBC Looks At Rollover Bugs, Past and Approaching
New submitter Merovech points out an article at the BBC which makes a good followup to the recent news (mentioned within) about a bug in Boeing's new 787. The piece explores various ways that rollover bugs in software have led to failures -- some of them truly disastrous, others just annoying. The 2038 bug is sure to bite some people; hopefully it will be even less of an issue than the Year 2000 rollover. From the article:
"It was in 1999 that I first wrote about this," comments [programmer William] Porquet. "I acquired the domain name 2038.org and at first it was very tongue-in-cheek. It was almost a piece of satire, a kind of an in-joke with a lot of computer boffins who say, 'oh yes we'll fix that in 2037.' But then I realised there are actually some issues with this."
Ask Mel (Score:5, Funny)
It's not a rollover bug; It's a rollover feature [catb.org]!
Re: (Score:2)
I already posted, so I can't mod... one of the coolest geek stories around.
You can't win... (Score:5, Insightful)
If you reuse code, you get rollover bugs.
If you start over from scratch you get brand new bugs.
Reusing the code, you have a lot of the issues from the past already fixed, so you are not reintroducing bugs that you had in the past.
Making new code, you can modernise the code set, so you don't run into particularly troubled code, and the result is easier to follow.
Programmers are human beings, they make mistakes, they can't give 110% every day. Even the best of them will often have a stupid bug that they can't believe they let slip.
Re: (Score:1)
I personally have encountered rollover bugs not just in operating systems... but in machine firmware. One ultra-expensive piece of equipment had a firmware issue that would cause big issues if the machine stayed up for more than 18-24 months, so every so often, after a major patch for stuff was released, part of the maintenance window was physically shutting off an entire rack of stuff, depowering it (as in physically pulling the plugs), waiting 10-15 minutes for the capacitors to drain, then plugging it b
Re: (Score:2)
At 248 days on-line, the Boeing Dreamliner has a similar problem.
http://www.wsj.com/articles/fa... [wsj.com]
Not a problem I want to have at 40,000 feet (about 12 km).
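For anyone wondering where 248 days comes from, here is a rough sketch. It assumes the counter is a signed 32-bit integer ticking 100 times per second; the 100 ticks/second figure is mentioned elsewhere in this thread, the signed 32-bit width is my assumption, and the function name is made up for illustration.

```python
SECONDS_PER_DAY = 86_400


def days_until_overflow(bits: int, ticks_per_second: int, signed: bool) -> float:
    """Days of continuous uptime before a tick counter of this width overflows."""
    # A signed counter loses one bit to the sign, halving its positive range.
    usable_bits = bits - 1 if signed else bits
    max_ticks = 2 ** usable_bits
    return max_ticks / ticks_per_second / SECONDS_PER_DAY


# Assumed 787-style counter: signed 32-bit, 100 ticks per second.
print(f"{days_until_overflow(32, 100, signed=True):.2f} days")  # 248.55 days
```

Under those assumptions the counter runs out after roughly 248.55 days, which lines up with the figure in the reports.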
Re: (Score:2)
If you're a programmer and you believe you can give 110% even for one day, you have other problems.
Re: (Score:2)
Re: (Score:2)
If you normally work for an hour a day that means you're only giving ~4.1667% to begin with.
Re: (Score:2)
Sure you can give 110%, just like an engine can deliver 110% of sustained power (which is how marine and aviation engines are rated, not instantaneous power). It just can't do that indefinitely, and it decreases reliability. Either service must be more frequent, or it's going to fail sooner, or both. Nonetheless, many marine engines are held in excess of 100% for hours or days on end.
Similarly, if you take work output divided by calendar time, you can deliver 110%... for a while. If you try to sustain this,
OpenBSD is 2038-ready (Score:4, Informative)
Re: (Score:2)
So Is Mac OS X. (Score:5, Informative)
I converted time_t to 64 bits on 64 bit systems (which include the most recent iPhones) as part of the changes for 64 bit binary support on the G5 when I wrote the 64 bit binary loader support into exec/fork/spawn, and again as part of UNIX Conformance. It's basically been fixed since Tiger.
Re: (Score:2)
OpenBSD now has a 64-bit time_t on 32-bit systems. time_t was always 64-bit on 64-bit systems on OpenBSD.
IMO this is one of the things that is being mostly ignored. Back around 2000, many people were saying things along the lines of, "we'll all be on 64-bit or larger systems by 2038, so it will solve itself". Many more people have ignorantly joined that line of thought, since almost all mainstream CPUs are 64-bit now. That said, there are still a large number of 32-bit CPUs being made (like almost every Android device CPU there is, and most Apple iPhone/iPad things, and many of the Chromebooks out there):
All ARM
Volunteers (Score:3, Insightful)
Re: (Score:3)
Re: (Score:1)
Specific installations that don't upgrade might have some problems, but most of those systems won't last another couple of decades and will require replacement sooner. Specific in-house software that was compiled as 32-bit and the source lost might also have problems.
And also binary protocols where a timestamp is sent using 32 bits on the wire. NTP for example. I'm sure there are dozens of others. Public ones should be fixed by 2038. In-house/proprietary ones could conceivably get missed if they're not well maintained or no-one knows the details of the protocol well enough to realise it's affected.
NTP is actually a little odd in that it uses 32 unsigned bits of its 64-bit timestamp as seconds since 1st January 1900, so it's not a Unix epoch. The result is NTP is affected 2 years
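A quick back-of-the-envelope check of the NTP claim, as a sketch only: it assumes a plain 32-bit unsigned seconds field counted from 1 January 1900 with no era handling, and it ignores leap seconds (as Python's datetime does). The variable names are mine.

```python
from datetime import datetime, timedelta, timezone

NTP_EPOCH = datetime(1900, 1, 1, tzinfo=timezone.utc)
UNIX_EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

# Seconds between the two epochs (70 years, 17 of them leap years).
EPOCH_OFFSET = int((UNIX_EPOCH - NTP_EPOCH).total_seconds())

# The moment a 32-bit unsigned NTP seconds field wraps back to zero.
ntp_rollover = NTP_EPOCH + timedelta(seconds=2**32)

# The same instant as a Unix timestamp, compared with the signed 32-bit limit.
ntp_rollover_unix = 2**32 - EPOCH_OFFSET
gap_days = (2**31 - ntp_rollover_unix) / 86_400

print(EPOCH_OFFSET)                 # 2208988800
print(ntp_rollover.isoformat())     # 2036-02-07T06:28:16+00:00
print(f"{gap_days:.0f} days before the 2038 limit")  # 712 days before the 2038 limit
```

So with those assumptions the NTP field wraps in February 2036, a bit under two years ahead of the 32-bit time_t limit.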
Re:Volunteers (Score:4, Insightful)
Given that so much of the non-GNU/Linux code is written by paid programmers I wonder who it is exactly that is going to fix all the code. I mean back when it was written computer programming was much less of a gold rush. Nowadays everyone is competing for jobs that pay $120,000. Who is willing to pay programmers to go through all of the old code to fix it?
It's really not an issue. It's already fixed in OpenBSD. Certainly there's some user space code that also counts seconds since 1970 but if folks would simply start now there's no future fix necessary. The set of code written today which will be in use in 2038 will be vanishingly small. The remaining folks will pay some gray hair to knock it into shape. Missed code will make itself apparent sometime that Tuesday morning.
Re: (Score:2)
The higher the pay, typically, the more engaging the work. But there are far more programmers working on dull stuff for 60k/year that could probably use the hobby lest their brains turn to mush.
We encountered a similar bug (Score:2)
Re: (Score:2)
Why didn't/couldn't you use GMT?
Re: (Score:2)
Why didn't/couldn't you use GMT?
Good question. We wondered that ourselves, why the idiots that programmed it used local times. Since they probably never operated a piece of equipment in their life they probably assumed we'd want local time but never asked; which illustrates the classic user / developer disconnect. Years later, while on a control room design project, I had to tell developers that the all-digital panel design they were so proud of was interesting, cool, futuristic and totally useless for actually operating a plant. As a result
Re: (Score:2)
Cobol (Score:2)
There is no chance whatever of the code being replaced if it's working now, because no one will sign off a replacement if it still works.
Re: (Score:2)
I know this goes against the myth of untouchable Cobol code hidden away that no one dares even look at - but bit by bit it is being replaced (by C++, Java, C#, whatever). Or at least in the companies I've worked in it was. One small subsection at a time, with plenty of testing and a proper rollback plan.
Assuming "american" programmers (Score:2)
Yes, the average age of an American or Russian COBOL programmer will be 80 or over by 2038... However, you're discounting the Filipinos...
Most of those kids are in their late 20's at most, so they'll be around in 2038. Assuming our Mainframe isn't phased out due to budget cuts, I suspect that the code written in 1980 will still be maintained and running well past 2038.
Why rollover? (Score:3)
Isn't mouseover the modern term?
Re: (Score:2)
Can you even buy mice with balls anymore?
Windows one is my fave (Score:3, Informative)
There was a counter in Windows that rolled over after 28 days I think (like the 787 bug, but 1000 ticks/second, not 100).
Even Microsoft knew that no Windows box could stay up that long.
(And before you mod me as a troll, think about it and know that MS could have made a bigger counter, but didn't feel the need to)
Re:Windows one is my fave (Score:4, Informative)
The version of Windows was Windows 95, and the number of days was 49.7.
https://support.microsoft.com/... [microsoft.com]
Re: (Score:1)
It was Windows 95 and 98, and the rollover happened at 49.7 days.
And yes, you are a troll because it's quite easily explained as a garden variety mistake due to careless programming. An unsigned 32 bit integer can hold up to 4 billion. 4 billion milliseconds is about 49.7 days. 4 billion sounds "big enough"-- but it isn't when we're talking milliseconds. And clearly, a Windows box COULD stay up that long, or else the bug would never have been discovered.
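The arithmetic, plus the usual way to stay out of trouble, sketched below. This simulates a generic 32-bit millisecond tick counter by masking; it is not Windows code, and the helper names are invented for the example. The point is that taking the difference of two ticks modulo 2^32 still gives the right interval across one wrap, while comparing raw tick values does not.

```python
MASK32 = 0xFFFFFFFF          # keeps only the low 32 bits
MS_PER_DAY = 86_400_000

print(f"{2**32 / MS_PER_DAY:.2f} days")  # 49.71 days


def tick_ms(uptime_ms: int) -> int:
    """Simulated 32-bit millisecond tick counter that wraps to zero."""
    return uptime_ms & MASK32


def elapsed_ms(start_tick: int, now_tick: int) -> int:
    """Interval between two ticks, correct across a single wraparound."""
    return (now_tick - start_tick) & MASK32


start = tick_ms(2**32 - 500)   # 500 ms before the counter wraps
now = tick_ms(2**32 + 1500)    # 1500 ms after it wrapped

print(now - start)             # naive difference goes hugely negative
print(elapsed_ms(start, now))  # 2000
```

This only works for intervals shorter than one full wrap, so anything that needs to measure longer spans still wants a wider counter.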
Re: (Score:1)
2038 is working itself out already (Score:3)
Re:2038 is working itself out already (Score:4, Insightful)
In the business I work "profibus" is considered a "new" technology. The standard was published in 1989.
We still run a token ring coax network for most critical systems on a significant part of the oil rigs in the North Sea and on onshore installations supporting them.
Some of the controllers are 20 years old and just milling along happily. We did a replacement of NVRAM recently and that is all the service the modules need.
I fully expect this crud to still be in use in 20 years. Conservative bastards.
Re:2038 is working itself out already (Score:5, Insightful)
If the hardware is still fully operational after 20 years in a hostile environment like an oil rig I'd say it's anything but "crud". It was probably some of the best kit on the market.
This might come as a shock but a lot of businesses want kit that Just Works reliably 24/7, not the latest trendy junk that would impress a Hipster cycling past on his fixie bike but lasts about 5 minutes in the real world.
Re:2038 is working itself out already (Score:5, Interesting)
Oh it is good gear, but the list of 'bugs' and 'errata' on the gear is growing longer and longer for every month it stays in service. Spare parts are almost impossible to come by, and even the toolchain needed to update the programs is old enough to require special dedicated workstations.
It is not a matter of 'working' it is a matter of 'will work in the future'. Right now all the gear has reached "end of life" and spare parts are very close to being "ebay if you're lucky" in terms of procurement. Trying to get the customer to upgrade BEFORE we're already screwed and have to 'rush' an upgrade is the game we're in now.
Doing a 3 year project in 6 months (while in some cases doable..) leads to badly rushed design and future redesigns. We've seen this over and over in the past 10 years.
An example is that the new hardware has built in EX barriers on each channel, the termination boards are much better and a variety of other improvements. This translates into -4- massive cabinets being reduced to one. Real-estate offshore is hugely expensive and this would save staggering amounts of money compared to expanding equipment rooms... but they want the stuff they're used to, not the stuff that is current.
The hilarity of the whole thing is that the 'current' stuff is now installed all over the rig where old hardware is not available so now we have both systems running in parallel with a ton of 'interfacing' and single points of failure introduced as a result.
It can drive an engineer mad.
Re: (Score:2)
Oh.. and the environment is not very hostile. Everything is fully battery backed, fully environmentally shielded and there are virtually no vibrations reaching the room.
Hell, after 20 years in operation the room hardly has dust anywhere. The controllers look brand new when inspected.
I love working with the system as is, but trying to shoe-horn the new system requirements into the existing hardware is tricky at best. We're running all our data over a 2 Mbit token ring network.
Re: (Score:2)
Yeah, but it's now unsupported kit and who knows if there are rollover issues? It already ran 20 years, so it's conceivable it will run another 20+ years and hit the 2038 bug, then what? And catching this bug is a lot more subtle than the y2k bug.
We've already run into rollover issues - on an old processor board that people are
Re: (Score:2)
This could be a problem when running simulation software. You set the date to 2040 and bam your VM crashes.
Re: (Score:2)
Ah, but signed 32-bit dates have the problem. A quick change to unsigned 32-bit fields extends this from 2038 to 2106.
Of course, I can hear the screaming even from here, everyone is mandated to use 64 bits! Except that this is not a practical thing in many contexts, when you're on a computer with not enough speed or RAM. Or when the date doesn't really matter as it's used in a safer context. Or just admit up front that the system is not POSIX (which they very often are not on small embedded machines), and
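To put dates on the signed versus unsigned trade-off, a small sketch. It assumes a seconds-since-1970 counter and ignores leap seconds; the function names are illustrative, not from any particular library. Note the catch with the unsigned trick: the field can no longer represent anything before 1970.

```python
from datetime import datetime, timedelta, timezone

UNIX_EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def last_representable(bits: int, signed: bool) -> datetime:
    """Latest instant a seconds-since-1970 field of this width can hold."""
    max_value = 2 ** (bits - 1) - 1 if signed else 2 ** bits - 1
    return UNIX_EPOCH + timedelta(seconds=max_value)


def wrap_signed32(seconds: int) -> int:
    """What a signed 32-bit field ends up holding after two's-complement wrap."""
    return (seconds + 2**31) % 2**32 - 2**31


print(last_representable(32, signed=True))    # 2038-01-19 03:14:07+00:00
print(last_representable(32, signed=False))   # 2106-02-07 06:28:15+00:00

# One second past the signed limit lands back in 1901.
print(UNIX_EPOCH + timedelta(seconds=wrap_signed32(2**31)))  # 1901-12-13 20:45:52+00:00
```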
Have fun with that... (Score:1)
...because I'm going to be in a fucking jar on a shelf in 2038.
Re: (Score:2)
Y2K was -not- a small issue (Score:5, Insightful)
It annoys me to see Y2K trotted out time and time again as a non-event. It was a very big event, and for the most part it was very successfully handled.
Re: (Score:2)
Had we left it all to 1st Jan 2000
I burned a lot of midnight oil during the month of January, 19100.
A Trivial Issue (Score:1)
Are there programmers who, in their cleverness, have used primitive code that still relies on the older base date without reference to the underlying O.S.? Sure. But, change the base date soon, and all their bugs will appear LONG
Re: (Score:1)
You're the kind of guy who knows just enough to be incredibly dangerous.
The epoch is baked into so many algorithms that only a complete idiot would consider changing it. It's also standardized by POSIX, which is why so many algorithms rely on it for calendar manipulation. In POSIX each day is precisely 86400 "seconds" (regardless of leap second), which makes calendar computation easy, without resort to a complex library.
The proper thing to do is simply change the type of time_t to 64-bits, even on 32-bit sy
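To illustrate why the fixed 86400-second day makes calendar math easy, here is a sketch that turns a Unix timestamp into a UTC date and time using only integer arithmetic. The date part uses a well-known days-to-civil-date algorithm published by Howard Hinnant; it is my choice for the example, not something POSIX specifies, and the function names are made up.

```python
SECONDS_PER_DAY = 86_400


def civil_from_days(days: int) -> tuple[int, int, int]:
    """Convert days since 1970-01-01 to (year, month, day), proleptic Gregorian."""
    z = days + 719_468                      # shift epoch to 0000-03-01
    era = z // 146_097                      # 400-year cycles
    doe = z - era * 146_097                 # day of era, 0..146096
    yoe = (doe - doe // 1460 + doe // 36_524 - doe // 146_096) // 365
    doy = doe - (365 * yoe + yoe // 4 - yoe // 100)   # day of March-based year
    mp = (5 * doy + 2) // 153               # March-based month, 0..11
    day = doy - (153 * mp + 2) // 5 + 1
    month = mp + 3 if mp < 10 else mp - 9
    year = yoe + era * 400 + (1 if month <= 2 else 0)
    return year, month, day


def unix_to_civil(t: int) -> tuple[int, int, int, int, int, int]:
    """Split a POSIX timestamp into (year, month, day, hour, minute, second) UTC."""
    days, rem = divmod(t, SECONDS_PER_DAY)  # every POSIX day is exactly 86400 s
    hour, rem = divmod(rem, 3600)
    minute, second = divmod(rem, 60)
    return (*civil_from_days(days), hour, minute, second)


print(unix_to_civil(0))            # (1970, 1, 1, 0, 0, 0)
print(unix_to_civil(2**31 - 1))    # (2038, 1, 19, 3, 14, 7)
```

Nothing here depends on the width of the integer, which is the point of widening time_t rather than moving the epoch: the same arithmetic keeps working past 2038.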
Re: (Score:2)
Case Study and Analysis of Ariane 5 .. (Score:2)
The 2038 bug may show up early (Score:2)
Thanks to the math required for date conversion, the 2038 bug may actually show up a couple of years early. How do I know? I tried setting the clock forward in an embedded system I wrote the code for. Its calendar actually seems to fail in 2036. I haven't tried it in a while, but I think I can't even set the date past January 2036. I didn't try to figure out exactly why it failed earlier than it should have, because the library code looks pretty messy.
It's using the standard date library stuff from the IAR