
AMD Confirms CPU Bug Found By DragonFly BSD's Matt Dillon

An anonymous reader writes "Matt Dillon of DragonFly BSD just announced that AMD confirmed a CPU bug he found. Matt quotes part of the mail exchange and it looks like 'consecutive back-to-back pops and (near) return instructions can create a condition where the processor incorrectly updates the stack pointer.' The specific manifestations in DragonFly were random segmentation faults under heavy load."
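The instruction shape the erratum describes is nothing exotic; it is the ordinary function epilogue compilers emit everywhere. A hypothetical C illustration (not AMD's actual reproducer; the epilogue shown is typical GCC/Clang x86-64 output, not taken from the erratum):

    /* Any function that saves and restores several callee-saved registers
     * can compile to the shape the erratum names: back-to-back pops
     * followed by a near ret, i.e. a dense run of stack-pointer updates:
     *
     *     popq %r12    ; rsp += 8
     *     popq %rbp    ; rsp += 8
     *     popq %rbx    ; rsp += 8
     *     ret          ; near return: pops %rip, rsp += 8
     *
     * Because this pattern is ubiquitous, a fault in it shows up as random
     * segfaults under load rather than in any one piece of code. */
    long combine(long a, long b, long c)
    {
        return a * b + b * c + a * c;   /* body is irrelevant; the epilogue matters */
    }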
  • Still, it's very serious. At least it generally causes your program to crash rather than spitting out a wrong answer, and it sounds like the sequence of instructions that causes it is not commonly found.

    I can well understand the guy who found it being all excited. The CPU is the last place you'd look for a bug, and finding one is pretty impressive, especially a really elusive one like this.

  • by Taco Cowboy ( 5327 ) on Tuesday March 06, 2012 @01:31AM (#39257957) Journal

    What has Taiwan got to do with this?

    I mean, was the CPU bug somehow introduced by TSMC?

  • by XDirtypunkX ( 1290358 ) on Tuesday March 06, 2012 @01:33AM (#39257977)

    Both are equally bad from the perspective of a software developer who spends a month trying to work out just what is wrong with their code, especially if something like this occurs on a test machine but not on a development machine.

  • Re:cool, but...? (Score:4, Insightful)

    by ffflala ( 793437 ) on Tuesday March 06, 2012 @01:40AM (#39258027)
    It matters because it's impressive. It also seems fair to associate some of the positive impression with DragonFly BSD, and I cannot see any downside to throwing good PR at any BSD flavor.
  • by sjames ( 1099 ) on Tuesday March 06, 2012 @01:44AM (#39258059) Homepage Journal

    Crash bugs are frustrating, but nowhere NEAR as scary as a bug that results in an incorrect but plausible computation. If the program crashes, you KNOW it crashed and you know the runs before that didn't crash are OK.

    Note that in real life the two cases can overlap: the same bug might trigger a crash on one run and a plausible-looking incorrect result on another, depending on the luck of the draw.

  • Kudos (Score:5, Insightful)

    by Mannfred ( 2543170 ) <mannfred@gmail.com> on Tuesday March 06, 2012 @01:44AM (#39258063)
    I can only imagine the time and effort spent tracking down this problem - a rare CPU condition is far more difficult to narrow down than most programming mistakes. A lot of progress in IT depends on engineers like this, who obsessively solve problems even when it would be much easier to ignore them, hack around them, or pass the buck. Kudos.
  • by Anonymous Coward on Tuesday March 06, 2012 @01:52AM (#39258117)

    Floating-point arithmetic is not exact. A simple value such as 4.0 is representable precisely, but a chain of calculations that should mathematically come out to 4.0 can instead produce 4.0000000000000213 or 3.99999999999973, because every intermediate result gets rounded.

    This is an inherent limitation of how floating point works, and not something that has been "fixed". Programmers still have to worry about this.
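    A minimal sketch of the accumulation effect (a hypothetical example in standard C, not code from the thread):

        #include <stdio.h>

        int main(void)
        {
            double sum = 0.0;
            for (int i = 0; i < 10; i++)
                sum += 0.1;               /* 0.1 has no exact binary representation */

            printf("%.17g\n", sum);       /* prints 0.99999999999999989, not 1 */
            printf("%d\n", sum == 1.0);   /* prints 0: exact comparison fails */
            return 0;
        }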

  • by synthesizerpatel ( 1210598 ) on Tuesday March 06, 2012 @01:59AM (#39258163)

    If your program is 'the kernel' then that qualifies as 'as bad as the division bug' && 'it's a big deal'.

  • by sjames ( 1099 ) on Tuesday March 06, 2012 @02:31AM (#39258319) Homepage Journal

    Imagine there is a tiny bug that makes your floating-point results just slightly wrong once in 1000 times. You run an iterative dynamic simulation of a bridge under load for a million cycles. The results LOOK right...
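    A sketch of how such an error compounds (hypothetical numbers: a relative error of 1e-6 injected once every 1000 steps of an otherwise identical calculation):

        #include <stdio.h>

        int main(void)
        {
            double exact = 1.0, faulty = 1.0;
            for (long i = 0; i < 1000000; i++) {
                exact  *= 1.0000001;                 /* the intended update      */
                faulty *= (i % 1000 == 0)            /* same update, but slightly */
                        ? 1.0000001 * (1.0 + 1e-6)   /* wrong 0.1% of the time    */
                        : 1.0000001;
            }
            printf("exact  %.10g\nfaulty %.10g\n", exact, faulty);
            /* The results differ by ~0.1% -- plausible-looking, but wrong. */
            return 0;
        }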

  • by Darinbob ( 1142669 ) on Tuesday March 06, 2012 @03:04AM (#39258485)

    CPUs have plenty of bugs. They're not necessarily the last place to look, especially for less popular processors. The only reason bugs are rarer in Intel and Intel-compatible CPUs is that the market is so much bigger, and therefore so are the resources for QA. If anything, as processors get bigger and more complex, bugs become more likely. Of course, most are things people don't need to worry about, or that can be worked around by following the advice in the errata.

    In fact, enough people assume CPUs have bugs only in the rarest of cases that it's hard to convince anyone you've actually found a bug that's not in the errata. The same thing happens with compilers: tell people the bug must be in the compiler and they roll their eyes at you.

  • by phantomfive ( 622387 ) on Tuesday March 06, 2012 @04:07AM (#39258855) Journal
    Just because you find an error in a division when you were programming your MMORPG in Visual Basic doesn't mean you've found the Pentium bug. If you noticed it happening a lot, it probably wasn't the bug, just normal IEEE precision issues.
  • by Anonymous Coward on Tuesday March 06, 2012 @04:12AM (#39258883)

    > It came for free

    You mean QBasic.
    QuickBASIC is the one that costs money.

  • by wvmarle ( 1070040 ) on Tuesday March 06, 2012 @04:37AM (#39259003)

    Google is known to build their servers from cheap parts.

    Like RAID, but a RAIS (Redundant Array of Independent Servers). Load distribution may be an issue, since the system has to seamlessly reassign tasks when a server goes down for whatever reason. But for sufficiently large operations (say, five servers or more) this sounds like the way to go: instead of trying to make every individual server highly reliable, use still-quite-reliable consumer-grade hardware and get your reliability from redundancy, as the sketch below shows. And companies like Google need more than one server anyway.
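    A back-of-the-envelope sketch of that argument (99% per-server uptime is a hypothetical figure; the point is how quickly redundancy compounds):

        #include <math.h>
        #include <stdio.h>

        int main(void)
        {
            double per_server = 0.99;   /* assumed availability of one cheap server */
            for (int n = 1; n <= 5; n++)
                printf("%d server(s): %.10f\n",
                       n, 1.0 - pow(1.0 - per_server, n));
            /* With 5 servers the chance that at least one is up is
             * 1 - 0.01^5, i.e. about ten nines. */
            return 0;
        }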

  • by Rockoon ( 1252108 ) on Tuesday March 06, 2012 @05:07AM (#39259111)

    > 512-bit calculations aren't that expensive

    Yes they are.

  • by neokushan ( 932374 ) on Tuesday March 06, 2012 @05:48AM (#39259225)

    Except I very much doubt that would solve whatever "problems" this guy was having. As a newbie programmer, it's entirely understandable that he wouldn't know about the fun you can (or can't) have with floating-point operations. However, I very much doubt that sheer accuracy was the issue; more likely he was assuming that results which are mathematically equal compare exactly equal, when in reality the difference between them isn't necessarily exactly 0.0. Considering it's an MMO, he probably hit something like "Why is this guy not dying? He has 4 HP left and this attack does exactly 4 damage. Must be a bug!"
    Really, it doesn't matter a huge amount. If such "accuracy" is important to your game, then instead of "if (Health is less than 0.0) /* die */" you write "if (Health is less than 0.0 + epsilon) /* die */", with "epsilon" being a very small number (such as 0.00000001) - see the sketch after the link below.
    The real fun with floats, however, is that each platform does something different. It's possible that the OP ran the game on Intel hardware and got one result (which may have seemed more "correct"), then ran it on an AMD machine and got a different (seemingly less correct) result - you can see why he naturally jumped to the conclusion that the AMD system had a bug.
    In reality, chances are both systems were "wrong" anyway; they just happened to use different implementations of floating-point logic. To solve this, once again, higher-precision calculations aren't the answer; rather, there's a compiler switch (/fp:strict in VS) that enforces the strict IEEE floating-point model. It's not as fast as the other modes, but you will at least get the same results across different platforms (assuming the CPU implements the standard correctly, which these days is almost certain).

    There's LOTS of fantastic info on this here: http://gafferongames.com/networking-for-game-programmers/floating-point-determinism/
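    A minimal sketch of the epsilon check described above (hypothetical numbers; EPSILON here is an illustrative tolerance, not a universally correct value):

        #include <stdio.h>

        #define EPSILON 1e-9

        int main(void)
        {
            double health = 4.0;
            /* Mathematically exactly 4, but not quite 4.0 in doubles. */
            double damage = 4.0 * 0.3 / (0.1 + 0.2);

            printf("naive dead?   %d\n", health - damage <= 0.0);      /* 0 */
            printf("epsilon dead? %d\n", health - damage <= EPSILON);  /* 1 */
            return 0;
        }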

  • by m.dillon ( 147925 ) on Tuesday March 06, 2012 @01:20PM (#39262923) Homepage

    Intel has had quite a few serious chip bugs too, all documented in errata. A number of new CPU bugs appear in both AMD and Intel chips with every new generation, but both companies have very large test suites and the number of new bugs goes down with each generation.

    Don't forget that Intel had to recall a Sandy Bridge chipset early in the Sandy Bridge cycle, which cost them something like a billion dollars because the affected motherboards had to be thrown away and replaced. That was due to internal on-chip circuitry related to a SATA port burning out.

    Right at this moment AMD has two issues facing it in order to compete on workstations: (1) power and (2) performance. Their initial Bulldozer release clearly depends too much on compiler optimizations to make full use of the architecture. They will clearly have to beef up some of the simplifications that left their CPU cores a little too sensitive to the instruction sequences compilers generate, and I hope their next few releases will do better.

    On power consumption it comes down to the fab as much as anything else. Their dependence on the fab is clearly a problem, and they've made a break from it to try to solve that, even though it is costing them dearly. At the same time Intel has made some major advances in their three fabs, to the point where Intel could now do their entire production on just two of the three, but they decided to keep the third because they think they can 'grow into' it.

    So AMD definitely has some work ahead of it, and I am hoping they reserve some of their focus for the high end and don't concentrate entirely on laptops. I always like to say that I love AMD, but in the stock market I invest in Intel. That's just business. But I got on the AMD bandwagon big-time when they got to 64-bit first, and I stuck with them all the way through the Phenom II.

    Now, at this moment, Intel's Sandy Bridge has the best value and AMD's Bulldozer is quite far behind, so new purchases for me right now are Intel. That may change in the next year or two, and when it does my new purchases will happily be in the AMD camp again. Frankly, AMD only has to get within shouting distance (~8%) of Intel and I will happily use AMD. AMD doesn't have to beat Intel.

    I think there are a number of things AMD can do right now to compete better with Intel. One of the biggest is the mini-server department (albeit clearly with lower volumes than their current focus on laptops and integrated graphics). AMD consumer CPUs (aka the Phenom II) always had ECC support, but very few motherboards actually supported it, which made it difficult to use AMD for mini-servers and avoid the Intel Xeon tax on ECC. If AMD worked with the mobo vendors to ALWAYS support an ECC option, that would allow them to compete against Intel Xeons on price even when they can't compete on performance.

    On the Opterons AMD clearly has the right idea going with high-core-count CPUs, but the memory subsystem is lagging too much to really make use of all those cores. That seems like low-hanging fruit to me, something which should be readily addressable by AMD. The Opterons still have a lot of value, and could improve radically with Bulldozer, but only if AMD can push the core count and improve the memory subsystem.

    On large multi-core boxes AMD also needs to improve CMPXCHG and other atomic instructions in situations where contention is high. Right now multi-chip Opteron systems seriously lag Intel on contended latency due to cache-coherency inefficiencies. Will Bulldozer fix those latency issues? I don't know.
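    A sketch of the contended case (a hypothetical C11-atomics microbenchmark, not DragonFly code; every failed compare-exchange is an extra round trip through the cache-coherency protocol, which is where the multi-chip latency goes):

        #include <pthread.h>
        #include <stdatomic.h>
        #include <stdio.h>

        #define THREADS 4
        #define ITERS   1000000L

        static atomic_long counter;     /* one hot cache line shared by all threads */

        static void *worker(void *arg)
        {
            (void)arg;
            for (long i = 0; i < ITERS; i++) {
                long old = atomic_load(&counter);
                /* CMPXCHG loop: under contention most attempts fail and
                 * retry, so latency climbs with the number of chips. */
                while (!atomic_compare_exchange_weak(&counter, &old, old + 1))
                    ;   /* 'old' is refreshed with the current value on failure */
            }
            return NULL;
        }

        int main(void)
        {
            pthread_t t[THREADS];
            for (int i = 0; i < THREADS; i++)
                pthread_create(&t[i], NULL, worker, NULL);
            for (int i = 0; i < THREADS; i++)
                pthread_join(t[i], NULL);
            printf("%ld\n", atomic_load(&counter));   /* THREADS * ITERS */
            return 0;
        }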

    AMD only needs to get within shouting distance of Intel for me to buy their chips, and to work with their mobo producers a bit more to get better overall support for their chips' capabilities. They don't have to beat Intel.

    -Matt
