AMD Confirms CPU Bug Found By DragonFly BSD's Matt Dillon 292
An anonymous reader writes "Matt Dillon of DragonFly BSD just announced that AMD confirmed a CPU bug he found. Matt quotes part of the mail exchange and it looks like 'consecutive back-to-back pops and (near) return instructions can create a condition where the processor incorrectly updates the stack pointer.' The specific manifestations in DragonFly were random segmentation faults under heavy load."
Re:Microcode patch (Score:4, Informative)
I'm wondering if they will. This seems like a very odd timing issue that may be a problem in the electronics. Of course, I suppose they could just put in some microcode to wait after certain operations to make sure things settle and so avoid the hardware bug.
Re:This isn't nearly as bad as the division bug (Score:4, Informative)
And it sounds like the sequence of instructions that causes it is not commonly found.
Really?
Pop two off the stack and ret to the calling routine seems fairly common to me. Lots of functions use two arguments and are called with near calls in various programming languages.
Affected CPUs (Score:5, Informative)
A pertinent addition to the submission would be which CPUs have been found to be affected.
The second link says Opteron 6168 and Phenom II X4 820. For a second I thought Bulldozer hadn't managed to do anything right, but these two examples are pre-Bulldozer.
No doubt this is not an exhaustive list.
Confirmed CPUs (Score:5, Informative)
FWIW:
The failure has been observed on three different machines, all running AMD CPUs: a quad Opteron 6168 (48-core) box and two Phenom II X4 820 boxes.
Re:you are mistaken (Score:5, Informative)
A floating point precision error. Floating-point formats cannot exactly represent many numbers, which is especially problematic when you're doing intersections with small objects. A ray projected from an object's surface can, because of the minute errors in floating point, collide with that same object (which produces some cool patterns).
Floating points are kind of crappy. Not that I have a better option with viable performance on a desktop machine. That's not a division bug, that's just the nature of representing numbers in binary with a fixed number of bits.
Re:This isn't nearly as bad as the division bug (Score:5, Informative)
Anyone who's programmed long enough has found unexplainable bugs that are eventually traced down to some bad hardware. :)
I've preferred AMD over Intel for years. Long ago, in a distant computer store, far away.... We sold 386s, 486s, and Pentiums (or their reasonable clone) from Intel, IBM, AMD, and Cyrix. At the time, I didn't really care who made the chip, they were just built out for the customer.
Over the years, I learned to prefer AMD for both the price and performance. Plenty of people will argue "but this Pentium is faster than that AMD". Well, it's all nice, but I don't *have* to stay bleeding edge. I never liquid cooled my CPU, video card, and memory. Friends did. I was always impressed with how much they wasted. I'd just wait 6 months or so, and get something better, faster, and cheaper. :) I do like having a high performance computer, so I upgrade every year or so.
For example, I just set up a couple servers from COTS parts. They used AMD FX-8120s (8-core, 4.0GHz turbo) for $199.99/ea. It seems the comparable Intel is the i7-980 (6-core, 3.6GHz), which is selling at $589.99. For the difference in price, I could build out a 3rd server and still have money left over. Tom's Hardware suggests the i5-2500K (4-core, 3.7GHz turbo) for $224.99 or i7-2600K (4-core, 3.8GHz turbo) for $324.99 as comparable. If I wanted to spend a little more, I could have gone with the AMD FX-8150 (8-core, 4.2GHz turbo) for $249.99. Was $50 for .2GHz worth it? Not really. Something bigger, better, and faster will be out next year, and the year after, and then I'll buy something new.
I used newegg.com for all the prices, so it would be fairly even.
The servers actually use as many cores as I can throw at them, so it's extremely beneficial to have more cores at high speeds.
My desktop/gaming machine still has a Phenom II X6 1100T in it. All the games I play, I can leave all the settings turned all the way up. Maybe if I ran benchmarks I'd see something else gets a slightly faster frame rate, but I can't see any difference. As we all know, various benchmarks show different things.
Re:This isn't nearly as bad as the division bug (Score:4, Informative)
What is the problem with Quick Basic? It came for free and it was quite ok.
No network access? Might be fine for you, but for an MMORPG programmer on the other hand...
Re:This isn't nearly as bad as the division bug (Score:3, Informative)
What is the problem with Quick Basic? It came for free and it was quite ok.
NO it did NOT. What came free was QBasic, which was a stripped-down version of Quick Basic. The full Quick Basic did not have the 640k memory limitation, was able to fully link/compile stand-alone executables, and had a host of other Professional features that QBasic lacked.
Don't get me wrong- QBasic was great for a free environment (at the time). But it was severely limited, and all the references to "quick basic" in this thread appear to be referring to shortcomings in QBasic, which were not present in Quick Basic.
Re:This isn't nearly as bad as the division bug (Score:5, Informative)
We are not talking about the start of the function, but the end.
Who is this "we" .. are you, the anonymous coward, teamed up with icebike (68054)? Clearly you shouldn't be, since he most definitely stated his belief that a two-parameter function would pop its two input parameters near its final return statement.
anything the function has pushed onto the stack will be on the top of the stack, before the return functions.
What you are saying is not news to me. The problem with your argument is that in the x86-64 calling conventions (which is what the article is talking about) there are plenty of volatile registers to use. To be specific there are 7 general purpose registers (64-bit) as well as 6 SSE registers (128-bit) that are considered volatile. If a function really uses so many registers that it requires saving a few of the non-volatile registers, then the function is also most often going to be so non-trivial that it must maintain 16-byte stack alignment.
Only leaf functions can safely violate the 16-byte alignment rule and are allowed to push and pop willy-nilly, but leaf functions also don't need non-volatile registers themselves because they aren't calling anything that might destroy the registers they use. So we are talking about a very narrow situation where the function is (a) a leaf function and (b) takes many parameters (more than 4, certainly) in order to create the register pressure required to need to spill some of them onto the stack someplace other than the mandatory scratch stack space (for the first 4 arguments) required by the calling convention.
My lawn...
Re:Affected CPUs (Score:5, Informative)
Does that mean that the kernel uploads the new microcode on boot? How does it get it?
The microcode module loads the microcode for the CPU from /lib/firmware/amd if it's newer than the one on the CPU. You can download new microcode updates from AMD and place them in this directory if needed, or just let your distro provider update the microcode files when they push new packages out.
Re:This isn't nearly as bad as the division bug (Score:4, Informative)
Re:you are mistaken (Score:5, Informative)
I'm pretty sure it was with the introduction of the Pentium (which had the famous FDIV bug) that John Carmack officially made the switch to single precision FP for most things because it was finally fast enough. FP wasn't cheap, per se, but the simplification it brings over keeping track of binary points and precision/range tradeoffs in integerized algorithms should not be underestimated either.
For example, if I want to do a floating point multiply and add, I just say: f3 = f0 * f1 + f2. Before I even start writing a fixed-point multiply and add, I need to ask what the Q points (binary points) are for each of the terms, what Q point you'd like for the result, and what sort of rounding (if any) the result requires for stability. You can end up with a monstrosity like this, assuming all four numbers are at the same Q point:
x3 = (int)(((long long)x0 * x1 + (1LL << (Q - 1))) >> Q) + x2;
Ok, maybe you hide that behind a macro, but what about cases where some of the terms are at different Q points? A fully general macro (which is no fun to write, BTW) would also have a ton of arguments, and only reduce you to something like x3 = FXMULADD(x0, Q0, x1, Q1, x2, Q2, Q3); which won't win you any awards in the clarity department.
And look at the operations themselves, too. You have type promotion, extra adds and shifts... the instruction sequence itself isn't super efficient. It pays off when floating point takes 10s and 100s of cycles, but is a dubious win when most of the core FP starts coming down into the single digits. With the Pentium's dual pipes and the fact you could keep integer instructions flowing in parallel to the float, that's effectively what happened. And notice we haven't even talked about dynamic range and overflow errors and how they screw you up. If you have to add tests for that... yuck. With floating point, you degrade gracefully if your dynamic range spikes a little higher than you expect.
Anyway, getting back on topic: This isn't the first time an x86 has had a stack-pointer related bug. I remember the 80386s that had the so-called "POPAD bug". [indiana.edu] That one was a bit easier to hit.
Hopefully, AMD will be able to publish a microcode update or something to work around theirs. That's one thing modern x86s have over their predecessors: A good number of CPU bugs can be patched around with microcode updates. I believe Intel added that with the Pentium Pro, and AMD followed suit. I believe my Phenom is one of the affected parts. I guess I'll have to keep an eye out for such a patch.
Bulldozer not affected. (Score:5, Informative)
AMD has indicated to me that Bulldozer is not affected, which is a relief.
I guess I should have realized this would get slashdotted. In any case, it took quite a bit of effort to track the bug down. It was very difficult to reproduce reliably. It isn't a show stopper in that it really takes a lot of work to get it to happen and most people will never see it, but it's certainly a significant bug owing to the fact that it can be reproduced with normal instruction sequences.
I began to suspect it might be a cpu bug last year and after exhaustive testing I posted my suspicions in December:
http://leaf.dragonflybsd.org/mailarchive/kernel/2011-12/msg00025.html [dragonflybsd.org]
Older versions of GCC were more prone to generate the sequence of POPs + RET, coupled with a deep recursion and other stack state, that could result in the bug. It just so happened that DragonFly's buildworld hit the right combination inside gcc, and even then the bug only occurred sometimes and only on a small subset of the .c files being compiled (maybe 2-3 files). The bug never manifested anywhere else, doing anything else, running any other application. Ever.
In particular, the bug disappeared with later versions of GCC and disappeared when I messed with the optimizations. We use -O by default, not -O2. The bug disappeared when I produced code with gcc -O2 (using 4.4.7).
It is really unlikely that Linux is affected... the sensitivity to the particular code sequences laid out by the compiler is so fine that adding a single instruction virtually anywhere could make the bug disappear. Even just shifting the stack pointer a little bit would make it disappear.
In any case, for a programmer like me, being able to find an honest-to-god CPU bug in a modern CPU is very cool :-)
-Matt