AMD Bug BSD

AMD Confirms CPU Bug Found By DragonFly BSD's Matt Dillon

An anonymous reader writes "Matt Dillon of DragonFly BSD just announced that AMD confirmed a CPU bug he found. Matt quotes part of the mail exchange and it looks like 'consecutive back-to-back pops and (near) return instructions can create a condition where the processor incorrectly updates the stack pointer.' The specific manifestations in DragonFly were random segmentation faults under heavy load."

  • Though, it's still very serious. At least it generally causes your program to crash rather than spitting out a wrong answer. And it sounds like the sequence of instructions that causes it is not commonly found.

    I can well understand the guy who found it being all excited. The CPU is the last place you'd look for a bug, and finding one is pretty impressive, especially a really elusive one like this.

    • by XDirtypunkX ( 1290358 ) on Tuesday March 06, 2012 @12:33AM (#39257977)

      Either are equally bad from the perspective of a software developer who spends a month trying to work out just exactly what is wrong with their code, especially if something like this occurs on a test machine but not on a development machine.

      • by GoodNewsJimDotCom ( 2244874 ) on Tuesday March 06, 2012 @12:40AM (#39258019)
        I found out about the division bug as a beginner programmer! I was trying to write the first MMORPG using Quick Basic. I remember division not being exactly accurate, so the solution I needed to use was to round up and down results that are really close. It fixed it, but new programmers shouldn't be forced to deal with stuff like that.

        I've preferred AMDs to Intels because AMD was one of the first sponsors to Esports back in 99. Too bad Columbine happened and I suspect they wanted to distance themselves from Quake tournaments. Another thing I like about AMD was that their processors don't melt if they get hot because they have a self preservation shutdown mode. People said Intel had this, but I melted a processor just a few months ago on SWTOR.
        • by Smauler ( 915644 ) on Tuesday March 06, 2012 @12:48AM (#39258083)

          I was trying to write the first MMORPG using Quick Basic.

          Sounds like the division bug was the least of your problems....

          • by GoodNewsJimDotCom ( 2244874 ) on Tuesday March 06, 2012 @02:08AM (#39258503)
            Heh. I coded a nice tile based RPG out of it, but I couldn't make it MMOG because there is no socket code in Quick Basic. The trick to making big games in Quick Basic is to write your own Virtual Disk so you can get past the 640k memory limit. Once you have a virtual disk, you can write an interpreted language inside Quick Basic, then your code is simply loaded up in a custom database. I rewrote the whole thing in C/C++ because people told me I could get socket libraries in it, but I gave up on my game entirely when Ultima Online came out because I felt I wouldn't be able to build up a market because my graphics were so bad. I was partially right in thinking there was only enough room for one MMORPG at a time back in 97, but with things like Farmville succeeding today, I think I shouldn't have given up after having coded for thousands of hours.
        • by Corbets ( 169101 ) on Tuesday March 06, 2012 @12:50AM (#39258097) Homepage

          I found out about the division bug as a beginner programmer! I was trying to write the first MMORPG using Quick Basic.

          I've never heard "choosing the wrong programming language" described as a bug, but hey, however you want to play it off, man.

        • by Anonymous Coward on Tuesday March 06, 2012 @12:52AM (#39258117)

          Floating point operations are never fully precise. Simple numbers such as 4.0 would be represented as 4.0000000000000213 or 3.99999999999973 if you arrive at this after doing a bunch of calculations.

          This is an inherent limitation of how floating point works, and not something that has been "fixed". Programmers still have to worry about this.

          • by dalias ( 1978986 ) on Tuesday March 06, 2012 @08:52AM (#39260271)
            This is not insightful; it's wrong. Floating point on any modern system conforms, or at least is intended and assumed to conform, to IEEE 754. There are exact answers specified for every basic arithmetic operation and non-transcendental functions. Of course there are decimals that have no representation in binary, but 4.0 is not one of them.
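To make the disagreement above concrete, here is a minimal Python sketch. It assumes Python floats are IEEE 754 doubles (true on essentially all current platforms); the helper name `exact_value` is invented for illustration. It checks that 4.0 is stored exactly, that 0.1 is not, and that division gives an exact result when the true quotient is representable.

```python
from fractions import Fraction


def exact_value(x: float) -> Fraction:
    """Return the exact rational number stored in the float x."""
    return Fraction(x)


# 4.0 is a power of two, so the stored value is exactly 4.
four_is_exact = exact_value(4.0) == Fraction(4)

# 1/10 has no finite binary expansion, so the stored value is only near 1/10.
tenth_is_exact = exact_value(0.1) == Fraction(1, 10)

# When the true quotient is representable, the division result is exact.
exact_quotient = (12.0 / 3.0) == 4.0

# When it is not, the result is a nearby representable value, not 1/3 itself.
third_is_exact = exact_value(1.0 / 3.0) == Fraction(1, 3)

print(four_is_exact, tenth_is_exact, exact_quotient, third_is_exact)
```

Drift such as 3.99999999999973 can still appear after a long chain of operations whose intermediate results were each rounded, which may be what the parent comment had in mind.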
        • I remember division not being exactly accurate, so the solution I needed to use was to round up and down results that are really close.

          Just so you know, division is never accurate in floats, even when the CPU doesn't have bugs. If you're using doubles you'll get better accuracy, but with a 32 bit floating point number, you shouldn't be surprised to find errors in the third digit after the decimal point.

          • I remember division not being exactly accurate, so the solution I needed to use was to round up and down results that are really close.

            Just so you know, division is never accurate in floats, even when the CPU doesn't have bugs. If you're using doubles you'll get better accuracy, but with a 32 bit floating point number, you shouldn't be surprised to find errors in the third digit after the decimal point.

            Just to be clear, it's not limited to division. Hell, errors can creep in just by converting a decimal number to floating point. This is why calculators use decimal arithmetic, well some of them - like Perpenso Calc for iPhone iPad [perpenso.com] RPN Scientific Stats Business Hex. Try "0.5 - 0.4 - 0.1" in your favorite calculator app; it might indicate whether the app is using the FPU or decimal arithmetic. Of course the app may be doing something naive like the "BASIC MMORG", rounding results. It's naive because it is anoth

            • Division is division, regardless of the base used. The issue is that in base 10 (aka decimal numbers), division by 2 and 5 always comes out to a finite decimal; in binary numbers only division by 2 comes out to a finite decimal. Dividing by any primes other than 2 and 5 (and numbers involving those primes) will require rounding in both bases (and they may not necessarily round the same way). That is, unless you're only dividing by combinations of 2 and 5, there really is no preferred base.
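The point about bases can be sketched in Python. The function name `terminates` is an assumption made for this example, not something from the thread. It tests whether 1/n has a finite expansion in a given base, and then runs the "0.5 - 0.4 - 0.1" check mentioned earlier in the thread with both binary floats and the standard `decimal` module.

```python
from decimal import Decimal
from math import gcd


def terminates(denominator: int, base: int) -> bool:
    """True if 1/denominator has a finite expansion in the given base."""
    d = denominator
    g = gcd(d, base)
    while g > 1:
        # Strip every prime factor the denominator shares with the base.
        while d % g == 0:
            d //= g
        g = gcd(d, base)
    # Anything left over is a prime factor the base cannot absorb.
    return d == 1


# 1/10 is finite in base 10 but not in base 2; 1/3 is finite in neither.
print(terminates(10, 10), terminates(10, 2), terminates(3, 10), terminates(3, 2))

# The calculator test: binary doubles leave a tiny residue, decimal does not.
binary_result = 0.5 - 0.4 - 0.1
decimal_result = Decimal("0.5") - Decimal("0.4") - Decimal("0.1")
print(binary_result == 0, decimal_result == 0)
```

Decimal arithmetic only moves the problem: 1/3 still has to be rounded, which matches the comment's claim that neither base is preferred in general.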

              The main problem w

              • Multiplying and dividing are the least of your worries in floating point. Adding and subtracting are where the real problems happen.

                eg.

                float a = 0.1;
                float b = 0.2;
                if (a == b) {
                    print("Before the add, a is equal to b");
                }
                float c = 10000000;
                a += c;
                b += c;
                if (a == b) {
                    print("After the add, a is equal to b");
                }

                What's the output?
                What happens if you multiply by c instead of adding it?
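For anyone who wants to try the puzzle above without a C compiler, here is a rough Python sketch. Python has no native 32-bit float, so the helper `f32` (a name made up here) rounds each intermediate result to single precision through the standard `struct` module. This is an emulation under that assumption, not the original poster's code.

```python
import struct


def f32(x: float) -> float:
    """Round a Python float (a double) to the nearest 32-bit float."""
    return struct.unpack("f", struct.pack("f", x))[0]


a = f32(0.1)
b = f32(0.2)
c = f32(10000000.0)

equal_before = (a == b)

# Emulate "a += c; b += c" with a single-precision result.
equal_after_add = (f32(a + c) == f32(b + c))

# Emulate the follow-up question: multiply by c instead of adding it.
equal_after_mul = (f32(a * c) == f32(b * c))

print(equal_before, equal_after_add, equal_after_mul)
```

In this emulation the add makes the two values compare equal, because neighbouring 32-bit floats near 10,000,000 are 1.0 apart and both sums round to the same one. Multiplication preserves relative precision, so the products stay distinct.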

          • Just so you know, division is never accurate in floats, even when the CPU doesn't have bugs.

            What kind of lather are you people working up? The subject was, a division bug. Out of spec operation. Not normal IEEE precision issues.

          • with a 32 bit floating point number, you shouldn't be surprised to find errors in the third digit after the decimal point.

            So working with millimeters is completely impossible then?

            Bummer. There goes my plan to write a CAD system using metric measurements.

          • Just so you know, division is never accurate in floats

            Um, yes it is.

        •     Anyone who's programmed long enough has found unexplainable bugs that are eventually traced down to some bad hardware. :)

              I've preferred AMD over Intel for years. Long ago, in a distant computer store, far away.... We sold 386s, 486s, and Pentiums (or their reasonable clone) from Intel, IBM, AMD, and Cyrix. At the time, I didn't really care who made the chip, they were just built out for the customer.

              Over the years, I learned to prefer AMD for both the price and performance. Plenty of people will argue "but this Pentium is faster than that AMD". Well, it's all nice, but I don't *have* to stay bleeding edge. I never liquid cooled my CPU, video card, and memory. Friends did. I was always impressed with how much they wasted. I'd just wait 6 months or so, and get something better, faster, and cheaper. :) I do like having a high performance computer, so I upgrade every year or so.

              For example, I just set up a couple servers from COTS parts. They used AMD FX-8120's (8 core, 4.0Ghz turbo) for $199.99/ea. It seems the comparable Intel is the i7-980 (6 core, 3.6Ghz), which is selling at $589.99. For the difference in price, I could build out a 3rd server, and still have money left over. Toms hardware suggests the i5-2500K (4 core, 3.7Ghz turbo) for $224.99 or i7-2600K (4 core 3.8Ghz turbo) for $324.99 as comparable. If I wanted to spend a little more, I could have gone with the AMD FX-8150 (8 core, 4.2Ghz turbo) for $249.99. Was $50 for .2Ghz worth it? Not really. Something bigger, better, and faster will be out next year, and the year after, and then I'll buy something new.

                I used newegg.com for all the prices, so it would be fairly even.

              The servers actually use as many cores as I can throw at them, so it's extremely beneficial to have more cores at high speeds.

              My desktop/gaming machine still has a Phenom IIx6 1100T in it. All the games I play, I can leave all the settings turned all the way up. Maybe if I ran benchmarks, I'd see something else gets a slightly faster frame rate, but I can't see any difference. As we all know, various benchmarks show different things.

          • Pardon my curiosity, but it sounds like you're building ~$400 servers out of basic desktop components. What kind of workload are you putting on these boxes that scales so well, yet doesn't justify the added expense of high-end server class hardware ? Maybe I'm at the other end of the spectrum, but I wouldn't dream of running a server without redundant power supplies and premium boards that have been built and tested to rigorous specs. The added hardware expense more than makes up for decreased maintenanc

            • by wvmarle ( 1070040 ) on Tuesday March 06, 2012 @03:37AM (#39259003)

              Google is known to build their servers from cheap parts.

              Like a RAID, but then a RAIS (Redundant Array of Independent Servers). Load distribution may be an issue as it has to seamlessly reassign tasks when a server is down for whatever reason. But for sufficiently large operations (five servers or more) this sounds to me like the way to go. Instead of trying to make every individual server highly reliable, go with the still very reliable user-grade stuff and get your reliability by redundancy. And companies like Google need more than one server anyway.

            • I used to work for a guy who built servers out of whatever spare parts he had lying around - obsolete desktops, refurbs, ebay junk, whatever. For a while, we were spending at least 10-15 hours a week keeping those things up, or driving down to the datacenter to physically reboot them.

              YMMV but I've got junker machines which have been running as servers 24/7 for years without a glitch. I'm just about to replace a few of them with Intel Atom boxes to save power/eardrums and I'm worrying about the reliability of the new machines. I'm going to keep the old machines lying around for at least a couple of months.

            • Well, $604.94/ea. The memory came with an 8GB Class 4 micro SD, and we got a $10 newegg gift card each. I forget what that was bundled with. If you consider a gift card as cash, they were under $600/ea.

              13-131-767 @$94.99/ea ASUS M5A97 AM3+
              17-822-008 @$24.99/ea DIABLOTEK PSDA500 500W RT
              19-103-961 @$199.99/ea AMD 8-CORE FX-8120 3.1G
              20-220-609 @$84.99/ea 4Gx4 PATRIOT PGD316G1600ELQK
              22-148-725 @$99.99/ea Seagate 1.5TB ST1500DL003 (x2)

              All of those are quantity 1, except the hard drives. They are 8 core 4Ghz (always running in Turbo mode), with 16GB ram, and RAID 1 on the drives. I opted to go more like Google's topless server. I used cable ties to mount up everything on wire racks from Home Depot. Ya, the same plastic/rubber coated ones you'd use in your closet. This is serving out of my house on a business FiOS line, so no one at a datacenter can complain. :) They're running amazingly cool. Because there's nothing interrupting normal convection air currents, all the heat sinks and drives are cool to the touch. They're a bit quieter than my desktop PC, because I don't require extra fans to pull the hot air out of the case. My regular desktop has a 250cfm fan on it to keep it cool. Without it, and with the side on, it can overheat in a few minutes when gaming.

              The room does have an air conditioning return in it, which helps keep the room cool. The only fan I added was a HEPA filter. It's oversized for the room, but it'll help keep dust off the machines. The room is the same temperature as the rest of the house, so I'm happy with it. It serves no purpose for cooling the machines, since it's not even pointed at them. :)

              I have some pretty low load servers. Rather than buying a dozen of anything, I opted for using virtual machines. These two servers are hosting 4 VMs at this time, and there will be more. It's a young setup, and I have a lot of work to do on it. I opted to use VirtualBox. It works very well. I had intended to try VMware ESXi or Citrix XenServer. Unfortunately, neither would use the crappy software RAID that the boards provide, and I wasn't willing to drop money on real RAID controllers. I looked around a bit, and it seems that you can try to use some workarounds, but I didn't have the time or inclination to do it, where I could have VirtualBox going in less than an hour.

              The VMs are redundant between servers. Further on, you can read more about how I did it in the past between physical boxes. So if a single VM crashes, who cares. If a VM host crashes, well, it's reduced redundancy, but I'm still operating. I'm going to put out more VM hosts, and increase the redundancy. 4 machines with 6 VMs each is like 24 physical boxes. That's a serious savings, especially where the VM host costs about $600.

              Let me give you a little history. :)

              Long before Google made the pictures of the way they do servers, the company I was at was using COTS parts. That was voyeurweb.com (NSFW). They were hosting with a company not to be named (as in, I can't remember), who sold them on a $50k investment of a Sun server. They promised it was more power than anyone could ever want. That lasted about 3 days. It was after this, I got involved with them. We dropped about $15k on 10 servers. They were fairly cheap machines. Asus gaming motherboards, AMD K6/2 300 CPU, 512MB RAM, 8GB and 20GB IDE drives. The most expensive part at the time was the cases. It was pretty much what you'd be using at home at the time.

              We had the occasional failures, but they were usually due to load or CPU fan failures. At the time, they had under 1 million daily viewers, so we could handle that load on 4 of the 10 machines. Load balancing was done with DNS round robin. I know people say it's a poor system, but it worked well. There was typically a 3 second delay if you happened to hit a bad server, and then you'd roll off to the nex

          • Take a look at the benchmarks [anandtech.com]. The FX-8150 really doesn't come out looking good against the 2500k, much less against the i7-980.
          • by Kjella ( 173770 )

            For example, I just set up a couple servers from COTS parts. They used AMD FX-8120's (8 core, 4.0Ghz turbo) for $199.99/ea. It seems the comparable Intel is the i7-980 (6 core, 3.6Ghz), which is selling at $589.99.

            Modded informative? Only on slashdot... Also you compare turbo speeds (and GHz is silly anyway due to the difference in IPC), yet say:

            The servers actually use as many cores as I can throw at them, so it's extremely beneficial to have more cores at high speeds.

            If all cores are 100% loaded, you're not going to get anywhere close to max turbo. That's the extra boost it can give if only one core is working.

            Toms hardware suggests the i5-2500K (4 core, 3.7Ghz turbo) for $224.99 or i7-2600K (4 core 3.8Ghz turbo) for $324.99 as comparable.

            Tomshardware never tested the FX-8120, so that's a lie. They tested the FX-8150 and found [tomshardware.com]:

            In the very best-case scenario, when you can throw a ton of work at the FX and fully utilize its eight integer cores, it generally falls in between Core i5-2500K and Core i7-2600K

            The FX-8120 has 500 MHz lower base frequency which is far more significant than the 200 MHz lower max turbo. Not many have tested it but xbitl

            • It's too late in the evening for me to go chase down the Tom's Hardware link. I closed that tab a while ago.

              As for the rest... It works. It works well. It's cheaper. I don't upgrade desktops or servers every day. No one does, unless you're filthy rich and don't know where to spend your money. If that's the case, you can send me some of that via PayPal on my site.

              This round of upgrades is replacing dual Opteron 1.4Ghz boxes, putting their respon

      • by sjames ( 1099 ) on Tuesday March 06, 2012 @12:44AM (#39258059) Homepage Journal

        Crash bugs are frustrating, but nowhere NEAR as scary as a bug that results in an incorrect but plausible computation. If the program crashes, you KNOW it crashed and you know the runs before that didn't crash are OK.

        Note that IRL the two cases can overlap. That is, a bug that might trigger a crash or might trigger an incorrect computation that might be plausible depending on luck of the draw.

    • by icebike ( 68054 ) * on Tuesday March 06, 2012 @12:35AM (#39257991)

      And it sounds like the sequence of instructions that causes it is not commonly found.

      Really?
      Pop two off the stack and ret to the calling routine seems fairly common to me. Lots of functions use two arguments and are called with near calls in various programming languages.

      • x86_64 ABI (Score:5, Interesting)

        by DrYak ( 748999 ) on Tuesday March 06, 2012 @02:12AM (#39258527) Homepage

        Pop two off the stack and ret to the calling routine seems fairly common to me. Lots of functions use two arguments and are called with near calls in various programming languages.

        That might have been true on 386s.

        But currently we're in 2012 and the most widely used instruction set for Linux on AMD processors is x86_64. Because these 64bit processors feature a big number of registers, the two arguments will be passed as registers, not on the stack. So the sequence of instructions isn't indeed common.

        • Not true. pop, pop, ret is a sequence that you are likely to see in any function that makes use of two or more callee-save registers - it will push them before using them and then pop them at the end. If you're lucky, the register allocator will have done some peephole optimisation and moved the pops earlier...
      • Pop two off the stack and ret to the calling routine seems fairly common to me. Lots of functions use two arguments and are called with near calls in various programming languages.

        The target function of the call has no business pushing or popping its arguments, ever. It doesn't work. Never has. The caller pushes the arguments, and then in some calling conventions (such as STDCALL) the target function removes them from the stack using the return instruction itself ("ret 8" will remove 8 bytes of parameters), while in others (such as CDECL) the caller itself is responsible for removing the parameters.
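The difference between the two conventions can be sketched with a toy stack model in Python. Everything here is a hypothetical illustration: `ToyStack`, `call` and the 4-byte slot size are assumptions for a 32-bit example, not a description of any real ABI implementation.

```python
WORD = 4  # bytes per stack slot in this 32-bit toy model


class ToyStack:
    """A list-backed stand-in for the hardware stack (hypothetical model)."""

    def __init__(self):
        self.slots = []

    def push(self, value):
        self.slots.append(value)

    def ret(self, arg_bytes=0):
        """Model 'ret N': pop the return address, then drop N bytes of arguments."""
        address = self.slots.pop()
        del self.slots[len(self.slots) - arg_bytes // WORD:]
        return address


def call(stack, callee_cleans):
    """Push two arguments, call, return, and report the depth the caller sees."""
    stack.push("arg2")            # caller pushes the arguments
    stack.push("arg1")
    stack.push("return address")  # the call instruction pushes this
    stack.ret(2 * WORD if callee_cleans else 0)
    depth_after_return = len(stack.slots)
    if not callee_cleans:
        del stack.slots[-2:]      # CDECL: caller drops its own arguments
    return depth_after_return


stdcall_depth = call(ToyStack(), callee_cleans=True)   # "ret 8" already cleaned up
cdecl_depth = call(ToyStack(), callee_cleans=False)    # two slots left for the caller
print(stdcall_depth, cdecl_depth)
```

In neither case does the callee pop its arguments with explicit pop instructions, which is the point the comment is making.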

        Let me repeat that what you are describing is not possible. When the target function be

        • The pushes and pops involved are for call-saved registers, not for arguments. Over the years GCC has kinda flip-flopped over the best way to handle that... whether to use PUSH and POP or to use SUB/MOV/MOV/MOV/... the MOV sequences produce much longer instructions, so if you are space-conscious (e.g. -Os), you are more likely to get PUSH/POP.

          Intel and AMD cpus, over the years, have been better or worse at optimizing instructions which adjust the stack pointer. These days PUSH/POP sequences should be as

      • And it sounds like the sequence of instructions that causes it is not commonly found.

        Really?
        Pop two off the stack and ret to the calling routine seems fairly common to me. Lots of functions use two arguments and are called with near calls in various programming languages.

        So how come AMD systems aren't crashing all over the place? Why did it take him a year to be able to reproduce it reliably? I think my desktop machine has one of those chips in it but it's never had an unexplained crash.

    • by synthesizerpatel ( 1210598 ) on Tuesday March 06, 2012 @12:59AM (#39258163)

      If your program is 'the kernel' then that qualifies as 'as bad as the division bug' && 'it's a big deal'.

    • by Forever Wondering ( 2506940 ) on Tuesday March 06, 2012 @01:19AM (#39258249)

      Though, it's still very serious. At least it generally causes your program to crash rather than spitting out a wrong answer. And it sounds like the sequence of instructions that causes it is not commonly found.

      I can well understand the guy who found it being all excited. The CPU is the last place you'd look for a bug, and finding one is pretty impressive, especially a really elusive one like this.

      Actually, it could be occurring in other places/programs that aren't crashing but are [silently] producing bad results. The floating point bug, once isolated, could be probed for, and compensated for.

      From what I can tell from reading the assembly code, the function is unremarkable except for the fact that it's recursive. It isn't doing anything exotic with the stack (e.g. just pushes at prolog and pops at epilog). The epilog is starting at +160 and the only thing I notice is that there are several conditional jumps there and just above it is a recursion call with a fall through. But, from the AMD analysis, it appears that it's the specific order of the push/pops that is the culprit. In this instance, it's r14, r13, r12, rbp, rbx

      The workaround for this bug might be that the compiler has to put a nop at the start of all function epilogs (e.g. a nop before the pop sequence) on every function because you can't predict which function will be susceptible. Or, you have to guarantee that the push/pop sequence doesn't emit the sequence that causes the problem (e.g. move the rbp push to the first in sequence as I suspect that putting it in the middle is what is causing the problems)
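The first workaround idea (a nop ahead of the pop sequence) could look roughly like the following Python sketch of a post-processing pass over assembly mnemonics. It is purely illustrative: the function name `pad_epilogues`, the `min_pops` threshold and the sample instruction list are all assumptions, and nothing in the thread establishes that such padding actually avoids the erratum.

```python
def pad_epilogues(instructions, min_pops=2):
    """Insert a 'nop' before each run of back-to-back pops that ends in a ret."""
    out = []
    i = 0
    while i < len(instructions):
        # Measure the run of consecutive pop instructions starting at i.
        j = i
        while j < len(instructions) and instructions[j].startswith("pop "):
            j += 1
        run = j - i
        followed_by_ret = j < len(instructions) and instructions[j].startswith("ret")
        if run >= min_pops and followed_by_ret:
            out.append("nop")  # hypothetical padding ahead of the epilogue
        if run == 0:
            out.append(instructions[i])
            i += 1
        else:
            out.extend(instructions[i:j])
            i = j
    return out


# Hypothetical epilogue restoring the registers named in the comment.
epilogue = ["mov rax, rdi", "pop rbx", "pop rbp", "pop r12", "pop r13", "pop r14", "ret"]
print(pad_epilogues(epilogue))
```

A real fix would live in the compiler's code generator or in a microcode update rather than in a text pass like this one.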

    • Oh, I've found CPU bugs before. But I never found one others hadn't already found. The 16MHz 80386 had a bug with counters. If you did a REP MOVSW or similar instruction in a 16 bit mode, starting on an odd address, and you made the pointer registers roll over, the CPU would lock up. Couldn't handle the transition from 0xFFFF to 0x0001 in either direction. That was fixed in all the faster 386's. As I recall, there were about a dozen bugs in the 386. Of course later processors were all checked for those specific bugs, so they never happened again.

      Then there's unintended features such as pipeline oddities. If you have self modifying code, and it changes the destination of a jump instruction immediately before executing it, the computer will jump to the old address. Step through those same instructions in a debugger, and it will jump to the new address. Strictly speaking, jumping to the old address is incorrect, but it doesn't break any good code and fixing it would wreck pipelining. This behavior has been known for a long time, and every CPU from at least the 386 to the Pentium 4 behaves this way. It wasn't an important problem because so little code was self modifying. Wasn't any good as a copy protection method either, as only an amateur would be fooled by it. I think it's been resolved in at least 2 ways. First, by amending the documentation for the instruction set to expressly state that behavior is undefined in such a case, and second, by proving that there is never any need for self modifying code. And making the separation between code and data explicit. Now we have No eXecution bits.

      There are sometimes even Easter eggs. For some processors, a few unassigned opcodes performed a useful operation. It wasn't by design. Is that a bug? Another case was the use of out of bounds values. For instance, the ancient 6502 supports this packed decimal arithmetic mode, in which 0x99 meant 99. So what happened when some joker gave it an illegal value such as 0xFF? 0xFF was interpreted as 15*10+15 = 165, and one could perform some math on it and get correct results. Divide 0xFF by 2 (shift right), and it would compute the correct result of 0x82. That sort of thing makes life tough for emulators, and I have yet to find an Apple II emulator that reproduces that behavior faithfully.
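The arithmetic in the packed decimal example can be checked with a few lines of Python. This only reproduces the comment's digit interpretation (high nibble times ten plus low nibble); the function name `nibble_value` is made up, and the sketch does not claim to model actual 6502 hardware behavior.

```python
def nibble_value(byte: int) -> int:
    """Interpret a byte as two 4-bit digits: high * 10 + low, with no range check."""
    return (byte >> 4) * 10 + (byte & 0x0F)


legal = nibble_value(0x99)      # ordinary packed decimal: digits 9 and 9
illegal = nibble_value(0xFF)    # out-of-range digits: 15 and 15
halved = illegal // 2           # integer half of the illegal value

print(legal, illegal, halved, nibble_value(0x82))
```

Under this reading, 0xFF stands for 165, and the 0x82 quoted in the comment stands for 82, which is 165 halved and rounded down.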

      • there is never any need for self modifying code

        There is when you're on a memory-constrained platform, which admittedly the PC is not. Selfmod code is still used in demo coding, especially with 256-byte and 4096-byte competitions, but that is exclusively an academic exercise.

        On an embedded system with just a few kbytes of memory, like say an ARM-powered gadget, self-modifying code is still relevant, even in 2012. Just because we can put 4 gigs of Ram in a toaster doesn't mean we should.

      • by AmiMoJo ( 196126 ) on Tuesday March 06, 2012 @03:37AM (#39258999) Homepage Journal

        Most of the undocumented op-codes on older CPUs were down to the fact that they were designed by hand rather than having the circuits computer generated. A computer will make sure all illegal op-codes are caught and generate an exception, but human beings didn't bother. Designers put in test op-codes as well which were usually just left in there for production. Even the way humans design circuits makes them more likely to produce useful undocumented op-codes and side-effects.

        It was somewhat risky to use them though, because the manufacturer might decide to change the CPU. The Z80 design was licensed out and any number of companies could supply them, all with their own unique bugs. Some games used these features for copy protection and then broke when the producer switched suppliers.

    • by mysidia ( 191772 )

      Though, it's still very serious. At least it generally causes your program to crash rather than spitting out a wrong answer. And it sounds like the sequence of instructions that causes it is not commonly found.

      It may be uncommon to be found... but that doesn't equate to not exploitable.

    • by Darinbob ( 1142669 ) on Tuesday March 06, 2012 @02:04AM (#39258485)

      CPUs have plenty of bugs. It's not necessarily the last place to look, especially for less popular processors. The only reason bugs are rarer with Intel and Intel-copying CPUs is that the market is so much bigger, and therefore so are the resources for QA. Actually, the bigger and more complex processors become, the more likely they are to have bugs. Of course most are things people don't worry about or that can be worked around by following advice in the errata.

      In fact, the widespread assumption that CPUs have bugs only in the rarest of cases makes it hard to convince others that you have actually found a bug that's not in the errata. The same thing happens with compilers: you tell people that the bug must be in the compiler and they roll their eyes at you.

    • by Sycraft-fu ( 314770 ) on Tuesday March 06, 2012 @02:18AM (#39258575)

      You try and find something that "the other guy" had a problem with and bring it up as worse so as to try and "protect" the thing you are a fan about? Because I see nothing about the FDIV bug anywhere but your post.

      Oh and you know what that bug applied to, right? The Intel Pentium, the ORIGINAL Pentium. Not the Pentium MMX, not the Pentium Pro, not the Pentium II, not the Pentium III, not the Pentium 4, not the Core, not the Core 2, not the Core i, not the second generation Core i. And yes, that's how many major processor versions from Intel there have been since then (with another to launch in the next couple weeks). The original Pentium chips that had this problem came out almost 2 decades ago, 1993.

      So seriously, leave off it. I get tired of any time there is a problem with $Product_X fans of it will point out how $Product_Y had a similar or worse error way back in the day and that somehow changes things.

      No it doesn't. The story is about the AMD chips, nobody gives a shit about the FDIV bug and I'll wager there are people reading Slashdot who weren't alive when it happened.

      The good news for AMD is that processors can often patch around this shit in microcode these days so a recall may not be needed. Have to see, but the potential is there for a software (so to speak) fix.

      • I don't disagree with your rant, however it is just not a good idea to dismiss a processor bug as "happened a long time ago". The point is, it happened. Processor bugs happen. And here is one that happened last year [ibm.com] if you must.

      • So seriously, leave off it. I get tired of any time there is a problem with $Product_X fans of it will point out how $Product_Y had a similar or worse error way back in the day and that somehow changes things.

        Unless the $Product_X fans also point out that the maker of $Product_Y paid an awful lot of money to replace the broken CPUs.

      • by c ( 8461 )

        > I get tired of any time there is a problem with $Product_X fans of it will point
        > out how $Product_Y had a similar or worse error way back in the day and that
        > somehow changes things.

        The FDIV bug was really in a class of its own as CPU bugs go; it was trivially user accessible. You could test for the presence of the bug using a *spreadsheet*. This differs from pretty much every other CPU bug where you pretty much have to be cranking out some odd code before you see anything.

        Being as accessible as

    • Comment removed based on user account deletion
    • by mcgrew ( 92797 ) * on Tuesday March 06, 2012 @10:55AM (#39261601) Homepage Journal

      Ah, the memories...

      Though, it's still very serious. At least it generally causes your program to crash rather than spitting out a wrong answer.

      At Intel, Quality is Job 0.99989960954

      Q: What is a mad scientist?
      A: A researcher with a Pentium

      Q: How many Pentium designers does it take to screw in a light bulb?
      A: 1.99904274017, but that's close enough for non-technical people.

      Q: What's another name for the "Intel Inside" sticker they put on Pentiums?
      A: The warning label.

      Q: Why didn't Intel call the Pentium the 586?
      A: Because they added 486 and 100 on the first Pentium and got 585.999983605.

      Q: Did you hear about the new "morning after" pill being developed as a replacement for RU-486???
      A: It's called RU-Pentium. It causes the embryo to not divide correctly.

  • I wonder if AMD likes apples.

  • This is cool, but...?

    Why does it matter that it's the lead developer of DragonflyBSD?

    • I suppose to be sure he is not confused with the other Matt Dillon.
    • by icebike ( 68054 ) *

      Because if it were Joe Random Programmer AMD would not have even listened to him?

    • Because people read the name and they think it's Matt "There's Something About Mary", "Wild Things" oh ya with the two babes at once Dillon.

      Same reason they have to specify millionaire-playboy Bruce Wayne.

      • by m.dillon ( 147925 ) on Tuesday March 06, 2012 @11:33AM (#39262087) Homepage

        What's really amusing is that I've been on the scene for so long if you google my name 'Matthew Dillon', the first entry is actually... me! And not the actor(s). I'm sure that grinds a bit but I do bask in the occasional fan mail reaching my inbox, just before I hit the 'delete' key.

        In recent years it's started to flip back and forth, and I expect Hollywood will again take over the top spot after things die down again :-)

        -Matt

    • Re:cool, but...? (Score:4, Insightful)

      by ffflala ( 793437 ) on Tuesday March 06, 2012 @12:40AM (#39258027)
      It matters because it's impressive. It also seems fair to associate some of the positive impression with DragonflyBSD, and I cannot see any downside to throwing good PR at any BSD flavor.
    • Re:cool, but...? (Score:5, Interesting)

      by wrook ( 134116 ) on Tuesday March 06, 2012 @12:51AM (#39258113) Homepage

      Matt Dillon is a rather famous programmer (as programmers go). I assume that's why they mention him by name. I think a very large percentage of old Amiga hackers know who he is. He's also done work on the Linux kernel. Despite all that, he's best known for his work on FreeBSD and on his DragonflyBSD project. While a lot of old timers will know that, not everyone else will.

    • Because now we know not only that something cool was done, but also who did it. Both are relevant.
    • Why does it matter that it's the lead developer of DragonflyBSD?

      As a kernel developer he works on code which manipulates CPUs at a low level. That's why he found the bug.

    • Why does it matter that it's the lead developer of DragonflyBSD?

      It's nice to mention what the guy is known for. Besides, the bug came up when he was tinkering with DragonFly BSD.

      If we were completely pedantic, it ultimately does not matter. Anyone interested in computers and programming with enough talent could have found the bug. But yeah.

  • Kudos (Score:5, Insightful)

    by Mannfred ( 2543170 ) <mannfred@gmail.com> on Tuesday March 06, 2012 @12:44AM (#39258063)
    I can only imagine the time and effort spent on tracking down this problem - a rare CPU condition is exponentially more difficult to narrow down than most programming mistakes. A lot of progress in IT depends on engineers like this, who obsessively solve problems even when it's much easier to just ignore them, try to hack around them or pass the buck around. Kudos.
    • Yep, indeed. Kudos to Matt and his insight.

    • I agree, but remember that most engineers are working on company time. For most companies it wouldn't be rational to have an engineer work for months to isolate and reproduce this CPU bug. After all, this work won't particularly benefit this company over all the other companies, and at any rate it would be much cheaper to just do the workaround (which might be necessary anyway). However, a good engineer probably couldn't resist looking into this in his free time (and maybe in company time with nobody looking!) at
  • Affected CPUs (Score:5, Informative)

    by Anonymous Coward on Tuesday March 06, 2012 @12:49AM (#39258093)

    A pertinent addition to the submission would be which CPUs have been found to be affected.
    The second link says Opteron 6168 and Phenom II X4 820. For a second I thought that Bulldozer hadn't managed to do anything right, but these two examples are pre-Bulldozer.
    No doubt this is not an exhaustive list.

    • That's exactly what I was wondering. The e-mail exchange didn't seem specific about which CPU was impacted. Please don't tell me it's every Opteron or Phenom in AMD's lines.
  • Windows? Does this mean that those users and devs aren't so important as far as total CPU load?

    • Linux in 64-bit mode uses registers to pass arguments to functions.
      So the most common sequence at the end of a function isn't "bunch of pops then a (near) return", but "move the results into target registers and then return".
      Thus the bug sequence doesn't happen that often.

      • by dragonk ( 140807 )

        This would be insightful and all -- except that it isn't -- because DragonFly BSD uses the same x86-64 calling conventions as Linux.

        • Indeed, even WIN64 uses the same essential calling convention because the hardware itself is designed with it in mind.. specifically the 64-bit structured exception handling requires 16-byte aligned read and write operations by design.
  • Anyone know if that "Test case" image is available? I'd like to check if my Phenom II x6 is affected.

    • Presumably AMD will announce affected CPUs fairly soon, after they get done testing. This isn't the kind of thing they would be able to sit on, even if they wanted to. If your CPU has been working for you in general it isn't like it is going to suddenly go and beat up your cat or something, it'll be fine for a bit longer while AMD figures out which ones are all affected and figures out how to fix it.

      As I noted in another post, depending on it may be possible to fix it via microcode. CPUs aren't "pure" hardw

  • Confirmed CPUs (Score:5, Informative)

    by Jah-Wren Ryel ( 80510 ) on Tuesday March 06, 2012 @01:04AM (#39258197)

    FWIW:

    The failure has been observed on three different machines, all running AMD cpus. A quad opteron 6168 (48 core) box, and two Phenom II x4 820 boxes.

  • security exploit? (Score:3, Interesting)

    by Anonymous Coward on Tuesday March 06, 2012 @01:20AM (#39258251)

    I have to worry about stack smashing bugs here... can there be a way for (say) a data pattern in a media file, or carefully crafted javascript or java code that's been JIT-compiled, to break out of its sandbox? What about a hostile OS kernel running inside a VPS container taking over the hypervisor or bare iron? Hmm.

  • by hcs_$reboot ( 1536101 ) on Tuesday March 06, 2012 @01:46AM (#39258405)
    Matt Dillon [imdb.com], desperate after unsuccessfully chasing Mary in Something about Mary [imdb.com], radically changed jobs and started to study computer science...
  • by m.dillon ( 147925 ) on Tuesday March 06, 2012 @10:57AM (#39261625) Homepage

    AMD has indicated to me that the Bulldozer is not affected, which is a relief.

    I guess I should have realized this would get slashdotted. In any case, it took quite a bit of effort to track the bug down. It was very difficult to reproduce reliably. It isn't a show stopper in that it really takes a lot of work to get it to happen and most people will never see it, but it's certainly a significant bug owing to the fact that it can be reproduced with normal instruction sequences.

    I began to suspect it might be a cpu bug last year and after exhaustive testing I posted my suspicions in December:

    http://leaf.dragonflybsd.org/mailarchive/kernel/2011-12/msg00025.html [dragonflybsd.org]

    Older versions of GCC were more prone to generate the sequence of POPs + RET, coupled with a deep recursion and other stack state, that could result in the bug. It just so happened that DragonFly's buildworld hit the right combination inside gcc, and even then the bug only occurred sometimes and only on a small subset of .c files being compiled (like maybe 2-3 files). The bug never manifested anywhere else, doing anything else, running any other application. Ever.

    In particular the bug disappeared with later versions of GCC and disappeared when I messed with the optimizations. We use -O by default, not -O2. The bug disappeared when I produced code with gcc -O2 (using 4.4.7).

    It is really unlikely that Linux is affected... the sensitivity to particular code sequences laid out in the compiler is so fine that adding a single instruction virtually anywhere could make the bug disappear. Even just shifting the stack pointer a little bit would make it disappear.

    In any case, for a programmer like me being able to find an honest-to-god cpu bug in a modern cpu is very cool :-)

    -Matt

    • by m.dillon ( 147925 ) on Tuesday March 06, 2012 @11:17AM (#39261875) Homepage

      Since the cat is out of the bag some further clarification is required so I will include some more of the email I received. I didn't quite mean for it to explode onto the scene this quickly, but oh well.

      Again, note that this is *NOT* an issue with Bulldozer. And they will have an MSR workaround for earlier models.

        >> quote
      "AMD has taken your example and also analyzed the segmentation fault and the fill_sons_in_loop code. We confirm that you have found an erratum with some AMD processor families. The specific compiled version of the fill_sons_in_loop code, through a very specific sequence of consecutive back-to-back pops and (near) return instructions, can create a condition where the processor incorrectly updates the stack pointer.

      AMD will be updating the Revision Guide for AMD Family 10h Processors and the Revision Guide for AMD Family 12h Processors, to document this erratum. In this documentation update, which will be available on amd.com later this month, the erratum number for this issue will be #721. The revision guide will also note a workaround that can be programmed in a model-specific register (MSR)."
          end quote

      They go on to document a specific workaround when the MSR is not programmed, which is basically to add a nop for every five pop+return instructions (though I'm not sure if the nop must occur between sequences or within the sequence). I will note that just the presence of 5xPOP + RET does not trigger the bug alone; it requires a very specific set of circumstances set up prior to that (that gcc's fill_sons_in_loop() procedure was able to trigger when gcc 4.7.x was compiled -O, when compiling particular .c files).
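
      To make the "add a nop" idea concrete, here is a rough sketch of a pass over compiler-generated assembly text. It is not AMD's documented workaround: the function name `pad_pop_ret`, the threshold of five pops, and the choice to put the nop directly before the return are all assumptions, since the post itself says the exact placement is unclear.

```python
import re

# Match "pop"/"popq" and near "ret"/"retq" mnemonics at the start of a line.
# The word boundary keeps unrelated mnemonics such as "popcnt" from matching.
_POP = re.compile(r"^\s*popq?\b", re.IGNORECASE)
_RET = re.compile(r"^\s*retq?\b", re.IGNORECASE)


def pad_pop_ret(lines, min_pops=5):
    """Insert a nop before any ret that directly follows >= min_pops pops.

    Both the threshold and the nop placement are guesses made for this
    sketch; the erratum text is the authority on what is really needed.
    """
    out = []
    run = 0  # number of consecutive pop instructions seen so far
    for line in lines:
        if _POP.match(line):
            run += 1
        else:
            if _RET.match(line) and run >= min_pops:
                out.append("\tnop")
            run = 0  # any non-pop line ends the current run
        out.append(line)
    return out


if __name__ == "__main__":
    epilogue = ["\tpopq %rbx", "\tpopq %rbp", "\tpopq %r12",
                "\tpopq %r13", "\tpopq %r14", "\tretq"]
    print("\n".join(pad_pop_ret(epilogue)))
```

      A real fix would more likely live in the compiler or assembler than in a text filter, and, as noted above, the instruction pattern alone is not enough to trigger the bug.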

      As I said, this bug was very difficult to reproduce. It took a year to isolate it and find a test case that would reproduce it in a few seconds. Until then it was taking me upwards of 2 days to reproduce it on a 48-core and much longer to reproduce it on a 4-core.

      Since the bug is sensitive to the stack pointer address, the initial stack randomization that DragonFly does multiplied the time it took to reproduce the bug. But without the stack randomization the bug would NOT reproduce at all (I would never have observed it in the first place). In other words, the bug was *very* stack address sensitive on top of everything else.

      I was ultimately able to improve the time it took to reproduce the bug by poring over all my previous buildworld runs and finding the .c files that gcc had compiled that were most statistically likely for gcc to seg-fault in. Then once I isolated the files I iterated all possible starting stack offsets and eventually managed to reproduce the bug within 10 seconds using a gcc loop (10-20 gcc runs on the same file).

      Changing the stack offset by a mere 16 bytes made the bug go away completely. The one or two particular stack offsets that reproduced the bug could then be further offset in multiples of 32K and still reproduce the bug at the same rate. With a later version of gcc, the bug disappeared. Compiling with virtually any other options (turning optimizations on and off)... the bug disappeared.
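
      The offset search described here can be sketched roughly as follows. Everything in it is illustrative: `scan_offsets` and `simulated_trigger` are made-up names, and the trigger is a pure simulation standing in for "run gcc on the suspect file at this starting stack offset and see whether it seg-faults" (how the offset was actually forced is not described in the post). The simulated bad offset of 0x1230 is arbitrary; only the 16-byte granularity and the 32K period come from the text.

```python
def scan_offsets(run_once, offsets, runs_per_offset=20):
    """Try each starting stack offset several times and count failures.

    run_once(offset) should return True when the run crashed.
    Returns {offset: crash_count} for offsets that crashed at least once.
    """
    hits = {}
    for off in offsets:
        count = sum(1 for _ in range(runs_per_offset) if run_once(off))
        if count:
            hits[off] = count
    return hits


def simulated_trigger(offset, bad=0x1230, period=32 * 1024):
    """Simulation only: 'crash' at one offset, repeating every 32K."""
    return offset % period == bad


if __name__ == "__main__":
    # Step through 64K of candidate offsets in 16-byte increments.
    found = scan_offsets(simulated_trigger, range(0, 64 * 1024, 16),
                         runs_per_offset=3)
    for off, count in sorted(found.items()):
        print(hex(off), count)
```

      In the simulation, moving 16 bytes off a bad offset yields no failures at all, while adding 32K lands on another failing offset, mirroring the behaviour described above.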

      On the bright side, I thought this was a bug in DragonFly for most of last year and set about 'fixing' it, and wound up refactoring most of DragonFly's VM system to get rid of SMP bottlenecks and making it perform much better on SMP in the face of a high VM fault rate. So even though we wound up not doing the 2.12 release the eventual 3.0 release (that we just put out recently) has greatly improved cpu-bound performance on SMP systems.

      -Matt
