AMD Confirms CPU Bug Found By DragonFly BSD's Matt Dillon

An anonymous reader writes "Matt Dillon of DragonFly BSD just announced that AMD confirmed a CPU bug he found. Matt quotes part of the mail exchange and it looks like 'consecutive back-to-back pops and (near) return instructions can create a condition where the processor incorrectly updates the stack pointer.' The specific manifestations in DragonFly were random segmentation faults under heavy load."

  • Re:cool, but...? (Score:5, Interesting)

    by wrook ( 134116 ) on Tuesday March 06, 2012 @01:51AM (#39258113) Homepage

    Matt Dillon is a rather famous programmer (as programmers go). I assume that's why they mention him by name. I think a very large percentage of old Amiga hackers know who he is. He's also done work on the Linux kernel. Despite all that, he's best known for his work on FreeBSD and on his DragonflyBSD project. While a lot of old timers will know that, not everyone else will.

  • by Forever Wondering ( 2506940 ) on Tuesday March 06, 2012 @02:19AM (#39258249)

    Though, it's still very serious. At least it generally causes your program to crash rather than spitting out a wrong answer. And it sounds like the sequence of instructions that causes it is not commonly found.

    I can well understand the guy who found it being all excited. The CPU is the last place you'd look for a bug, and finding one is pretty impressive, especially a really elusive one like this.

    Actually, it could be occurring in other places/programs that aren't crashing but are [silently] producing bad results. The floating point bug, once isolated, could be probed for, and compensated for.

    From what I can tell from reading the assembly code, the function is unremarkable except for the fact that it's recursive. It isn't doing anything exotic with the stack (e.g. just pushes at the prolog and pops at the epilog). The epilog starts at +160, and the only thing I notice is that there are several conditional jumps there, and just above it is a recursive call with a fall-through. But, from the AMD analysis, it appears that it's the specific order of the pushes/pops that is the culprit. In this instance, it's r14, r13, r12, rbp, rbx.

    The workaround for this bug might be that the compiler has to put a nop at the start of every function epilog (i.e. a nop before the pop sequence), because you can't predict which function will be susceptible. Or, you have to guarantee that the emitted push/pop sequence never matches the pattern that causes the problem (e.g. move the rbp push to the first position in the sequence, as I suspect that putting it in the middle is what is causing the problem).
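
    To make the shape of the code concrete, here is an illustrative sketch (my own, not the actual fill_sons_in_loop disassembly); the register order in the trailing comment is taken from the pops listed above:

        #include <stddef.h>

        /* Illustrative only: a small recursive function whose optimized
         * epilog restores several callee-saved registers with back-to-back
         * pops followed by a near ret -- the same general shape described
         * above for gcc's fill_sons_in_loop(). */
        struct node { struct node *child; struct node *next; int depth; };

        int walk(struct node *n, int acc)
        {
            if (n == NULL)
                return acc;
            acc = walk(n->child, acc + n->depth);  /* recursive call with fall-through */
            return walk(n->next, acc);             /* then the shared epilog */
        }

        /* A gcc -O epilog for code like this might look roughly like:
         *
         *     pop  %r14
         *     pop  %r13
         *     pop  %r12
         *     pop  %rbp
         *     pop  %rbx
         *     ret                 ; the 5xPOP + near RET window
         *
         * and the compiler-side workaround would be a nop placed ahead of
         * (or inside) that run of pops. */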

  • security exploit? (Score:3, Interesting)

    by Anonymous Coward on Tuesday March 06, 2012 @02:20AM (#39258251)

    I have to worry about stack smashing bugs here... can there be a way for (say) a data pattern in a media file, or carefully crafted javascript or java code that's been JIT-compiled, to break out of its sandbox? What about a hostile OS kernel running inside a VPS container taking over the hypervisor or bare iron? Hmm.

  • Oh, I've found CPU bugs before, but I never found one others hadn't already found. The 16MHz 80386 had a bug with counters. If you did a REP MOVSW or similar instruction in a 16-bit mode, starting on an odd address, and you made the pointer registers roll over, the CPU would lock up. It couldn't handle the transition from 0xFFFF to 0x0001 in either direction. That was fixed in all the faster 386s. As I recall, there were about a dozen bugs in the 386. Of course, later processors were all checked for those specific bugs, so they never happened again.

    Then there's unintended features such as pipeline oddities. If you have self modifying code, and it changes the destination of a jump instruction immediately before executing it, the computer will jump to the old address. Step through those same instructions in a debugger, and it will jump to the new address. Strictly speaking, jumping to the old address is incorrect, but it doesn't break any good code and fixing it would wreck pipelining. This behavior has been known for a long time, and every CPU from at least the 386 to the Pentium 4 behaves this way. It wasn't an important problem because so little code was self modifying. Wasn't any good as a copy protection method either, as only an amateur would be fooled by it. I think it's been resolved in at least 2 ways. First, by amending the documentation for the instruction set to expressly state that behavior is undefined in such a case, and second, by proving that there is never any need for self modifying code. And making the separation between code and data explicit. Now we have No eXecution bits.

    There are sometimes even Easter eggs. For some processors, a few unassigned opcodes performed a useful operation. It wasn't by design. Is that a bug? Another case was the use of out-of-bounds values. For instance, the ancient 6502 supported a packed decimal arithmetic mode in which 0x99 meant 99. So what happened when some joker gave it an illegal value such as 0xFF? 0xFF was interpreted as 15*10+15 = 165, and one could perform some math on it and get correct results. Divide 0xFF by 2 (shift right), and it would compute the correct result of 0x82. That sort of thing makes life tough for emulators, and I have yet to find an Apple II emulator that reproduces that behavior faithfully.
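
    If anyone wants to play with that lenient digit decoding, a toy model looks like this (my own sketch, not code from any real emulator, and only the value interpretation, not a full decimal-mode ADC/SBC):

        #include <stdint.h>
        #include <stdio.h>

        /* Treat each nibble as a digit even when it's outside 0-9,
         * so 0x99 -> 99 and 0xFF -> 15*10 + 15 = 165. */
        static int lenient_bcd_value(uint8_t b)
        {
            return (b >> 4) * 10 + (b & 0x0F);
        }

        int main(void)
        {
            printf("0x99 -> %d\n", lenient_bcd_value(0x99));  /* 99  */
            printf("0xFF -> %d\n", lenient_bcd_value(0xFF));  /* 165 */
            return 0;
        }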

  • by GoodNewsJimDotCom ( 2244874 ) on Tuesday March 06, 2012 @03:08AM (#39258503)
    Heh. I coded a nice tile-based RPG out of it, but I couldn't make it an MMOG because there is no socket code in Quick Basic. The trick to making big games in Quick Basic is to write your own virtual disk so you can get past the 640k memory limit. Once you have a virtual disk, you can write an interpreted language inside Quick Basic, and then your code is simply loaded up in a custom database. I rewrote the whole thing in C/C++ because people told me I could get socket libraries for it, but I gave up on my game entirely when Ultima Online came out, because I felt I wouldn't be able to build up a market since my graphics were so bad. I was partially right in thinking there was only enough room for one MMORPG at a time back in '97, but I think I shouldn't have given up after having coded for thousands of hours, with things like Farmville succeeding today.
  • x86_64 ABI (Score:5, Interesting)

    by DrYak ( 748999 ) on Tuesday March 06, 2012 @03:12AM (#39258527) Homepage

    Pop two off the stack and ret to the calling routine seems fairly common to me. Lots of functions use two arguments and are called with near calls in various programming languages.

    That might have been true on 386s.

    But it's 2012, and the most widely used instruction set for Linux on AMD processors is x86_64. Because these 64-bit processors have a much larger set of registers, the two arguments are passed in registers, not on the stack. So that sequence of instructions is indeed not common.
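
    For example, assuming the standard System V x86_64 calling convention that Linux uses, a two-argument call never puts its arguments on the stack at all:

        /* With the System V AMD64 ABI, the first integer arguments travel in
         * registers (rdi, rsi, rdx, rcx, r8, r9), so a two-argument call
         * compiles to register moves plus a near call -- no argument pushes,
         * and no "pop twice then ret" on the way out. */
        long add2(long a, long b)      /* a arrives in %rdi, b in %rsi */
        {
            return a + b;
        }

        long caller(long x)
        {
            return add2(x, 42);        /* both arguments stay in registers */
        }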

  • by Sycraft-fu ( 314770 ) on Tuesday March 06, 2012 @03:18AM (#39258575)

    You try to find something that "the other guy" had a problem with and bring it up as worse, so as to "protect" the thing you are a fan of? Because I see nothing about the FDIV bug anywhere but your post.

    Oh, and you know what that bug applied to, right? The Intel Pentium, the ORIGINAL Pentium. Not the Pentium MMX, not the Pentium Pro, not the Pentium II, not the Pentium III, not the Pentium 4, not the Core, not the Core 2, not the Core i, not the second-generation Core i. And yes, that's how many major processor versions from Intel there have been since then (with another to launch in the next couple of weeks). The original Pentium chips that had this problem came out almost two decades ago, in 1993.

    So seriously, leave off it. I get tired of any time there is a problem with $Product_X fans of it will point out how $Product_Y had a similar or worse error way back in the day and that somehow changes things.

    No it doesn't. The story is about the AMD chips, nobody gives a shit about the FDIV bug and I'll wager there are people reading Slashdot who weren't alive when it happened.

    The good news for AMD is that processors can often patch around this shit in microcode these days so a recall may not be needed. Have to see, but the potential is there for a software (so to speak) fix.

  • by AmiMoJo ( 196126 ) on Tuesday March 06, 2012 @04:37AM (#39258999) Homepage Journal

    Most of the undocumented op-codes on older CPUs were down to the fact that they were designed by hand rather than having the circuits computer-generated. A computer-generated design will make sure all illegal op-codes are caught and raise an exception, but human designers didn't bother. Designers put in test op-codes as well, which were usually just left in there for production. Even the way humans design circuits makes them more likely to produce useful undocumented op-codes and side effects.

    It was somewhat risky to use them, though, because the manufacturer might decide to change CPU. The Z80 design was licensed out and any number of companies could supply them, all with their own unique bugs. Some games liked to use these features for copy protection and then broke when the producer switched suppliers.

  • Well, $604.94/ea. The memory came with an 8GB Class 4 micro SD, and we got a $10 newegg gift card each. I forget what that was bundled with. If you consider a gift card as cash, they were under $600/ea.

    13-131-767 @$94.99/ea ASUS M5A97 AM3+
    17-822-008 @$24.99/ea DIABLOTEK PSDA500 500W RT
    19-103-961 @$199.99/ea AMD 8-CORE FX-8120 3.1G
    20-220-609 @$84.99/ea 4Gx4 PATRIOT PGD316G1600ELQK
    22-148-725 @$99.99/ea Seagate 1.5TB ST1500DL003 (x2)

    All of those are quantity 1, except the hard drives. They are 8-core 4GHz (always running in Turbo mode), with 16GB of RAM, and RAID 1 on the drives. I opted to go more like Google's topless server design. I used cable ties to mount everything on wire racks from Home Depot. Ya, the same plastic/rubber-coated ones you'd use in your closet. This is serving out of my house on a business FiOS line, so no one at a datacenter can complain. :) They're running amazingly cool. Because there's nothing interrupting normal convection air currents, all the heat sinks and drives are cool to the touch. They're a bit quieter than my desktop PC, because I don't need any extra fans to pull hot air out of a case. My regular desktop has a 250cfm fan on it to keep it cool. Without it, and with the side on, it can overheat in a few minutes when gaming.

    The room does have an air conditioning return in it, which helps keep the room cool. The only fan I added was a HEPA filter. It's oversized for the room, but it'll help keep dust off the machines. The room is the same temperature as the rest of the house, so I'm happy with it. It serves no purpose for cooling the machines, since it's not even pointed at them. :)

    I have some pretty low-load servers. Rather than buying a dozen of anything, I opted to use virtual machines. These two servers are hosting 4 VMs at this time, and there will be more. It's a young setup, and I have a lot of work to do on it. I opted to use VirtualBox. It works very well. I had intended to try VMware ESXi or Citrix XenServer. Unfortunately, neither would use the crappy software RAID that the boards provide, and I wasn't willing to drop money on real RAID controllers. I looked around a bit, and it seems there are some workarounds you can try, but I didn't have the time or inclination to do it, whereas I could have VirtualBox going in less than an hour.

    The VMs are redundant between servers. Further down, you can read more about how I did this in the past between physical boxes. So if a single VM crashes, who cares. If a VM host crashes, well, it's reduced redundancy, but I'm still operating. I'm going to put out more VM hosts and increase the redundancy. 4 machines with 6 VMs each is like 24 physical boxes. That's a serious savings, especially when each VM host costs about $600.

    Let me give you a little history. :)

    Long before Google published pictures of the way they build servers, the company I was at was using COTS parts. That was voyeurweb.com (NSFW). They were hosting with a company not to be named (as in, I can't remember), who sold them on a $50k investment in a Sun server. They promised it was more power than anyone could ever want. That lasted about 3 days. It was after this that I got involved with them. We dropped about $15k on 10 servers. They were fairly cheap machines: Asus gaming motherboards, AMD K6-2 300 CPUs, 512MB RAM, 8GB and 20GB IDE drives. The most expensive part at the time was the cases. It was pretty much what you'd be using at home at the time.

    We had the occasional failures, but they were usually due to load or CPU fan failures. At the time, they had under 1 million daily viewers, so we could handle that load on 4 of the 10 machines. Load balancing was done with DNS round robin. I know people say it's a poor system, but it worked well. There was typically a 3-second delay if you happened to hit a bad server, and then you'd roll off to the next one.

  • Re:Affected CPUs (Score:4, Interesting)

    by dargaud ( 518470 ) <[ten.duagradg] [ta] [2todhsals]> on Tuesday March 06, 2012 @06:19AM (#39259331) Homepage
    I just went and checked on their microcode page [amd64.org], but the last download is fairly old. Anyway, the explanation of how to update on Linux is not clear:

    Support for updating microcode for the AMD processors listed above will be available starting with kernel version 2.6.29. Microcode update for AMD processors uses the firmware loading infrastructure.

    Does that mean that the kernel uploads the new microcode on boot? How does it get it?

  • by m.dillon ( 147925 ) on Tuesday March 06, 2012 @12:17PM (#39261875) Homepage

    Since the cat is out of the bag, some further clarification is required, so I will include some more of the email I received. I didn't quite mean for it to explode onto the scene this quickly, but oh well.

    Again, note that this is *NOT* an issue with Bulldozer. And they will have an MSR workaround for earlier models.

      >> quote
    "AMD has taken your example and also analyzed the segmentation fault and the fill_sons_in_loop code. We confirm that you have found an erratum with some AMD processor families. The specific compiled version of the fill_sons_in_loop code, through a very specific sequence of consecutive back-to-back pops and (near) return instructions, can create a condition where the processor incorrectly updates the stack pointer.

    AMD will be updating the Revision Guide for AMD Family 10h Processors and the Revision Guide for AMD Family 12h Processors, to document this erratum. In this documentation update, which will be available on amd.com later this month, the erratum number for this issue will be #721. The revision guide will also note a workaround that can be programmed in a model-specific register (MSR)."
        end quote

    They go on to document a specific workaround for when the MSR is not programmed, which is basically to add a nop for every five pop+return instructions (though I'm not sure if the nop must occur between sequences or within the sequence). I will note that just the presence of 5xPOP + RET does not trigger the bug alone; it requires a very specific set of circumstances set up prior to that (which gcc's fill_sons_in_loop() procedure was able to trigger when gcc 4.7.x was compiled -O, when compiling particular .c files).
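
    To give a rough idea of what "programmed in a model-specific register" means in practice, here is a userland sketch for a Linux box with the msr driver loaded. The MSR number and bit below are placeholders, not the real erratum 721 values (those will be in the revision guide), and an OS would normally apply such a workaround from the kernel on every core:

        #include <fcntl.h>
        #include <stdint.h>
        #include <stdio.h>
        #include <unistd.h>

        /* Placeholder MSR number and bit -- the real ones are whatever the
         * revision guide documents for erratum 721, not these. */
        #define WORKAROUND_MSR  0xC0010000u
        #define WORKAROUND_BIT  (1ULL << 0)

        int main(void)
        {
            /* The Linux msr driver exposes MSRs at /dev/cpu/N/msr; the MSR
             * number is the file offset.  Needs root, and a real fix would
             * loop over every CPU rather than just cpu0. */
            int fd = open("/dev/cpu/0/msr", O_RDWR);
            if (fd < 0) { perror("open /dev/cpu/0/msr"); return 1; }

            uint64_t val;
            if (pread(fd, &val, sizeof(val), WORKAROUND_MSR) != sizeof(val)) {
                perror("pread"); return 1;
            }
            val |= WORKAROUND_BIT;
            if (pwrite(fd, &val, sizeof(val), WORKAROUND_MSR) != sizeof(val)) {
                perror("pwrite"); return 1;
            }
            close(fd);
            return 0;
        }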

    As I said, this bug was very difficult to reproduce. It took a year to isolate it and find a test case that would reproduce it in a few seconds. Until then it was taking me upwards of 2 days to reproduce it on a 48-core and much longer to reproduce it on a 4-core.

    Since the bug is sensitive to the stack pointer address, the initial stack randomization that DragonFly does multiplied the time it took to reproduce the bug. But without the stack randomization the bug would NOT reproduce at all (I would never have observed it in the first place). In other words, the bug was *very* stack-address sensitive on top of everything else.

    I was ultimately able to improve the time it took to reproduce the bug by poring over all my previous buildworld runs and finding the .c files that gcc was statistically most likely to seg-fault on. Then, once I isolated the files, I iterated over all possible starting stack offsets and eventually managed to reproduce the bug within 10 seconds using a gcc loop (10-20 gcc runs on the same file).
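
    For the curious, a harness of that general shape might look like the sketch below (not my actual code; it assumes stack randomization is turned off, so that only the junk-environment padding moves the child's initial stack pointer):

        #include <signal.h>
        #include <stdio.h>
        #include <stdlib.h>
        #include <string.h>
        #include <sys/wait.h>
        #include <unistd.h>

        /* Perturb the child's initial stack placement by growing a junk
         * environment variable, then run the same command repeatedly and
         * watch for SIGSEGV. */
        int main(int argc, char **argv)
        {
            if (argc < 2) {
                fprintf(stderr, "usage: %s command [args...]\n", argv[0]);
                return 1;
            }
            for (int pad = 0; pad < 4096; pad += 16) {        /* offsets to try */
                char junk[4096 + 1];
                memset(junk, 'x', (size_t)pad);
                junk[pad] = '\0';
                setenv("STACK_PAD", junk, 1);

                for (int run = 0; run < 20; run++) {          /* 10-20 runs per offset */
                    pid_t pid = fork();
                    if (pid == 0) {
                        execvp(argv[1], &argv[1]);
                        _exit(127);
                    }
                    int status = 0;
                    waitpid(pid, &status, 0);
                    if (WIFSIGNALED(status) && WTERMSIG(status) == SIGSEGV) {
                        printf("segfault at pad=%d, run=%d\n", pad, run);
                        return 0;
                    }
                }
            }
            printf("no segfault observed\n");
            return 0;
        }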

    Changing the stack offset by a mere 16 bytes made the bug go away completely. The one or two particular stack offsets that reproduced the bug could then be further offset in multiples of 32K and still reproduce the bug at the same rate. Using a later version of gcc, the bug disappeared. Compiling with virtually any other options (turning optimizations on or off)... the bug disappeared.

    On the bright side, I thought this was a bug in DragonFly for most of last year and set about 'fixing' it, and wound up refactoring most of DragonFly's VM system to get rid of SMP bottlenecks, making it perform much better on SMP in the face of a high VM fault rate. So even though we wound up not doing the 2.12 release, the eventual 3.0 release (which we just put out recently) has greatly improved cpu-bound performance on SMP systems.

    -Matt

  • It got more complex as it grew. This is all from memory, so there may be some inconsistencies with what I'm saying.

    The main site primarily used 3 subdomains. There were also 3 pay sites. For a while there were 6 machines for video streaming. The video streaming site went away, mostly due to cost vs income. We added free hosting, which was small to start, and picked up popularity and grew to dozens of servers. There were side projects launched on their own servers. Some worked. Some were dropped.

    Masterstats.com was an example of that. In function, it was similar to Google Analytics. It started as 1 DB and 2 web servers. Because of the load thrown at it, it became 3 DB, 4 web, and 2 offline calculation servers. The offline machines were just to process stats that were too intensive to do live, and created an undue load on the web servers.

    Another example was our backups. If you go digging through my journal posts, you'll find me talking about multi TB arrays, back when the largest drives on the market were 250GB. Back in the first iteration, it was fairly simple to keep backups. That grew as we kept throwing more into it.

    These are the rough counts just for the main sites.

    Before I took over, there were 3 subdomains (www, voy, and ww3). Each was on one server. If one server failed, that part of the site stopped. Needless to say, that was bad if (and when) www went down.

    4 servers in one city. All 4 machines had all 3 subdomains.

    8 servers in two cities. That's 4 machines in each of 2 cities. Each set could handle the full load in the event we lost a city. That meant each city handled 50% of the traffic normally, or 100% in a city failure. We could operate on 2 per city, but the extra 2 provided for redundancy.

    When we scaled out to 3 cities, we split the 3 subdomains between groups of servers in each city. We also divided up the load equally, so each city got 33% of the traffic. Having a city fail increased the traffic to each of the other cities by about 17%. The sites' data existed on all the servers, so in the event of a failure of one server, we could distribute that load. So now we had 3 to 5 servers per group, in 3 cities. Newer hardware was faster, so those machines would have combined duties. Clearly a dual 1.4GHz machine could handle more than a single 350MHz machine. At this point, we had retired almost all of the 350MHz machines, except for a few that were recycled to do DNS and other low-load tasks.

    When we got to 5 cities (New York, Los Angeles, San Diego, Tampa, and Frankfurt), the loads got divided up differently. Frankfurt had one 100Mb/s circuit. San Diego had one 100Mb/s circuit. New York had two GigE circuits. Los Angeles had 2 GigE circuits, which grew to 3 when we retired San Diego. Tampa had 2 GigE circuits. Frankfurt was retired, and the traffic was just added to New York. That barely made a blip on the bandwidth graphs.

    The members' sites had a warm, friendly map so people could pick which city to view from. We looked at other options, but it was simple to give them a map to choose where to serve from, and they could pick another city any time they wanted. The forums had people discussing which ones were "faster". It was very subjective. People in the West would pick Los Angeles or San Diego, with most preferring Los Angeles. People in the East picked Tampa or New York. People in Europe said Frankfurt was too slow, and they had better access via New York or Tampa.

    Each city had different load characteristics. Free hosting servers were deployed in 3 cities. There were some special-case servers too. For example, someone with a very high-load "free hosting" site that made a lot of money via the AVS could get their own server or servers.

    Pet projects got their own servers as needed.

    I
