Bug Ubuntu Hardware

Tracking Down a Single-Bit RAM Error 277

Posted by timothy
from the you'll-need-a-nice-microscope dept.
Hanji writes "We have discussed here before the potential effects of and protections against cosmic ray radiation, but for the average computer user, it's an obscure threat that doesn't affect them in any real way. Well, here's a blog post that describes a strange segfault and, after extensive debugging, traces it down to a single bit flip, probably caused by a stray cosmic ray. Lots of helpful descriptions of Linux debugging techniques in this one, and a pretty clear demonstration that this can be a real problem. I know I'm never buying a desktop without ECC RAM ever again!" The author acknowledges that it might not have been a cosmic ray-based error, but the troubleshooting steps are interesting no matter what the cause.
This discussion has been archived. No new comments can be posted.

  • by Kufat (563166) <.ten.tafuk. .ta. .tafuk.> on Thursday June 24, 2010 @07:15PM (#32685178) Homepage

    One of my computers had an intermittent failure in a RAM chip/line/something somewhere that mostly manifested as SHA/MD5 failures when I was checksumming large files that I'd downloaded. Never showed up in Memtest86, but eventually I eliminated every other possibility. IIRC, I solved it by underclocking the machine and then replacing it when I was able.
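A rough way to look for the kind of intermittent checksum failure described above is to hash the same file several times and see whether the digests ever disagree. This is only a sketch: the function names `file_digest` and `digests_agree` are made up for illustration, and because repeated reads are usually served from the OS cache, a disagreement points at the RAM/CPU path rather than at the disk.

```python
import hashlib
import os
import tempfile

def file_digest(path, algo="sha256", chunk_size=1 << 20):
    """Hash a file in chunks so large downloads do not need to fit in memory."""
    h = hashlib.new(algo)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def digests_agree(path, runs=3):
    """Return True if every run produced the same digest.

    On healthy hardware this is always True; a False result means the
    same bytes hashed differently on different passes.
    """
    digests = {file_digest(path) for _ in range(runs)}
    return len(digests) == 1

# Demo on a throwaway 4 MiB file of random bytes.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(os.urandom(4 * 1024 * 1024))
    demo_path = tmp.name

stable = digests_agree(demo_path)
os.remove(demo_path)
print("digests agree:", stable)
```

A single pass proves little for a fault that only shows up occasionally; running it in a loop on a multi-gigabyte file is closer to what the comment describes.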

    • by Anonymous Coward on Thursday June 24, 2010 @08:04PM (#32685622)

      You are on the right track. As someone with over a quarter century of background in combined embedded software and hardware design (the most recent decade for life-dependent systems), it always amazes me how quickly pseudo-technical people jump to wild speculation for observations that they cannot explain.

      They fail to understand that a hardware system is an imperfect representation of the theory (probably the biggest failure in the schooling of software developers, and even some hardware developers, is not getting this message into their heads). While they feel comfort in the theory of a binary system, they utterly fail to understand that our real systems, like us, are imperfect and, like us, live in an analog world. Simple things like temperature variations, noise from common (rather than cosmic) sources, marginal design timing, imperfect components, simple intermittents, etc., are 10^24 times more likely to be the cause.

      But they're not as fascinating as wild speculation, are they?

      • Not only that, but they are also systems we can only approach from a very abstract perspective when it comes to debugging. Our options to debug complex hardware are very abstract, inaccurate, and incomplete.

        • Single bit errors (Score:3, Interesting)

          by mollog (841386)
          While working as a failure analysis technician at a company that made a disk controller, I came across a single-bit error in static RAM cache that was repeatable. I was lucky to have the software and hardware tools available and I eventually tracked down the failure mode. Setting a bit at a certain location would cause another, different location's bit to get set. Just that one bit. And only if you set it. Resetting it did not cause the other bit to reset.

          This turned out to be a manufacturing problem wit
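The fault described above (setting one bit also sets a different bit, but clearing it does not clear the other) can be sketched with a toy memory model. Everything here is hypothetical: `FaultyRam`, the addresses 5 and 12, and both test routines are invented for illustration and are not the tools used in the comment. The point is that a write-then-read-back check misses this fault, while a walking-one pattern catches it.

```python
class FaultyRam:
    """Toy bit-addressable memory with one coupling fault:
    writing 1 to `aggressor` also sets `victim`. Writing 0 has no side effect."""

    def __init__(self, size, aggressor, victim):
        self.bits = [0] * size
        self.aggressor = aggressor
        self.victim = victim

    def write(self, addr, value):
        self.bits[addr] = value
        if addr == self.aggressor and value == 1:
            self.bits[self.victim] = 1  # the simulated defect

    def read(self, addr):
        return self.bits[addr]

def write_then_read_test(ram, size):
    """Naive test: write each bit and immediately read the same bit back."""
    for addr in range(size):
        for value in (1, 0):
            ram.write(addr, value)
            if ram.read(addr) != value:
                return False
    return True

def walking_one_test(ram, size):
    """Set one bit at a time and check that every other bit is still 0.
    Returns (written_addr, disturbed_addr) on failure, or None if clean."""
    for addr in range(size):
        ram.write(addr, 0)
    for addr in range(size):
        ram.write(addr, 1)
        for other in range(size):
            expected = 1 if other == addr else 0
            if ram.read(other) != expected:
                return (addr, other)
        ram.write(addr, 0)
    return None

SIZE = 16
naive_passed = write_then_read_test(FaultyRam(SIZE, aggressor=5, victim=12), SIZE)
walking_result = walking_one_test(FaultyRam(SIZE, aggressor=5, victim=12), SIZE)
print("naive test passed:", naive_passed)
print("walking-one result:", walking_result)
```

The naive test passes because it never looks at any bit other than the one it just wrote.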
      • by Anonymous Coward on Thursday June 24, 2010 @09:18PM (#32686188)

        On the subject of the imperfect nature of machines, I found this post by Richard D. James (aka Aphex Twin, a noted electronic music composer) quite interesting. He describes how the physical machinery of analog electronic music machines means it is near impossible to duplicate them in digital programs.

        link [archive.org]

        Author: analord
        Date: 02-07-05 03:14

        some people bought the analogue equipment when it was unfashionable and very cheap though.
        some of us are over 30 you know!
        anyone remember when 303`s were £50? and coke was 16p a tin? crisps 5p

        also you have overlooked A LOT of other points because its not all about the overall frequency response of the recording system its how the sound gets there in the first place.
        here are some things which you can`t get from a plugin,they are often emulated but due to their hugely complex nature are always pretty crass aproximations..

        the sound of analogue equpiment including EQ, changes very noticably over even a few hours due to temperature changes within a circuit.
        Anyone who has tried to make tracs on a few analogue synths and make them stay in tune can tell you this,you leave a trac running for a few hours come back and think Im sure I didnt fucking write that,I must be going mental!

        this affects all the components in a synth/EQ in an almost infinte amount of tiny ways.
        and the amount differs from circuit to circuit depending on the design.

        the interaction of different channels and their respective signals with an analogue mixer are very complex,EQ,dynamics....
        any fx, analogue or digital that are plugged into it all have their own special complex characteristics and all interact with each other differently and change depending on their routing.
        Nobody that ive heard of has even begun to start emulating analogue mixer circuitry in software,just the aesthetics,it will come but im sure it will be a crap half hearted effort like most pretend synth plugins are.
        they should be called PST synths, P for pretend not virtual.

        Every piece of outboard gear has its own sound ,reverbs,modulation effects etc
        real room reverb, this in itself companies have spent decades trying to emulate and not even got close in my opinion, even the best attempts like Quantec and EMT only scratch the surface.

        analogue EQ is currently impossible in theory to be emulated digitally,quite intense maths shit involed in this if youre really that interested,you could look it up...good luck.

        your soundcard will always make things sound like its come from THAT soundcard..they ALL impose their different sound characteristics onto whatever comes out of them they are far from being totally neutral devices.

        all the components of a circuit like resistors and capacitors subtley differ from each other depending on their quality but even the most high quality milatary spec ones are never EXACTLY the same.

        no two analogue synths can ever be built exactly the same,there are tiny human/automated errors in building the circuits,tweaking the trimpots for example which is usually done manually in a lot of analogue shit.
        just compare the sound of 2 808 drum machines next to each other and you will see what I mean,you always thought an 808 was an 808 right?
        same goes for 303`s they all sound subltey different,different voltage scaling of the oscillator is usually quite noticable.

        VST plugins are restricted by a finite number of calculations per second these factors are WAY beyond their CURRENT capability.

        Then there is the question of the physicallity of the instrument this affects the way a human will emotionally interact with it and therfore affect what they will actually do with it! often overlooked from the maths heads,this is probably the biggest factor I think.
        for example the smell of analogue stuff as well as the look of it puts y

      • Re: (Score:2, Funny)

        by bitflip (49188)

        It was me.

        Sorry 'bout that.

      • Electronics are designed well within tolerances for temperature and EM interference. At least, good ones are. Since my fans are broken, I've been running the GPU in my Thinkpad to 107C every day for a few years when I play games. No problems yet.

        As someone with over a quarter century of background in

        As someone who hasn't been in school in 30 years, memory loss, sits on the porch with a shotgun hollering at kids, has to call his grandson to install the newfangled Norton Internet Security because you've been

      • Re: (Score:3, Informative)

        by w0mprat (1317953)
        Occam's DIMM I'm afraid. I had a stick of DDR2 that had a stuck bit that caused almost exactly the same issue as TFA. As a hardware geek, not a *nix geek with time to waste, I went straight to memtest86, and there it was, one single stuck bit.

        Although interesting, TFA is without a doubt the most pedantic and roundabout way I've ever read of establishing that your rig is not stable.

        From TFSA:

        And in fact, since that incident, I've had several other, similar problems. I haven't gotten around to memtesting my machine, but that does suggest I might just have a bad RAM chip on my hands.

        Yeah, he has a stuck or semi-stuck bit and an hour or two of his life he won't get back.

        In such a circumstance I'

  • I was hoping this would be more info about the Voyager 2 incident that occurred recently. No doubt, a detailed account of what they recently went through to find and fix the problem would be most interesting.
  • Takes me back (Score:5, Interesting)

    by tsotha (720379) on Thursday June 24, 2010 @07:16PM (#32685196)
    When I was in college one of my physics professors told us he doubted programs would ever get bigger than a few hundred kilobytes because cosmic rays would cause the larger programs to fail too frequently.
    • by griffjon (14945)

      There's a Redmond joke in here somewhere. Regardless, I'm going to start blaming all my typos on bitflips caused by cosmic rays.

    • Re: (Score:3, Funny)

      by Jurily (900488)

      larger programs to fail too frequently

      We showed him right, huh?

  • Easter Earthquake (Score:5, Interesting)

    by ushering05401 (1086795) on Thursday June 24, 2010 @07:17PM (#32685198) Journal

    I don't know about cosmic rays, but immediately following the Easter day Earthquake in Guadalupe Victoria (about three hundred miles from where I was located) I tried to fire up my laptop and then my desktop, both of which had been suspended to RAM. Neither one would wake up, though the lappie displayed a garbled screen. No errors in the log files (Ubuntu 9.10 on the sys76 lappie, Deb Lenny on desktop).

    • by Darkness404 (1287218) on Thursday June 24, 2010 @07:21PM (#32685236)
      Wouldn't that be more likely caused by fluctuations in the power supply though? I'm not an electrical engineer nor an expert on earthquakes, but wouldn't it be possible that a quick loss of power or too high of power for a split second could mess up the data on the RAM?
    • by timeOday (582209)
      In all my years of attempting to use suspend-to-RAM on Linux, it has always been, ahem, highly probabilistic.
  • RAM error? (Score:5, Interesting)

    by Camel Pilot (78781) on Thursday June 24, 2010 @07:19PM (#32685208) Homepage Journal

    Forget a RAM error, I have seen a bit on a file on the disk flip.

    After years of successful operation a Perl script quite working. On investigation a G was transformed to a W a difference of one bit. The file mod date was years old.
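The "difference of one bit" claim is easy to check by XORing the two character codes. A small sketch follows; the helper names are made up, and the extra pairs come from other comments in this thread (`w` to `7`, `P` to `Q`), plus one counterexample.

```python
def flipped_bits(a, b):
    """Return the bit positions (0 = least significant) where two characters differ."""
    diff = ord(a) ^ ord(b)
    return [i for i in range(diff.bit_length()) if (diff >> i) & 1]

def is_single_bit_flip(a, b):
    """True if exactly one bit separates the two characters."""
    return len(flipped_bits(a, b)) == 1

for a, b in [("G", "W"), ("w", "7"), ("P", "Q"), ("G", "H")]:
    print(a, format(ord(a), "08b"), b, format(ord(b), "08b"),
          "differing bits:", flipped_bits(a, b))
```

`G` (0x47) and `W` (0x57) differ only in bit 4, so a single flip does explain the corruption. `G` and `H` look adjacent but differ in four bits, which is why "off by one letter" is not the same as "off by one bit".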

    • Re:RAM error? (Score:5, Interesting)

      by marcansoft (727665) <hector@@@marcansoft...com> on Thursday June 24, 2010 @07:25PM (#32685278) Homepage

      I experienced almost exactly that issue with a RAM error. My system was apparently stable, and then one day I got a syntax error in a system Perl script: one character had changed. The script was owned by root and otherwise untouched. After puzzling over it for quite a while I realized it could be a RAM error and ran memtest86. It reported a single permanently stuck bit in my 512MB of RAM. I found a kernel patch to manually mark problem RAM areas as reserved and kept on running with that RAM for a few years.

      Are you sure that perl script issue was caused by a drive error? A RAM error can cause the same apparent problem, if the corruption happens in the kernel's cache. However, it shouldn't be permanent as it will not be written back to disk (the cache won't be dirty) unless someone actually modifies the file.

      • Would the perl script be loaded at the same address in RAM every time? Wouldn't that likely be a one-time unrepeatable problem?

        • Re: (Score:3, Informative)

          by marcansoft (727665)

          The perl script will stay cached until something else pushes it out of RAM or until you reboot the system. In general, files are loaded once and stick around for quite a while unless you're low on RAM. In my case, it stayed cached while I investigated it, and I could see the broken character with various viewers. Bad RAM could also cause an intermittent issue if it happened to affect memory used by the Perl interpreter to load the file (that would change each time), but in this case it affected the kernel's

        • Re: (Score:3, Informative)

          by Chris Burke (6130)

          Would the perl script be loaded at the same address in RAM every time? Wouldn't that likely be a one-time unrepeatable problem?

          If the stuck bit was in the file cache, then it would be repeatable for as long as the script stayed cached, plus you could load the file up in a text editor and see the changed character, etc. Then it would mysteriously go away.
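One way to test the "it's only wrong in the cache" theory without rebooting is to hash the file, ask the kernel to drop that file's cached pages, and hash it again. This is a sketch under assumptions: it relies on `os.posix_fadvise`, which exists only on Unix-like systems, the `POSIX_FADV_DONTNEED` hint is advisory so the kernel may ignore it, and the helper names are invented.

```python
import hashlib
import os
import tempfile

def sha256_of(path):
    """Hash a file in 1 MiB chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def drop_from_page_cache(path):
    """Hint to the kernel that this file's cached pages can be discarded.
    Returns False where posix_fadvise is unavailable."""
    if not hasattr(os, "posix_fadvise"):
        return False
    fd = os.open(path, os.O_RDONLY)
    try:
        os.fsync(fd)  # dirty pages cannot be dropped, so flush first
        os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)
    finally:
        os.close(fd)
    return True

def cached_copy_matches_disk(path):
    """Compare the digest before and after eviction.
    None means the check could not be performed on this platform."""
    before = sha256_of(path)
    if not drop_from_page_cache(path):
        return None
    return before == sha256_of(path)

# Demo on a throwaway file; on healthy hardware the digests match.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"#!/usr/bin/perl\nprint qq(hello\\n);\n" * 1000)
    demo_path = tmp.name

result = cached_copy_matches_disk(demo_path)
os.remove(demo_path)
print("cached copy matches disk:", result)
```

If the two digests differ, the first read came from a corrupted cached page and the on-disk file is probably fine, which matches the "then it would mysteriously go away" behaviour described above.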

      • Also (Score:5, Informative)

        by Sycraft-fu (314770) on Thursday June 24, 2010 @07:56PM (#32685526)

        Disks have a lot, and I mean a LOT of ECC on them. It is not a situation of "I need to write a 1 so I'll place one at this location on the drive." They use a complex encoding scheme so that bit errors on the disk don't yield data errors to the user.

        Then there's the fact that bits aren't even stored as bits really. All current drives use (E)PRML which is (Enhanced) Partial Response Maximum Likelihood. What this means is bits aren't encoded as a high-low state or FM wave or any of that. They are written using flux reversals, but the level is not carefully controlled, it can't be. So when you read the data the drive actually looks at an analogue wave. It encodes the partial response it gets, and then finds the maximumly likely pattern that matches.

        Sounds like voodoo but works really well. Things are not simple thresholds or the like, it is a complex system and ends up being quite robust and resilient to error.

        So it is highly unlikely that you had a bit flipped on a disk. Would require some amazing circumstances to happen. The RAM error is far more likely. Not just the cosmic ray thing but, as the parent noted, bad RAM. Normally when RAM fails, it fails catastrophically and it is immediately apparent. Not always though. It can fail not only on single bit locations, but also only during certain ops. That is why memtest does so many different tests. One kind might work fine, another might fail. Rare, but I've seen it on a few systems.
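The "partial response, maximum likelihood" idea described above can be illustrated with a toy model. This is a sketch, not how a drive is built: the channel here is an assumed "dicode" response (each sample is the current bit minus the previous one), the noise values are made up, and the detector brute-forces every candidate pattern, whereas real read channels use a Viterbi detector and other response shapes. With Gaussian noise, "most likely" reduces to "smallest squared distance from the observed waveform".

```python
from itertools import product

def ideal_samples(bits):
    """Noise-free readback for a toy dicode channel: y[k] = x[k] - x[k-1]."""
    previous = 0  # assumption: the medium starts in state 0
    samples = []
    for bit in bits:
        samples.append(bit - previous)
        previous = bit
    return samples

def most_likely_bits(received):
    """Brute-force maximum-likelihood detection: choose the bit pattern whose
    ideal waveform has the smallest squared distance to what was read."""
    best_bits, best_error = None, float("inf")
    for candidate in product((0, 1), repeat=len(received)):
        ideal = ideal_samples(candidate)
        error = sum((r - y) ** 2 for r, y in zip(received, ideal))
        if error < best_error:
            best_bits, best_error = list(candidate), error
    return best_bits

written = [1, 0, 1, 1, 0, 0, 1, 0]
noise = [0.3, -0.4, 0.2, 0.45, -0.3, 0.1, -0.45, 0.35]  # made-up read noise
received = [y + n for y, n in zip(ideal_samples(written), noise)]

recovered = most_likely_bits(received)
print("received samples:", [round(r, 2) for r in received])
print("recovered == written:", recovered == written)
```

Several samples land almost halfway between two ideal levels, yet the pattern as a whole is still recovered, because the detector scores entire sequences instead of thresholding each sample on its own.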

        • Re: (Score:3, Informative)

          by marcansoft (727665)

          However, single-bit errors are possible with faulty disk hardware. The cache RAM on the disk or its interface can be flaky, and for PATA disks a bad cable can cause single-bit errors. SATA disks usually catch IO errors since they use a more complicated encoding and make use of checksums.

          • by Vellmont (569020)


            However, single-bit errors are possible with faulty disk hardware.

            I'm sure you're right, but in this case there's essentially no way a disk hardware failure is going to cause the same bit to fail the same way, but no other bits fail.

            In this case, I'd expect it's a bit flip in the OS disk cache.

            • The bit fails while it is read from the disk, then persists in the OS cache. The end result is the same (a corrupted OS cache), but the cause is different, as the bit flipped before it ever made it to the cache.

        • Re:Also (Score:5, Funny)

          by Scaba (183684) <`joe' `at' `joefrancia.com'> on Thursday June 24, 2010 @08:54PM (#32686014)

          Then there's the fact that bits aren't even stored as bits really. All current drives use (E)PRML which is (Enhanced) Partial Response Maximum Likelihood. What this means is bits aren't encoded as a high-low state or FM wave or any of that. They are written using flux reversals, but the level is not carefully controlled, it can't be. So when you read the data the drive actually looks at an analogue wave. It encodes the partial response it gets, and then finds the maximumly likely pattern that matches.

          I doubt this is true. The disk would have to be spinning at 88 mph in order to activate the flux capacitor, and the power brick would need to supply 1.21 gigawatts to the drive, which exceeds the capacity of even the most tricked-out gaming PC. I think you'd better check your science, my friend.

        • by fishexe (168879)
          I shouldn't have spent all my mod points yesterday. I guess my hardware knowledge is obsolete; I had no idea modern HDDs don't store individual bits anymore.
          • Re: (Score:3, Interesting)

            by Sycraft-fu (314770)

            Been common with all kinds of things for some time. CD-ROMs, for example, use EFM, eight-to-fourteen modulation at their most base level. Eight logical bits are actually written as fourteen pits on the disc. Again the reason is error correction. They have more error correction at higher levels, and even more for data CDs. That's why data and audio CDs don't add up. An 80 minute CD holds 700MB of data. But do the math on 44.1kHz, 16-bit, 2-track audio and it takes 800MB of data to hold. Well in data mode the
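The "do the math" step above works out as follows. The sample rate, bit depth, channel count and 80 minute length come from the comment; the sector figures (75 sectors per second, 2352 raw bytes per sector, 2048 user bytes in Mode 1 data sectors) are standard CD values added here as assumptions.

```python
SAMPLE_RATE_HZ = 44_100
BYTES_PER_SAMPLE = 2          # 16-bit audio
CHANNELS = 2
MINUTES = 80

SECTORS_PER_SECOND = 75       # assumption: standard CD sector rate
RAW_SECTOR_BYTES = 2352       # assumption: raw payload of one sector
MODE1_USER_BYTES = 2048       # assumption: user data left in a Mode 1 sector

seconds = MINUTES * 60
audio_bytes = SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * CHANNELS * seconds
sectors = seconds * SECTORS_PER_SECOND
data_bytes = sectors * MODE1_USER_BYTES

MIB = 1024 * 1024
print(f"audio capacity: {audio_bytes:,} bytes ({audio_bytes / MIB:.1f} MiB)")
print(f"data capacity:  {data_bytes:,} bytes ({data_bytes / MIB:.1f} MiB)")
print(f"per-sector overhead in data mode: {RAW_SECTOR_BYTES - MODE1_USER_BYTES} bytes")
```

The audio figure comes out to roughly 807 MiB and the data figure to roughly 703 MiB, which lines up with the "800MB" versus "700MB" gap the comment points at.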

        • We use Dell PowerEdge servers with Dell OpenManage software installed. When a single bit error occurs, it will log a warning with regard to the module at fault. I've cleared the log and reseated the ECC RAM module only for it to happen again within a few minutes.

          So yes, silicon chips (or gates inside) go bad. I can't tell you why or how exactly, just that they do.

      • by JWSmythe (446288)

        I'd second the idea of a filesystem error. I had a mystery error show up similar to what he described. Someone modified one of my files, only changing one character. I was the only one with access to the machine. I fixed it, and voila, problem solved. A few weeks later, filesystem errors started showing up in the system log. It was a failing drive, not just a dirty filesystem. It must have been cosmic radiation that damaged the disk. :)

    • Aha, my plan worked perfectly *rubs hands in delight*. I hack the entire internet at once by flipping single bits on a large number of machines. The maths is kind of chaotic. It's fun to track viruses as ant-algorithm analogies too.
    • by Vellmont (569020)

      How did you verify it was actually on the disk, and not read from disk cache in memory?

      Disk sectors have CRC checksums on them, so it's just extremely unlikely the bits flipped on the physical medium. It seems even less likely the bit got flipped somehow that caused a write to disk (and your file mod date would suggest this was unlikely as well).
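To illustrate why a per-sector checksum makes a silent single-bit flip on the medium so unlikely, here is a sketch using `zlib.crc32` as a stand-in. Real drives use their own, stronger error-correcting codes, so CRC-32 and the 512 byte sector size are assumptions made only for the demonstration.

```python
import zlib

def undetected_single_bit_flips(data):
    """Flip every bit of `data` in turn and count how many flips leave
    the CRC-32 unchanged (that is, would go unnoticed)."""
    good_crc = zlib.crc32(data)
    buf = bytearray(data)
    missed = 0
    for byte_index in range(len(buf)):
        for bit in range(8):
            buf[byte_index] ^= 1 << bit      # inject the error
            if zlib.crc32(buf) == good_crc:
                missed += 1
            buf[byte_index] ^= 1 << bit      # restore the original bit
    return missed

# A made-up 512 byte "sector" of script text.
sector = (b"print 'Hello, world';\n" * 24)[:512]
missed = undetected_single_bit_flips(sector)
print(f"{len(sector) * 8} single-bit flips tried, {missed} undetected")
```

None of the 4096 possible single-bit flips gets past the check, so a lone flipped bit in a stored sector would surface as a read error, not as a `G` quietly turning into a `W`.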

    • by hondo77 (324058)

      Forget a RAM error, I have seen a bit on a file on the disk flip.

      After years of successful operation a Perl script quite working. On investigation a G was transformed to a W a difference of one bit. The file mod date was years old.

      Ditto, except it was something like a w to a 7.

    • Re: (Score:3, Funny)

      by Rinikusu (28164)

      /*After years of successful operation a Perl script quite working*/

      And a bit flipped to an e?

    • by Xyrus (755017)

      Quit working? I'm surprised that didn't turn your perl script into pong.

    • Re: (Score:3, Funny)

      by petsounds (593538)

      And 10,000 years from now, your Perl script has become the complete works of Shakespeare...

  • by EmagGeek (574360) <gterich AT aol DOT com> on Thursday June 24, 2010 @07:19PM (#32685214) Journal

    Soft errors in DRAM are far more likely to be the result of alpha particle decay from materials in the die and packaging.

    • Re: (Score:3, Interesting)

      by cusco (717999)
      People don't realize that lead is mildly radioactive, and the decay from solders on the connectors or chassis can also cause bit flips. Very old processed lead, such as that used for the roofs of some European cathedrals, has been used to build supercomputers since more of the radioactivity has decayed.
      • Re: (Score:3, Interesting)

        by Vellmont (569020)


        Very old processed lead, such as that used for the roofs of some European cathedrals, has been used to build supercomputers since more of the radioactivity has decayed.

        That sounds a bit fishy.

        I _think_ I might be willing to believe the radioactivity of lead, presumably from contamination through some other source radioactive mineral in the ore that decays into radioactive lead. What I have a hard time believing though is that supercomputer makers wouldn't just use non-lead solder, which has been around for

        • by JesseL (107722) *

          Never worked with lead-free solder have you?

          It's only very recently that it's become practical for widespread use and it's still not settled how well it will work in applications that require maximum reliability. The problems with higher melting points, reduced wetting, tin whiskers, appropriate fluxes, etc. took a long time to sort out.

          I'm sure that when a lot of early supercomputers were being built the components used would have been destroyed by the temperatures required to solder without lead.

  • faulty RAM (Score:5, Interesting)

    by mojo-raisin (223411) on Thursday June 24, 2010 @07:21PM (#32685228)

    I've been working with some large microarray datasets recently, and so had to double my computer's memory to 8GB.

    As I've done for years, I went to Fry's to get some Corsair chips... installed F13 64bit to replace my older 32bit distro... and crash-o-matic began. Mostly from Chrome and Mercurial.

    I ran memtest86+ and sure enough, verified my first purchase of faulty memory.

    So, I went back to Fry's and exchanged for another pair of Corsair 2GB chips. This time, I ran memtest86+ first thing... ANOTHER bad set, so back it went to Fry's.

    *Third* set of memory was Kingston, and a trip through memtest86+ verified no errors. Yay!

    Computer has been stable, too.

    With more and more RAM in computers, my next box will have ECC.

    • by jd (1658)

      As RAM gets ever-larger, densities get ever-greater, and the energy requirements for corruption get ever-smaller, the amount of error-correction needed is going to increase. That seems obvious. Well, to an extent. There are space-rated chips that use lead-lined casing to make them radiation-resistant. Having the motherboard run cooler will decrease the thermally-generated random noise in the system. If you're using a full-immersion system, the coolant might easily absorb some of the cosmic rays not otherwis

    • Re: (Score:3, Informative)

      by Burdell (228580)

      Did you buy all new RAM, or add to existing? If you added to existing, did you test just the new RAM, or with the existing in there as well?

      Lots of RAM has different timings these days, and even when the timing is supposed to be the same, I've seen new RAM cause problems with old RAM to surface (possibly also from temperature changes). I had a system with 2G (2x1G) Corsair RAM, and then I added another 2G (2x1G) of the same model Corsair; the system started crashing. I assumed (as most would) that the pr

    • Almost always it's a memory timing issue. This is especially true for "performance" type memory where the timing tables stored on the SPD chip are not optimal. If you have an ASUS motherboard or some such, they will generally have a plethora of RAM timing and voltages you can adjust to compensate. Unless you're a tuner/gamer I would avoid this genre of RAM.

      Not trying to slashvertise here, but purchase your memory from Crucial.com. They provide a nice drill-down menu that will display the *exact* module you ne

  • fascinating (Score:5, Insightful)

    by vux984 (928602) on Thursday June 24, 2010 @07:23PM (#32685254)

    It's interesting to me because my first instinct would have been to assume something got corrupted and my first step would have been to reboot. If the problem persisted through a reboot then I might have gone down the rabbit hole in similar fashion to try and find and fix the root cause.

    There are enough software bugs, kernel bugs, driver bugs, hardware hiccups due to marginal equipment, power fluctuation, interference, random noise... and I suppose even cosmic radiation that I would rarely think to spend the time to trace a transient problem unless it was reproducible across reboots, or at least happened on multiple separate occasions.

  • Some of the nicer boards will tolerate ECC memory being inserted, but won't actually do any meaningful error correction (like scrubbing) - but a disturbingly large number of consumer boards (BIOS limitation perhaps?) don't actually do ANYTHING with ECC memory, and the really cheap ones won't even boot with it present. I used to have the same mindset of purchasing only ECC RAM for the same reason - but the unfortunate truth is that hardware support for it just isn't there without spending $$$ on a decent boa

    • Re: (Score:3, Insightful)

      by Mad Merlin (837387)

      This is one area where AMD is light years ahead of Intel. With Intel, you have to buy a Xeon and a server chipset to have ECC support, which basically is going to run you at least a grand or two just for the CPU and motherboard (at least if you want an i7 based Xeon). AMD on the other hand supports ECC across the board, and you just need a motherboard which supports it, which is most of them (total cost: <$500).

      Thanks for the gouging Intel!

  • by mirix (1649853) on Thursday June 24, 2010 @07:24PM (#32685262)
    I would think it's more likely there are trace radioactive elements in the epoxy the chip is encapsulated in.

    Actually, I recall reading that in the early solid state memory days, they had problems with this. I don't remember what the solution was, but I thought it was to make the circuit somewhat resilient to it, as it was impossible to get 100% neutral epoxy, there's always going to be traces of something radioactive.

    I think they tested the cosmic ray theory by running the same chip with and without lead shielding, and did not find a significant difference in errors, they then assumed it was impurities in the chips themselves decaying.
    • by overshoot (39700)

      I recall reading that in the early solid state memory days, they had problems with this. I don't remember what the solution was, but I thought it was to make the circuit somewhat resilient to it, as it was impossible to get 100% neutral epoxy,

      The worst problem was with ceramic DIP packages -- the really good ones for when you needed reliability (partly because the plastic ones tended to allow moisture to get in, and then condensation on thermal cycling.) The standard ceramic packaging material containe

  • Old, old story (Score:5, Interesting)

    by jmichaelg (148257) on Thursday June 24, 2010 @07:30PM (#32685318) Journal

    Back in the early 80's, HP published a paper on random bit errors in RAM. They looked at chips from a variety of vendors and determined that the RAM coming out of Japan was the most reliable. That paper caused a lot of US RAM vendors to shutter their doors as there was a sea change in purchasing habits.

    A few years later, I ran into John Sculley while we were waiting for a flight. I mentioned the paper to him and asked him how Apple could seriously expect to sell a Macintosh specifically aimed at the scientific community if it didn't have ECC. He blithely said "it's not a problem..." 20+ years later, most of us still don't have ECC, so it seems he was right.

    • Re:Old, old story (Score:5, Informative)

      by Anonymous Coward on Thursday June 24, 2010 @08:00PM (#32685572)

      For a more recent analysis (by folks at Google and U.Toronto) see "DRAM Errors in the Wild: A Large-Scale Field Study" in ACM SIGMETRICS/Performance 09.

      They did an extensive analysis of DRAM failures from many vendors and debunk several myths as well as indicating that the soft error rate can be much higher than previously thought.

      Well worth a read...

      http://www.cs.toronto.edu/~bianca/papers/sigmetrics09.pdf

       

      • Re: (Score:3, Informative)

        as well as indicating that the soft error rate can be much higher than previously thought.

        I'm not sure it really does; true they had enormous average (mean) error rates, but it sounded like this was misleading due to an incredibly skewed distribution. Going by the number of servers with zero errors, one error, and multiple errors over a year, and the failures-vs-age data, I came to the conclusion that there's about a 1/5 chance that you'll see one random single-bit error over a typical lifetime (I think I used 5-6 years), but also a similar chance that part of your ram will go bad after a couple

    • Actually all of Apple's "pro" products(ie Mac Pros and XServes) DO have ECC ram, a decision that actually caused quite an uproar in the mac community when it was first introduced(with the g5 powermac IIRC). However it has yet to trickle down into any of Apple's other products, which are all 100%* based on laptop components. Do laptops even have ECC ram? With RAM densities increasing bit errors like the one mentioned in the article are only going to increase.

      *The quad core iMacs have Desktop CPUs in th
  • I'm putting tinfoil hats on all of my servers, right away!
  • I had a mysql replication server which was reading SQL commands from a binary log on a master server. One day after years of operation I noticed an update failed. I didn't see anything at first by looking at the query, but when I looked closely I noticed the query had a single character changed, and of that character only one bit had changed. It was something like a P becoming a Q and thus giving a syntax error.

    True story.

  • Cosmic ray events tend to affect multiple neighboring transistors. For this reason, they tend to affect multiple bits. However, by laying out memory cells so immediate neighbors are from different locations, the ability of single-error-correction, double-error-detection (SECDED) methods to detect most events is usually preserved.

    The main concern is for structures with no error correction, such as the gates in the processor pipeline. Several research ideas have been put forward. See here (PDF) [umich.edu] for a good over
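The SECDED scheme mentioned above can be shown at toy scale with an extended Hamming (8,4) code: four data bits, three Hamming parity bits, and one overall parity bit. This is a sketch for illustration; ECC DIMMs use wider codes (commonly 64 data bits plus 8 check bits), and the function names and return values here are invented.

```python
def encode(nibble):
    """Encode 4 data bits into an 8-bit codeword.
    Index 0 is the overall parity bit; indices 1..7 are Hamming positions."""
    code = [0] * 8
    code[3] = (nibble >> 0) & 1
    code[5] = (nibble >> 1) & 1
    code[6] = (nibble >> 2) & 1
    code[7] = (nibble >> 3) & 1
    code[1] = code[3] ^ code[5] ^ code[7]   # covers positions 1, 3, 5, 7
    code[2] = code[3] ^ code[6] ^ code[7]   # covers positions 2, 3, 6, 7
    code[4] = code[5] ^ code[6] ^ code[7]   # covers positions 4, 5, 6, 7
    code[0] = sum(code[1:]) % 2             # makes total parity even
    return code

def decode(code):
    """Return (nibble, status). Single-bit errors are corrected;
    double-bit errors are reported and yield nibble=None."""
    code = list(code)
    syndrome = 0
    for position in range(1, 8):
        if code[position]:
            syndrome ^= position            # XOR of the positions holding a 1
    overall_parity = sum(code) % 2

    if syndrome == 0 and overall_parity == 0:
        status = "ok"
    elif overall_parity == 1:
        # Odd parity means one bit flipped; the syndrome names it
        # (syndrome 0 means the overall parity bit itself, index 0).
        code[syndrome] ^= 1
        status = "corrected"
    else:
        return None, "double-bit error detected"

    nibble = code[3] | code[5] << 1 | code[6] << 2 | code[7] << 3
    return nibble, status

word = encode(0b1011)
one_flip = list(word); one_flip[6] ^= 1
two_flips = list(one_flip); two_flips[2] ^= 1
print(decode(word))
print(decode(one_flip))
print(decode(two_flips))
```

This also shows why the interleaved layout matters: if one particle strike upsets two physically adjacent cells that belong to the same codeword, the code can only report the error, while two cells from different codewords give two independently correctable single-bit errors.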

  • by talcite (1258586)
    I just read the article and it's quite good. The author goes into detail about how he used a series of checksums and source verification to find the bug, isolate it and fix it. I found it quite fascinating and I recommend reading it if you have a few minutes of time.
  • by fava (513118)

    The article author has obviously never used windows. SOP would be a reboot, which would have solved the problem.

    The whole thing would have taken minutes.

  • by GNUALMAFUERTE (697061) <.almafuerte. .at. .gmail.com.> on Thursday June 24, 2010 @09:00PM (#32686048)

    The guy that posted this is a Ksplice developer. In case you didn't know, Ksplice allows you to patch your running kernel without rebooting. Nice.

    Anyway, this guy sees a random memory error. He conveniently goes on a debugging rampage, while we all know the most logical first step would be rebooting that damn machine. Random memory errors do happen.

    He says he "hasn't gotten around" to memtesting his RAM yet. So, let me get this straight ... he implies that random cosmic rays caused the error, but he hasn't yet tested his RAM for what is the most probable cause of the issue?

    Then he goes on to explain that you don't even need to reboot your machine due to damn cosmic radiation. Or kernel updates. Because you have Ksplice.

    Come on.

  • I've seen this (Score:3, Informative)

    by Eil (82413) on Thursday June 24, 2010 @09:21PM (#32686214) Homepage Journal

    A few years ago I came across a thread on a FreeBSD mailing list where a build of some package was failing and the submitter couldn't tell why because he wasn't a developer. The failure was unusual and no one else could reproduce it. Eventually, the problem was traced back to a character in the source differing from the original. The character was a one-bit difference from the correct character, and it was suggested to the submitter that he reboot and memtest his memory. Sure enough, one single bad bit out of around 512MB.

  • ha! (Score:5, Insightful)

    by serbanp (139486) on Thursday June 24, 2010 @09:32PM (#32686266)

    The really impressive thing is that this guy resisted the urge to just reboot his machine. Otherwise, the clues would have vanished and the expr binary would have run again without any issue.

    Maybe that's why the first step one takes when something behaves weird on a Windows system is to reboot it...

    • Re: (Score:2, Informative)

      by mpoon (1382749)
      If you take a look at the website hosting the blog (Ksplice), you might notice that "this guy" works for a company that produces software which eliminates the need for reboots...
  • http://www.jpl.nasa.gov/news/news.cfm?release=2010-151 [nasa.gov]

    Mission managers at NASA's Jet Propulsion Laboratory in Pasadena, Calif., had been operating the spacecraft in engineering mode since May 6. They took this action as they traced the source of the pattern shift to the flip of a single bit in the flight data system computer that packages data to transmit back to Earth.

  • I sure am glad my OS and hardware can detect and correct memory errors on the fly and disable the dimms if need be. I know this is a linux-fest, but Solaris fault management is pretty awesome. I've seen it detect a failing cpu, evacuate the memory attached to it and disable the cpu without a hiccup.

  • by spydum (828400) on Thursday June 24, 2010 @10:26PM (#32686496)

    I know, cosmic rays sound so much cooler, but it's far more likely he has some crappy memory and/or his memory refresh timings are too high.

    DRAM memory cells have to be refreshed pretty often (anywhere from 7.8usec-12usec), otherwise they become unreliable. If his BIOS has the memory timings set to something obscurely long, it may be there are specific rows/cells on his DRAM modules that are too weak to read after bleeding off a bit of charge. Changing the refresh timing would likely improve the situation, causing the memory to refresh its state more often.
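For context on where a figure like 7.8 usec comes from: it is the gap between refresh commands, not how often any one cell is refreshed. A quick sketch of the arithmetic, assuming the commonly quoted DDR figures of a 64 ms retention window covered by 8192 refresh commands (neither number appears in the comment):

```python
RETENTION_WINDOW_MS = 64.0          # assumption: every row refreshed within 64 ms
REFRESH_COMMANDS_PER_WINDOW = 8192  # assumption: refresh commands per window

def refresh_interval_us(window_ms=RETENTION_WINDOW_MS,
                        commands=REFRESH_COMMANDS_PER_WINDOW):
    """Average time between refresh commands, in microseconds."""
    return window_ms * 1000.0 / commands

def window_for_interval_ms(interval_us, commands=REFRESH_COMMANDS_PER_WINDOW):
    """How long each row waits between refreshes for a given command interval."""
    return interval_us * commands / 1000.0

nominal_us = refresh_interval_us()
print(f"nominal interval: {nominal_us} us")
print(f"rows wait {window_for_interval_ms(12.0):.1f} ms if the interval is stretched to 12 us")
```

Under those assumptions, stretching the interval from 7.8 to 12 usec means each row goes about 98 ms between refreshes instead of 64 ms, which is the sort of margin a weak cell might not survive.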
