Tracking Down a Single-Bit RAM Error

Posted by timothy on Thursday June 24, 2010 @07:10PM from the you'll-need-a-nice-microscope dept.

Hanji writes "We have discussed here before the potential effects of and protections against cosmic ray radiation, but for the average computer user, it's an obscure threat that doesn't affect them in any real way. Well, here's a blog post that describes a strange segfault and, after extensive debugging, traces it down to a single bit flip, probably caused by a stray cosmic ray. Lots of helpful descriptions of Linux debugging techniques in this one, and a pretty clear demonstration that this can be a real problem. I know I'm never buying a desktop without ECC RAM ever again!" The author acknowledges that it might not have been a cosmic ray-based error, but the troubleshooting steps are interesting no matter what the cause.



It's not cosmic. It's from the die/package (Score:5, Informative)

by EmagGeek ( 574360 ) writes: on Thursday June 24, 2010 @07:19PM (#32685214) Journal

Soft errors in DRAM are far more likely to be the result of alpha particle decay from materials in the die and packaging.

Re:erm.... (Score:4, Informative)

by JesseL ( 107722 ) * writes: on Thursday June 24, 2010 @07:21PM (#32685226) Homepage Journal

Would it really be so hard to read the article before posting?

Re:RAM error? (Score:3, Informative)

by marcansoft ( 727665 ) writes: <hector@marcansoft . c om> on Thursday June 24, 2010 @07:43PM (#32685428) Homepage

The perl script will stay cached until something else pushes it out of RAM or until you reboot the system. In general, files are loaded once and stick around for quite a while unless you're low on RAM. In my case, it stayed cached while I investigated it, and I could see the broken character with various viewers. Bad RAM could also cause an intermittent issue if it happened to affect memory used by the Perl interpreter to load the file (that would change each time), but in this case it affected the kernel's file cache, which is quite persistent in the medium or even long term.
I probably had the RAM error for a long time and never noticed. It likely caused a few kernel panics and segfaults along the way, but I probably attributed those to stuff like buggy X11 drivers. The broken Perl script was the first odd thing that I could directly attribute to a RAM problem, later confirmed with memtest86 (the broken bit also matched the change that happened to the character).
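
To make the "broken bit matched the change to the character" check concrete, here is a minimal sketch in Python. The sample strings and the helper name `single_bit_diffs` are made up for illustration and are not taken from the blog post. It XORs a known-good copy against the suspect copy and reports every byte that differs by exactly one bit.

```python
def single_bit_diffs(good: bytes, bad: bytes):
    """Yield (offset, bit_index, good_byte, bad_byte) wherever the two
    buffers differ by exactly one bit at the same offset."""
    for offset, (g, b) in enumerate(zip(good, bad)):
        diff = g ^ b
        # A nonzero value with a single bit set is a power of two.
        if diff and (diff & (diff - 1)) == 0:
            yield offset, diff.bit_length() - 1, g, b

# Hypothetical example data: 'u' (0x75) became 'w' (0x77).
good = b"my $count = 0;"
bad = b"my $cownt = 0;"
for offset, bit, g, b in single_bit_diffs(good, bad):
    print(f"offset {offset}: {chr(g)!r} -> {chr(b)!r}, bit {bit} flipped")
```

If the bit index reported here lines up with the bit position memtest86 flags for the failing address, that is decent evidence the two observations share a cause.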

Re:RAM error? (Score:3, Informative)

by Chris Burke ( 6130 ) writes: on Thursday June 24, 2010 @07:44PM (#32685434) Homepage

Would the perl script be loaded at the same address in RAM every time? Wouldn't that likely be a one-time unrepeatable problem?
If the stuck bit was in the file cache, then it would be repeatable for as long as the script stayed cached, plus you could load the file up in a text editor and see the changed character, etc. Then it would mysteriously go away.

Re:All data channels are noisy (Score:3, Informative)

by Chris Burke ( 6130 ) writes: on Thursday June 24, 2010 @07:52PM (#32685500) Homepage

And no doubt in these super high transistor count and clock frequency CPUs and chips we are using these days there must be devices and methods used inside them to keep the logic transfer and computation validity on the straight and narrow.
Other than ECC on the cache arrays... No. Not a scrap.
If you want reliability on every internal signal and register against cosmic ray strikes, because you're a military or aerospace contractor, you pay beaucoup bucks for it and settle for way less than what we would currently call performance. And even then I highly doubt anyone is actually putting ECC on each and every bus or set of latches. You just radiation harden the device as much as possible, and then use three of them so if one gets the wrong answer because of a particle strike, the other two will out-vote it.
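
The "use three and let two out-vote the third" idea can be sketched as a bitwise 2-of-3 majority vote. This is a toy illustration in Python, not a description of any particular flight computer; the function name and the sample values are assumptions.

```python
def majority_vote(a: int, b: int, c: int) -> int:
    """Bitwise 2-of-3 majority: each output bit takes the value that at
    least two of the three inputs agree on."""
    return (a & b) | (a & c) | (b & c)

correct = 0b1011_0010
upset = correct ^ (1 << 5)   # one of the three units suffers a single-bit upset

# The two healthy units out-vote the one that took the hit.
assert majority_vote(correct, upset, correct) == correct
print(bin(majority_vote(correct, upset, correct)))
```

Voting per bit means the scheme tolerates upsets in different bit positions of different units, as long as no single bit position is wrong in two units at once.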

Also (Score:5, Informative)

by Sycraft-fu ( 314770 ) writes: on Thursday June 24, 2010 @07:56PM (#32685526)

Disks have a lot, and I mean a LOT of ECC on them. It is not a situation of "I need to write a 1 so I'll place one at this location on the drive." They use a complex encoding scheme so that bit errors on the disk don't yield data errors to the user.
Then there's the fact that bits aren't even stored as bits really. All current drives use (E)PRML, which is (Enhanced) Partial Response Maximum Likelihood. What this means is bits aren't encoded as a high-low state or FM wave or any of that. They are written using flux reversals, but the level is not carefully controlled; it can't be. So when you read the data, the drive actually looks at an analogue wave. It samples the partial response it gets, and then finds the most likely pattern that matches.
Sounds like voodoo but works really well. Things are not simple thresholds or the like, it is a complex system and ends up being quite robust and resilient to error.
So it is highly unlikely that you had a bit flipped on a disk. It would require some amazing circumstances to happen. The RAM error is far more likely. Not just the cosmic ray thing but, as the parent noted, bad RAM. Normally when RAM fails, it fails catastrophically and it is immediately apparent. Not always though. It can fail not only at single bit locations, but only during certain ops. That is why memtest does so many different tests. One kind might work fine, another might fail. Rare, but I've seen it on a few systems.
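
A rough sketch of why one test pattern can pass while another fails. The `FlakyMemory` class is a hypothetical stand-in for a DIMM with one bit stuck at 0, and the four fill patterns are a simplification and not memtest's actual test list.

```python
class FlakyMemory:
    """Hypothetical stand-in for RAM: one bit at one address is stuck at 0."""

    def __init__(self, size, bad_addr, bad_bit):
        self.cells = bytearray(size)
        self.bad_addr = bad_addr
        self.bad_mask = 1 << bad_bit

    def write(self, addr, value):
        if addr == self.bad_addr:
            value &= ~self.bad_mask & 0xFF   # the stuck bit never stores a 1
        self.cells[addr] = value

    def read(self, addr):
        return self.cells[addr]


def run_pattern(mem, size, pattern):
    """Fill memory with `pattern`, read it back, and return a list of
    (address, xor_of_expected_and_actual) for every mismatch."""
    for addr in range(size):
        mem.write(addr, pattern)
    return [(addr, mem.read(addr) ^ pattern)
            for addr in range(size) if mem.read(addr) != pattern]


mem = FlakyMemory(size=1024, bad_addr=0x2A7, bad_bit=3)
for pattern in (0x00, 0x55, 0xAA, 0xFF):
    print(f"pattern {pattern:#04x}: {run_pattern(mem, 1024, pattern)}")
```

Patterns 0x00 and 0x55 both leave bit 3 clear, so they pass. Only 0xAA and 0xFF try to store a 1 in the stuck bit and expose it. Real faults that depend on neighbouring cells, timing, or temperature need far more varied tests than this.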

Re:Old, old story (Score:5, Informative)

by Anonymous Coward writes: on Thursday June 24, 2010 @08:00PM (#32685572)

For a more recent analysis (by folks at Google and U.Toronto) see "DRAM Errors in the Wild: A Large-Scale Field Study" in ACM SIGMETRICS/Performance 09.
They did an extensive analysis of DRAM failures from many vendors and debunk several myths as well as indicating that the soft error rate can be much higher than previously thought.
Well worth a read...
http://www.cs.toronto.edu/~bianca/papers/sigmetrics09.pdf

Re:Also (Score:3, Informative)

by marcansoft ( 727665 ) writes: <hector@marcansoft . c om> on Thursday June 24, 2010 @08:24PM (#32685796) Homepage

However, single-bit errors are possible with faulty disk hardware. The cache RAM on the disk or its interface can be flaky, and for PATA disks a bad cable can cause single-bit errors. SATA disks usually catch IO errors since they use a more complicated encoding and make use of checksums.
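
As a small illustration of how a checksum catches this kind of corruption, the sketch below flips one bit in a buffer and compares CRC-32 values before and after. Python's `zlib.crc32` is used as a convenient stand-in; the payload is invented, and nothing here claims to reproduce the exact framing a SATA link uses.

```python
import zlib

# Hypothetical payload standing in for a block read from disk.
payload = bytearray(b"#!/usr/bin/perl\nprint qq(hello\\n);\n")
sent_crc = zlib.crc32(payload)

payload[7] ^= 0x04                   # flip a single bit "in transit"
received_crc = zlib.crc32(payload)

print("corruption detected:", received_crc != sent_crc)
```

A CRC detects every single-bit error in the data it covers, so a flip on the cable gets noticed and the transfer can be retried. It does nothing for a bit that was already wrong before the checksum was computed, such as one corrupted in the drive's cache RAM.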

Re:faulty RAM (Score:3, Informative)

by Burdell ( 228580 ) writes: on Thursday June 24, 2010 @08:31PM (#32685844)

Did you buy all new RAM, or add to existing? If you added to existing, did you test just the new RAM, or with the existing in there as well?
Lots of RAM has different timings these days, and even when the timing is supposed to be the same, I've seen new RAM cause problems with old RAM to surface (possibly also from temperature changes). I had a system with 2G (2x1G) Corsair RAM, and then I added another 2G (2x1G) of the same model Corsair; the system started crashing. I assumed (as most would) that the problem was the new RAM. I ran memtest86+ for about 18 hours on just the new RAM and had no problems. I stuck the original 2G back in and the system crashed; I ran memtest86+ on just the old RAM; no problem. With all 4 sticks in, memtest86+ would show errors. By moving sticks around and figuring out the address mapping on my system, I tracked it down to one of the original sticks. I then ran memtest86+ for about 48 hours on just that stick, and it did eventually show an error (Corsair replaced it and I have had no more problems).
RAM generates a good bit of heat these days, and adding RAM generates even more heat in a small space. My faulty RAM has the heat spreaders included, but the motherboard puts the RAM slots so close together there's still little space for heat to dissipate.

Re:Ugh, single bit errors (Score:4, Informative)

by rudy_wayne ( 414635 ) writes: on Thursday June 24, 2010 @08:36PM (#32685876)

I'm not sure why you'd want ECC RAM in a desktop, unless it's some sort of business-critical machine that you're willing to spend 5 or 6 times what a normal desktop costs on.
This may have been true at one time, but ECC RAM is no longer that expensive. I just looked at prices on Newegg:
8 GB DDR3 $214.99
8 GB DDR3 ECC $274.99
In some cases, depending on the brand and the speed, ECC is actually *CHEAPER*.

Re:Old, old story (Score:3, Informative)

by Timothy Brownawell ( 627747 ) writes: <tbrownaw@prjek.net> on Thursday June 24, 2010 @08:38PM (#32685894) Homepage Journal

as well as indicating that the soft error rate can be much higher than previously thought.
I'm not sure it really does; true they had enormous average (mean) error rates, but it sounded like this was misleading due to an incredibly skewed distribution. Going by the number of servers with zero errors, one error, and multiple errors over a year, and the failures-vs-age data, I came to the conclusion that there's about a 1/5 chance that you'll see one random single-bit error over a typical lifetime (I think I used 5-6 years), but also a similar chance that part of your ram will go bad after a couple years and give you a sudden flood of errors. It would have been very nice if they'd counted servers with 0,1,2,3,...10-20, 20-50, ... etc errors/year (preferably with a pretty graph), instead of only breaking it into zero, one, many.
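
For what it's worth, the 1-in-5 lifetime figure can be turned into an approximate yearly rate. This is a back-of-the-envelope sketch only: it assumes errors arrive independently at a constant rate (a Poisson process), treats "see one error" as "see at least one", and takes 5.5 years as the midpoint of the 5-6 year lifetime mentioned above.

```python
import math

p_lifetime = 1 / 5       # estimate from the comment above
lifetime_years = 5.5     # assumption: midpoint of "5-6 years"

# Poisson model: P(at least one event in T years) = 1 - exp(-rate * T)
rate_per_year = -math.log(1 - p_lifetime) / lifetime_years
p_one_year = 1 - math.exp(-rate_per_year)

print(f"implied rate: {rate_per_year:.4f} errors/year")
print(f"chance of at least one error in a given year: {p_one_year:.1%}")
```

The constant-rate assumption is exactly what the skewed distribution in the paper calls into question, so this only describes the "random single-bit error" case and not the flood of errors from a module that has gone bad.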

Re:Ugh, single bit errors (Score:2, Informative)

by hawguy ( 1600213 ) writes: on Thursday June 24, 2010 @08:50PM (#32685980)

I went to Dell's site and configured a few Dell Desktops (non-ECC) and Workstations (with ECC), and prices were similar for comparable systems. Though the Workstations that supported ECC didn't support many low-end processors, so if I didn't want ECC and didn't care about processor performance I could have gotten a desktop for about 60% of the price of the cheapest workstation with ECC. But I didn't see a 5x increase for ECC.

Re:Ugh, single bit errors (Score:3, Informative)

by besalope ( 1186101 ) writes: on Thursday June 24, 2010 @09:04PM (#32686084)

You'll also need a consumer-level motherboard with ECC support. Those are not common, which means you'll be stuck with a server-grade motherboard, which costs more and can change CPU compatibility, case compatibility, and the features on the board itself.
There's a lot more to making the change from non-ECC to ECC than just swapping out your RAM.

Re:Ugh, single bit errors (Score:3, Informative)

by Timothy Brownawell ( 627747 ) writes: <tbrownaw@prjek.net> on Thursday June 24, 2010 @09:09PM (#32686120) Homepage Journal

You'll also need a consumer-level motherboard with ECC support. Which are not common, which means you'll be stuck with a server-grade motherboard
Or, you know, go AMD. Because they don't limit ECC to only server parts.

Re:It's not cosmic. It's from the die/package (Score:3, Informative)

by Anonymous Coward writes: on Thursday June 24, 2010 @09:20PM (#32686204)

People don't realize that lead is mildly radioactive, and the decay from solders on the connectors or chassis can also cause bit flips. Very old processed lead, such as that used for the roofs of some European cathedrals, has been used to build supercomputers since more of the radioactivity has decayed.
I'm unclear as to how this "processing" of the lead has reduced its natural radioactivity...
Pb-210 is in the U-238 and Rn-222 decay chains, so lead ore in the ground has a constant source of Pb-210 being generated due to uranium contamination. Likewise, radon gas can seep into the lead ore deposits and provide a fresh influx of Pb-210. Once the lead is smelted and purified, the uranium contamination is removed and it's not being exposed to radon so the number of Pb-210 atoms in the sample starts decreasing significantly.
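
To put numbers on "starts decreasing", here is a quick sketch of the exponential decay once the supply of fresh Pb-210 is cut off. The half-life of roughly 22.3 years is a commonly quoted figure and is not from the comment; the sample ages are arbitrary.

```python
PB210_HALF_LIFE_YEARS = 22.3   # approximate, commonly quoted half-life

def fraction_remaining(years: float,
                       half_life: float = PB210_HALF_LIFE_YEARS) -> float:
    """Fraction of the original Pb-210 left after `years` with no new supply."""
    return 0.5 ** (years / half_life)

for age in (10, 50, 100, 500, 2000):
    print(f"{age:>5} years after smelting: "
          f"{fraction_remaining(age):.3e} of the Pb-210 remains")
```

After ten half-lives (a bit over two centuries) less than a thousandth is left, which is why lead that was smelted centuries ago is so attractive for low-background work.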

I've seen this (Score:3, Informative)

by Eil ( 82413 ) writes: on Thursday June 24, 2010 @09:21PM (#32686214) Homepage Journal

A few years ago I came across a thread on a FreeBSD mailing list where a build of some package was failing and the submitter couldn't tell why because he wasn't a developer. The failure was unusual and no one else could reproduce it. Eventually, the problem was traced back to a character in the source differing from the original. The character was a one-bit difference from the correct character, and it was suggested to the submitter that he reboot and memtest his memory. Sure enough, one single bad bit out of around 512MB.

Re:Ugh, single bit errors (Score:3, Informative)

by nabsltd ( 1313397 ) writes: on Thursday June 24, 2010 @09:40PM (#32686314)

Or, you know, go AMD. Because they don't limit ECC to only server parts.
Or, just buy any one of a half-dozen motherboards costing less than $200 [newegg.com] and add a Xeon [newegg.com] that is priced within 5% of the equivalent spec non-Xeon.
Sure, these might not be the best motherboards for gaming (although they are pretty competitive compared to other socket 1156 motherboards), but for a workstation doing everything else, they're great.
And, this way you get a motherboard that is thoroughly tested with ECC RAM (as that's what is expected to be used), and likely far better BIOS control of the ECC.

Re:ha! (Score:2, Informative)

by mpoon ( 1382749 ) writes: on Thursday June 24, 2010 @09:48PM (#32686360)

If you take a look at the website hosting the blog (Ksplice), you might notice that "this guy" works for a company that produces software which eliminates the need for reboots...

Roman ingots to shield particle detector (Score:4, Informative)

by drerwk ( 695572 ) writes: on Thursday June 24, 2010 @10:20PM (#32686466) Homepage

Roman ingots to shield particle detector
http://www.nature.com/news/2010/100415/full/news.2010.186.html [nature.com]

Memory Refresh Timing more likely (Score:3, Informative)

by spydum ( 828400 ) writes: on Thursday June 24, 2010 @10:26PM (#32686496)

I know, cosmic rays sound so much cooler, but it's far more likely he has some crappy memory and/or his memory refresh timings are too high.
DRAM memory cells have to be refreshed pretty often (anywhere from 7.8usec-12usec), otherwise they become unreliable. If his BIOS has the memory timings set to something obscurely long, it may be there are specific rows/cells on his DRAM modules that are too weak to read after bleeding off a bit of charge. Shortening the refresh interval would likely improve the situation, since the memory would refresh its state more often.
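
Some arithmetic on where a figure like 7.8 usec comes from. The sketch assumes the commonly cited arrangement of 8192 refresh commands spread over a 64 ms window; both numbers are assumptions that are not stated in the comment, and actual modules and BIOS settings vary.

```python
RETENTION_WINDOW_MS = 64    # assumption: every row refreshed within this window
REFRESH_COMMANDS = 8192     # assumption: refresh commands issued per window

# Average spacing between refresh commands.
trefi_us = RETENTION_WINDOW_MS * 1000 / REFRESH_COMMANDS
print(f"average interval between refresh commands: {trefi_us:.4f} us")

# Stretching the interval stretches how long each row waits for its turn.
for interval_us in (7.8, 12.0):
    window_ms = interval_us * REFRESH_COMMANDS / 1000
    print(f"{interval_us} us interval -> each row refreshed "
          f"every {window_ms:.1f} ms")
```

Under these assumptions the microsecond figure is the gap between refresh commands, while any individual row waits tens of milliseconds between refreshes. A longer interval gives a leaky cell more time to lose charge before its next refresh.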

Re:Cosmic rays, my ass. Occam's Razor time. (Score:3, Informative)

by w0mprat ( 1317953 ) writes: on Friday June 25, 2010 @12:55AM (#32687224)

Occam's DIMM, I'm afraid. I had a stick of DDR2 that had a stuck bit that caused almost exactly the same issue as TFA. As a hardware geek, not a *nix geek with time to waste, I went straight to memtest86, and there it was, one single stuck bit.

Although interesting, TFA is without a doubt the most pedantic and roundabout way I've ever read of establishing that your rig is not stable.

From TFSA:
And in fact, since that incident, I've had several other, similar problems. I haven't gotten around to memtesting my machine, but that does suggest I might just have a bad RAM chip on my hands.
Yeah, he has a stuck or semi-stuck bit and an hour or two of his life he won't get back.

In such a circumstance I've found underclocking and overvolting the DIMM might coax it to work again but it's best to RMA or bin it.

Re:Ugh, single bit errors (Score:4, Informative)

by billcopc ( 196330 ) writes: <vrillco@yahoo.com> on Friday June 25, 2010 @01:24AM (#32687344) Homepage

Depends on the type of desktop. ECC these days doesn't cost much more than non-ECC... Dell and HP may not want to admit it, but I buy ECC DDR3 all the time as I build a lot of white-box servers, and frankly even the lamest "gaming" Ram carries a higher premium than ECC.
The tricky thing is that while most (all?) current AMD boards can take ECC ram (unbuffered, not registered), no consumer Intel boards can handle ECC - you need to step up to a Xeon processor and chipset. Luckily the single-processor setups don't cost all that much more than their mid-range consumer equivalents, but you do have to sacrifice buzzy features like USB 3.0, SLI/Crossfire, eSATA and overclocking. One exception to this is the EVGA Classified SR-2, which has absolutely everything, but it's $600 and requires a special oversized chassis (or a lot of dremel work).
I'm going to put this out there: if someone is genuinely concerned about bit errors to a degree where the loss of work due to a minor crash or reboot is significant enough, go ahead and spend an extra 10% on ECC. Even if you pack that board with 96gb of memory, it's still cheaper than six months of therapy and thorazine :P
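
For anyone wondering what the extra money buys, here is a toy single-error-correcting code, Hamming(7,4), sketched in Python. It is only an illustration of the principle: ECC DIMMs typically use a wider code (72 bits stored for every 64 data bits) implemented in the memory controller, and the function names here are made up.

```python
def hamming74_encode(nibble: int) -> list:
    """Encode 4 data bits into a 7-bit codeword.
    Positions are numbered 1..7 with parity bits at positions 1, 2 and 4."""
    d = [(nibble >> i) & 1 for i in range(4)]
    code = [0] * 8                       # index 0 unused so index == position
    code[3], code[5], code[6], code[7] = d
    code[1] = code[3] ^ code[5] ^ code[7]
    code[2] = code[3] ^ code[6] ^ code[7]
    code[4] = code[5] ^ code[6] ^ code[7]
    return code[1:]


def hamming74_decode(bits: list) -> tuple:
    """Return (nibble, corrected_position); position is 0 if no error was seen."""
    code = [0] + list(bits)
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos              # XOR of set positions is 0 when valid
    if syndrome:
        code[syndrome] ^= 1              # syndrome names the flipped position
    d = (code[3], code[5], code[6], code[7])
    return sum(bit << i for i, bit in enumerate(d)), syndrome


word = hamming74_encode(0b1011)
word[4] ^= 1                             # flip one stored bit (position 5)
print(hamming74_decode(word))            # → (11, 5)
```

The decoder recovers the original value and reports which position it had to fix. That is the same service ECC RAM provides silently on every read, with the option of logging the event so you know a module is on its way out.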

Re:Takes me back (Score:2, Informative)

by Thanatos81 ( 1305243 ) writes: on Friday June 25, 2010 @02:45AM (#32687626)

To be fair, Gates never said that line. http://en.wikiquote.org/wiki/Bill_Gates#Misattributed [wikiquote.org]

Re:Ugh, single bit errors (Score:2, Informative)

by Anonymous Coward writes: on Friday June 25, 2010 @02:55AM (#32687672)

By only supporting ECC on their expensive server processors.

AMD vs. Intel with ECC, prices in Germany (Score:2, Informative)

by Lonewolf666 ( 259450 ) writes: on Friday June 25, 2010 @04:00AM (#32687860)

Checking at alternate.de (not the cheapest online shop, but good for comparisons because they have both consumer and server parts):
Intel:
The only Socket 775 boards that support ECC seem to be those with the 32xx MCH chipset. Starting at 195 Euros (Asus P5BV-C).
For Socket 1156, the consumer chipsets allow ECC but you still need to find a board with BIOS support. Sadly Alternate does not list the ECC support status, but you might find one that supports ECC among the cheaper ones for 80-90 Euros. You do, however, need a Xeon which starts at 213 Euros (Xeon X3430, 4 x 2.4 GHz)
So mainboard plus a quad CPU costs you around 300 Euros at Alternate.
AMD:
Board situation (Socket AM3) similar to Intel's Socket 1156, boards with ECC support are available for 80-90 Euros.
Unlike Intel, even cheap desktop CPUs support ECC. As a cheap quad, Alternate offers the Athlon II X4 635 for 108 Euros.
So mainboard plus quad CPU costs you around 200 Euros, 100 less than with Intel.
