Bug Ubuntu Hardware

Tracking Down a Single-Bit RAM Error (277 comments)

Hanji writes "We have discussed here before the potential effects of and protections against cosmic ray radiation, but for the average computer user, it's an obscure threat that doesn't affect them in any real way. Well, here's a blog post that describes a strange segfault and, after extensive debugging, traces it down to a single bit flip, probably caused by a stray cosmic ray. Lots of helpful descriptions of Linux debugging techniques in this one, and a pretty clear demonstration that this can be a real problem. I know I'm never buying a desktop without ECC RAM ever again!" The author acknowledges that it might not have been a cosmic ray-based error, but the troubleshooting steps are interesting no matter what the cause.
This discussion has been archived. No new comments can be posted.

  • by Darkness404 ( 1287218 ) on Thursday June 24, 2010 @07:21PM (#32685236)
    Wouldn't that be more likely caused by fluctuations in the power supply, though? I'm not an electrical engineer nor an expert on earthquakes, but couldn't a brief loss of power, or a momentary power spike, corrupt the data in RAM?
  • fascinating (Score:5, Insightful)

    by vux984 ( 928602 ) on Thursday June 24, 2010 @07:23PM (#32685254)

    It's interesting to me because my first instinct would have been to assume something got corrupted, and my first step would have been to reboot. If the problem persisted through a reboot, then I might have gone down the rabbit hole in similar fashion to try to find and fix the root cause.

    There are enough software bugs, kernel bugs, driver bugs, hardware hiccups due to marginal equipment, power fluctuation, interference, random noise... and I suppose even cosmic radiation, that I would rarely think to spend the time to trace a transient problem unless it was reproducible across reboots, or at least happened on multiple separate occasions.

  • by Mad Merlin ( 837387 ) on Thursday June 24, 2010 @07:42PM (#32685422) Homepage

    This is one area where AMD is light years ahead of Intel. With Intel, you have to buy a Xeon and a server chipset to have ECC support, which basically is going to run you at least a grand or two just for the CPU and motherboard (at least if you want an i7 based Xeon). AMD on the other hand supports ECC across the board, and you just need a motherboard which supports it, which is most of them (total cost: <$500).

    Thanks for the gouging, Intel!

  • by Anonymous Coward on Thursday June 24, 2010 @08:04PM (#32685622)

    You are on the right track. As someone with over a quarter century of background in combined embedded software and hardware design (the most recent decade for life-dependent systems), it always amazes me how quickly pseudo-technical people jump to wild speculation for observations that they cannot explain.

    They fail to understand that a hardware system is an imperfect representation of the theory (probably the biggest failure in the schooling of software developers, and even some hardware developers, is not getting this message into their heads). While they feel comfort in the theory of a binary system, they utterly fail to understand that our real systems, like us, are imperfect and, like us, live in an analog world. Simple things like temperature variations, noise from common (rather than cosmic) sources, marginal design timing, imperfect components, simple intermittents, etc., are 10^24 times more likely the cause.

    But they're not as fascinating as wild speculation, are they?

  • by Timothy Brownawell ( 627747 ) <tbrownaw@prjek.net> on Thursday June 24, 2010 @08:55PM (#32686022) Homepage Journal

    I'm not sure why you'd want ECC RAM in a desktop, unless it's some sort of business-critical machine that you're willing to spend 5 or 6 times what a normal desktop costs. For day to day use, ECC is overkill.

    My desktop has 8GB of ECC in it. This cost, I think, $40 more than non-ECC, and meant I got an Athlon II X4 instead of a Core i5. That "5 or 6 times what a normal desktop costs" is either bullshit or Intel-onlyism (which is just another kind of bullshit).

  • Re:Reboot? (Score:1, Insightful)

    by Anonymous Coward on Thursday June 24, 2010 @09:08PM (#32686104)
    And leave you in a state of utter ignorance. It isn't about solving it, it's about understanding it.
  • ha! (Score:5, Insightful)

    by serbanp ( 139486 ) on Thursday June 24, 2010 @09:32PM (#32686266)

    The really impressive thing is that this guy resisted the urge to just reboot his machine. Otherwise, the clues would have vanished and the expr binary would have run again without any issue.

    Maybe that's why the first step one takes when something behaves weird on a Windows system is to reboot it...
