Why I'm Usually Unnerved When Modern SSDs Die on Us (utoronto.ca) 358

Chris Siebenmann, a Unix Systems Administrator at the University of Toronto, writes about the inability to figure out the root cause when an SSD dies: What unnerves me about these sorts of abrupt SSD failures is how inscrutable they are and how I can't construct a story in my head of what went wrong. With spinning HDs, drives might die abruptly but you could at least construct narratives about what could have happened to do that; perhaps the spindle motor drive seized or the drive had some other gross mechanical failure that brought everything to a crashing halt (perhaps literally). SSDs are both solid state and opaque, so I'm left with no story for what went wrong, especially when a drive is young and isn't supposed to have come anywhere near wearing out its flash cells (as this SSD was).

(When a HD died early, you could also imagine undetected manufacturing flaws that finally gave way. With SSDs, at least in theory that shouldn't happen, so early death feels especially alarming. Probably there are potential undetected manufacturing flaws in the flash cells and so on, though.) When I have no story, my thoughts turn to unnerving possibilities, like that the drive was lying to us about how healthy it was in SMART data and that it was actually running through spare flash capacity and then just ran out, or that it had a firmware flaw that we triggered that bricked it in some way.

  • by gweihir ( 88907 ) on Tuesday December 11, 2018 @11:21AM (#57786276)

    Seriously, you do not. You may know the end-result sometimes (head-crash), but the root-cause is usually not clear.

    So get over it. It is a new black-box replacing an older black-box.

    • by 110010001000 ( 697113 ) on Tuesday December 11, 2018 @11:25AM (#57786300) Homepage Journal
      What is unnerving is that a guy from the Department of Computer Science thinks that SSDs are theoretically immune to manufacturing failures.
      • by froggyjojodaddy ( 5025059 ) on Tuesday December 11, 2018 @11:33AM (#57786354)
        From the article:

        "Further, when I have no narrative for what causes SSD failures, it feels like every SSD is an unpredictable time bomb. Are they healthy or are they going to die tomorrow? "

        Emphasis mine. I feel like this guy has opportunities to improve his coping mechanism. For someone in Computer Sciences, it seems like he's way too worried about this. I'm not trying to be mean, but it's like if I got into a car accident and then questioned the entire safety design of all vehicles rather than just taking a few steps back and understanding it's a freak event, but not a totally unexpected one. If you've been driving for 30 years, statistically, you're likely to get into at least one accident, even if it's not your fault.
        • Or to learn what causes SSDs to fail. Just because something appears unpredictable doesn't mean that it is so. If he doesn't have the time to devote to investigating this issue and acquiring the requisite knowledge that will help him to uncover the truth, then he probably shouldn't be squandering any of that precious time whining or worrying about things that are out of his control.
        • Ok, I'm in IT and it unnerves me. I've had numerous computers have an SSD totally die and lose all data with no smart warnings in the last few years. (Not me personally, I mean people at our organization)
          • by Anonymous Coward on Tuesday December 11, 2018 @12:39PM (#57786922)

            All for the SAME reason- the wrong type of cell failed, and the crappy software doesn't know how to recover. The software systems of the SSD and the OS driver side are written by idiots.

            A low level tool that knows your particular SSD drive's chipset could trivially access the vast majority of flash cells on your SSD drive. But what good is that FACT if the tools are not readily available?

            And SMART warnings do NOT apply to SSD drives. SMART is for electro-mechanical systems with statistical models of gradual failure. SMART is FAKED for SSD.

            A catastrophic SSD failure is when the 'wrong' memory cell dies, and the software locks up. Since all memory cells are equally likely to die at some point, this is a terrible fault of many of these drives.

            • All for the SAME reason- the wrong type of cell failed, and the crappy software doesn't know how to recover.

              So it is basically bad software? Are there SSD brands with less crappy software than others?

              Is there data on reliability, like there is for HDDs?

              To be fair, I believe this is becoming less of a problem. I saw SSDs fail often in the early days of flash, but not recently.

            • This is why I try to buy more expensive and higher performance SSD drives (like the Samsung EVO line) - but I have to admit I have absolutely zero idea if the chipset on the more expensive drives is really any better at all. It just seems likely the design would be better in some ways or a bit more fault tolerant.

              Even that strategy I know can fail though, a few years back one of the most expensive Sandisk Pro SD cards just died out of the blue. It happened while I was at a photography convention where San

              • by gweihir ( 88907 ) on Tuesday December 11, 2018 @02:28PM (#57787640)

                Well, I originally bought OCZ. Today _all_ of 5 OCZ drives I got are stone-dead. After that I moved to Samsung, mostly "Pro". They are all still working fine and some are older now than the first OCZ when it died. So yes, it makes a difference. Incidentally, Samsung had excellent reliability in their spinning drives as well. It seems they just care more about quality and reputation.

                That said, I find it sad that you cannot get "high reliability" SSDs where you basically can forget about the risk of them dying. I am talking reliability levels like a typical CPU here. It seems the market for that is just not there.

                  • You must have missed how Samsung royally screwed up with the 840 and 840 EVO firmware. Or on the mechanical side of things, look up how they messed up the SpinPoint F4's firmware and tried to hide it ;)
                  Not biased against Samsung or anything, as I still have several SpinPoint F3s in service, as well as a bunch of 840 Pros and 850 EVOs.
              • Nope. You're not paying for different control ICs. Where you actually get something from paying more is higher speed or higher yield rates on the memory chips.

                Higher yield rates will translate into lower runtime failure rates.

                You're not going to learn much from the wrong side of the controller, because customers at all levels refuse to pay extra for built-in forensics. And you'd have to choose between extra silicon that normally isn't even used, or extra power use. It won't be free.

                You have to get

            • by viperidaenz ( 2515578 ) on Tuesday December 11, 2018 @02:37PM (#57787698)

              SMART should be able to provide the number of remapped sectors. There should be manufacturer specific counters for the amount of over provisioning that is left for remapping too. That should tell you precisely when you should plan to replace an SSD due to age.
              How hard would it be to notify the host that the drive can't handle any more dead cells, so it should not be written to any more? Or that it is down to x% of spare NAND?
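A minimal sketch of what polling those counters could look like via smartctl's JSON output (assumes smartmontools 7.0+ for --json; the attribute IDs and names below are vendor-specific examples, not a universal standard):

```python
#!/usr/bin/env python3
"""Sketch: warn when an SSD looks close to exhausting its spare area.
The watched attribute IDs are Samsung-style examples and vary by vendor."""
import json
import subprocess
import sys

WATCHED = {
    5:   "Reallocated_Sector_Ct",    # blocks remapped so far
    177: "Wear_Leveling_Count",      # normalized wear indicator on some drives
    232: "Available_Reservd_Space",  # spare NAND remaining on some drives
}

def check(dev: str) -> None:
    out = subprocess.run(["smartctl", "-A", "--json", dev],
                         capture_output=True, text=True, check=False)
    if not out.stdout:
        print(f"no output from smartctl for {dev}", file=sys.stderr)
        return
    table = json.loads(out.stdout).get("ata_smart_attributes", {}).get("table", [])
    for attr in table:
        if attr["id"] in WATCHED:
            # "value" is the normalized health figure (starts near 100 and falls
            # toward "thresh" as the drive wears); "raw" is the raw counter.
            print(f'{dev} {attr["name"]}: normalized={attr["value"]} '
                  f'thresh={attr["thresh"]} raw={attr["raw"]["value"]}')
            if attr["value"] <= attr["thresh"] + 10:
                print(f'  WARNING: {attr["name"]} is near its threshold; plan a replacement')

if __name__ == "__main__":
    check(sys.argv[1] if len(sys.argv) > 1 else "/dev/sda")
```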

            • So much wrong in so little post, where to start:

              The software systems of the SSD and the OS driver side are written by idiots.

              Hardly. The software systems of SSD are written by people who know SSDs well. That you bought an OCZ drive is just unlucky. Firmware related failures were only common in the early days of SSDs.

              A low level tool that knows your particular SSD drive's chipset could trivially access the vast majority of flash cells on your SSD drive.

              And it would have no idea what to do with them, because wear leveling is not something you can predict and decode later. You can only store it. If the component which stores this knowledge is dead, then nothing can save you.

              And SMART warnings do NOT apply to SSD drives. SMART is for electro-mechanical systems with statistical models of gradual failure. SMART is FAKED for SSD.

              SMART is a system for drives to report metrics. Nothing

          • Comment removed based on user account deletion
            • I believe Intel SSDs are programmed to "self brick" when they fail, or at least they used to be. I remember thinking that was a spectacularly stupid way to fail, and the read-only mode would be much preferable. Yes, your computer will likely crash hard in short order either way, but at least with read-only mode you could get (most of) your most recent data off it

            • by dgatwood ( 11270 )

              It's usually because of controller or RAM / cache errors in processing that corrupt the firmware or the dynamic LBA flash block allocation table (database). This renders the rest of the NAND flash partially or totally inaccessible. Quality "prosumer" drives are supposed to have extra hardware (capacitance) to prevent half-writes upon a dirty shutdown (abrupt loss of power). But regardless, any corruption on write-back can render the drive "bricked".

              And by this, you mean that some really bad SSD manufactur

        • Backup your data frequently. Stop worrying. Is that so hard?
      • by ctilsie242 ( 4841247 ) on Tuesday December 11, 2018 @11:57AM (#57786562)

        Could be worse. At a previous job, I've had someone demand "7200 RPM SSDs", and no amount of explaining could change the person's mind.

      • by jellomizer ( 103300 ) on Tuesday December 11, 2018 @12:36PM (#57786902)

        That is why I always stick to reel to reel 9 track paper tape. If you can't see the bits, you just can't trust it.

      • Exactly. I've had bad DRAM before which caused the occasional inexplicable crash. I don't see any reason why SSDs would somehow be immune from this.

        That said, most SMART codes are for mechanical hard drives. I wouldn't be surprised to discover that there isn't really a good way to test reliability for SSDs, so the SMART codes always come back as "A-OK!"

      • The author mentions manufacturing errors as a possible source, but I think his question is: an error in what? And if it's an error in the silicon, why would it only show up after months of operation? Some people have more curiosity about the things they use, and want more of an explanation than "oh sometimes they just fail."
        • That's nice, but that isn't relevant to what I wrote. I commented that it is unnerving that he thinks that SSDs are theoretically immune to manufacturing failures. There are a lot of reasons why an SSD can fail. Solder joints can fail. There are various bonds that can also fail.
    • by AmiMoJo ( 196126 ) on Tuesday December 11, 2018 @11:44AM (#57786450) Homepage Journal

      Often SSD failures can be predicted or at least diagnosed by looking at SMART data. That's what it's for, after all. Some manufacturers provide better data than others.

      Like HDDs, sometimes the electronics die too. Usually a power supply issue. Can be tricky to diagnose. SSDs are slightly worse: with HDDs you can often replace the controller PCB and get them working again, whereas SSDs are a single PCB with the controller and memory.

      • by Comboman ( 895500 ) on Tuesday December 11, 2018 @12:32PM (#57786876)
        Mod parent up. The most common cause of a sudden, unexplained failure for both HDs and SSDs is a failure of the controller rather than the media.
      • I disagree that SMART data helps with diagnosing failures. I save the output of "smartctl -a /dev/?" every night for every drive on every server. I haven't seen anything that predicted the huge number of SSD failures that you get with heavy use. We started using them three years ago when we started buying servers with 2.5" drive bays. I think we've replaced the ~75 drives about 120 times. Yes, more than once. If someone could come up with a way of predicting failures, they would become rich.
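For what it's worth, the nightly collection described above can be as simple as a cron-driven sketch like this (the log directory and the /dev/sd? glob are placeholder assumptions, not a recommendation):

```python
#!/usr/bin/env python3
"""Sketch: dump `smartctl -a` for every /dev/sd? device into a dated file
so that attribute trends can be diffed later. Intended to be run from cron."""
import datetime
import glob
import pathlib
import subprocess

LOG_DIR = pathlib.Path("/var/log/smart-snapshots")   # hypothetical location

def snapshot() -> None:
    today = datetime.date.today().isoformat()
    LOG_DIR.mkdir(parents=True, exist_ok=True)
    for dev in sorted(glob.glob("/dev/sd?")):
        out = subprocess.run(["smartctl", "-a", dev],
                             capture_output=True, text=True, check=False)
        name = dev.rsplit("/", 1)[-1]                 # e.g. "sda"
        (LOG_DIR / f"{today}-{name}.txt").write_text(out.stdout)

if __name__ == "__main__":
    snapshot()
```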

        • by AmiMoJo ( 196126 )

          With hard drives a sure sign of imminent failure is the sector retry or reallocation count increasing.

        • by Junta ( 36770 )

          This is why I shake my head when I see someone going to a lot of trouble to track SMART data to 'know' when a disk is going to fail. It just makes it all the more disappointing when a drive fails and all the early warning effort did nothing.

          It is a much more robust approach to be able to not *care* whether you see the failure coming than to try to plan for an outage. SMART has no idea that a component on the controller board is going to burn out suddenly. Yes it can track things with k

        • Comment removed based on user account deletion
    • I find a lot of fear around new technology to be the same as the fear of flying.
      Where numbers all point to a better more robust product, there is just more anxiety for when something goes wrong, mostly because when it does, there is little to do to fix it.

      If an old spinning drive fails, you can sometimes put it in the freezer, power it up, and get the data off; or, if you are more technical, you can open it up and move the platters to another drive.

      But for the most part, Standard best practices of keeping

    • Re: (Score:3, Interesting)

      Seriously, you do not. You may know the end-result sometimes (head-crash), but the root-cause is usually not clear.

      So get over it. It is a new black-box replacing an older black-box.

      Well, I need to partially disagree with you there. With a traditional drive, when it fails and you take it apart carefully, you can try to determine what happened. If it was a head crash, you may be able to see what caused the head crash. In my case it was a Quantum or Maxtor drive that had 3 extra screws shipped loose inside it, where the internal control circuitry was. You could tell if it was a frozen motor, or if you were lucky, find that the external board had a fried electrical component on it. For friends I desoldered

    • Re: (Score:2, Interesting)

      by Luckyo ( 1726890 )

      You do actually. Many if not most disk failures have clearly predictable markers. This has been true for quite a long time at this point, to the level where my last two HDD failures in my home machine were diagnosable with no tools beyond a SMART reader. Better yet, they weren't "instant" failures; signs of impending failure started appearing months in advance, with clear cut warnings on the SMART readout. This resulted in sufficient time to buy a new drive and migrate all the data with no problems.

      Wi

    • Yep. I've had a number of spinning drives just drop dead on me. Some advice: Western Digital makes returns pretty easy for their drives and, when it comes to all drives, backup regularly/often!

  • Hey Chris from Department of Computer Science has a problem. Let's hear about it, Chris.
    • Since they’ve now edited the summary (hooray for editing), I’ll note for the edification of future readers: The original quotes in the summary were attributed to “Chris from Department of Computer Science”.

  • by froggyjojodaddy ( 5025059 ) on Tuesday December 11, 2018 @11:29AM (#57786324)
    *shrug* ?

    I mean, manufacturing defects, environment, and just old plain bad luck? SSDs have come a long way, but if I have anything of importance, I'm RAID'ing it and backing up. I feel anyone with an understanding of technology knows the importance of this.
  • Controller failure (Score:5, Insightful)

    by macraig ( 621737 ) <mark.a.craig@gmail . c om> on Tuesday December 11, 2018 @11:30AM (#57786334)

    I've had two SSDs die utterly. It wasn't because there was a failure of any part of the actual storage pathways: it was irreparable failure of the embedded controller circuits. The Flash itself was still fine and safely storing all my data, but there was no means to access it. At least with a platter drive if the PCB fails, you can unscrew and detach it and replace it with a matching PCB from another drive; no way to do that with an SSD. Early on when manufacturers were spending all their time hyping the comparative robustness of the Flash medium, they conveniently forgot to mention how fragile and not-so-robust the embedded third-party controller circuits could be.

    • by bobbied ( 2522392 ) on Tuesday December 11, 2018 @12:02PM (#57786602)

      Wow, that PCB substitution trick became very hit or miss a long time ago.

      Nowadays, there are a whole bunch of operational parameters which need to be set properly to get data on/off a drive. I understand that some of these "configuration" items are now stored in non-volatile memory on that PCB and set during the manufacturing process. Similar serial numbers may help, but it's still very hit or miss.

  • by FrankSchwab ( 675585 ) on Tuesday December 11, 2018 @11:31AM (#57786340) Journal

    Infant failures are common in electronics ( https://www.weibull.com/hotwir... [weibull.com] ) From a simple standpoint, imagine a poorly soldered junction on the PCB - soldered well enough to pass QC and work initially, but after a couple of heating cycles the solder joint fractures. The same kinds of problems occur inside chips - wire bonds between the package and die may be defective but initially conductive, and fracture due to thermal cycling.
    Similar problems can occur on the die. The gate oxide for a particular transistor might be too thin due to process issues. If it's way too thin, it'll fail immediately and the die will get sorted out at test. If it's just a bit thicker, it might pass all production tests but fail after an hour or two of operation, or 100 power cycles. If it's just a bit thicker (where it should be), it might last for 20 years and a million power cycles.
    Everyone in the semiconductor industry would love to figure out how to eliminate these early failures. No one has found a way to do it.
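As a toy illustration of that bathtub-curve point, a Weibull hazard with a shape parameter below 1 captures exactly this infant-mortality regime, while a shape above 1 captures wear-out. The parameters below are made up purely to show the shape:

```python
"""Toy illustration: a Weibull hazard with shape beta < 1 falls over time
(early-life / infant-mortality failures), while beta > 1 rises (wear-out).
The eta and beta values here are invented, not measured data."""

def weibull_hazard(t: float, beta: float, eta: float) -> float:
    # h(t) = (beta / eta) * (t / eta) ** (beta - 1)
    return (beta / eta) * (t / eta) ** (beta - 1)

for hours in (10, 100, 1_000, 10_000):
    early = weibull_hazard(hours, beta=0.5, eta=50_000)   # infant mortality
    wear = weibull_hazard(hours, beta=3.0, eta=50_000)    # wear-out
    print(f"{hours:>6} h   infant-mortality hazard = {early:.2e}   wear-out hazard = {wear:.2e}")
```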

    • Which is why a "burn in" operation, where you run the item through some thermal cycles, is often done. We are trying to find the stuff that's going to fail early.

      I usually do 24 hour burn in of all hardware I build, 12 hours on, then 2 hour cycles on off. Or, (sarc on) just load windows and run all the updates. (sarc off) It's almost the same thing anyway.. :)

  • In my experience with HDDs you'll usually get some warning that your drive has issues before it completely calls it quits. Whether it's bad sectors turning up or noises from the drive itself. If you pay attention to that (and you're a little lucky), you can manage to salvage most of the drive's contents before it dies completely.

    With an SSD one minute it's working completely fine and the next it's completely gone. While most of the data itself is probably still perfectly intact on the flash memory, get
    • I agree, but this has no practical benefit to me. When the HDD starts to throw errors, I pull it out of the RAID and stick in a new one. If the SSD completely up and dies, I pull it out of the RAID and stick in a new one. If more drives die or start to throw errors than there is redundancy, I restore from backup. If I can't restore from backup, well, then maybe then I'd appreciate the slowly-dying hard drive :)

      • I doubt that most home PC users have both the case space and the cash for a RAID. A user of a mainstream laptop sure doesn't.

        • Yes, I'm in that position with my small notebook. In my case, I imaged the drive when I first got it. I have Windows Backup set to back up to a NAS and I have iDrive installed for offsite backup. Most people don't need to go so crazy - they can get away with running Dropbox, OneDrive, Google Drive, etc. as their primary "Documents" folder and then letting Geek Squad put in a new drive and reinstall Windows. But even "most people" need to have backups of some kind. If they can't image a disk, they certainly

        • A lot of mainstream consumer laptops come with an M.2 slot for configurations with SSD but still have the SATA port for models with an HDD. You can fill both slots and make a RAID - the disks will just be different shapes. Software RAID, sure, but it can definitely be done affordably.

    • In my experience with HDDs you'll usually get some warning that your drive has issues before it completely calls it quits. Whether it's bad sectors turning up or noises from the drive itself. If you pay attention to that (and you're a little lucky), you can manage to salvage most of the drive's contents before it dies completely.

      In 2009, I had a 10-year-old 5 GB (yes, 5) enterprise SCSI disk (at home, not work) that failed to spin up after being off for over a year. (Before that it had been running almost continuously.) I tapped it (pretty hard) on the side with a screwdriver handle while it was "clicking" when I powered it up after removing the PC case. It slooowly spun up and worked fine. It had some bearing noise, but that went away after the drive warmed up. I pulled the data off and ran the drive for a couple of days w/o in

    • With an HDD I can envision how they can pull the platter and do forensics on it, but do you know how they take a peek at an SSD's memory at a professional service? It didn't occur to me until just now that I had no idea how they'd do it.
      • by saider ( 177166 )

        Connect to the controller board on the address and data lines for the flash chips, and manipulate them to access the chips. Then you would need to have a program that understands how this controller manages things and can reconstruct the sectors that it presents to the outside world.
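In cartoon form, that reconstruction step amounts to something like the sketch below. The page size, map format, and function names are invented for illustration; a real controller adds ECC, scrambling, compression, and interleaving on top of this:

```python
"""Highly simplified sketch: given raw pages dumped straight from the NAND
and the controller's logical-to-physical map, stitch the logical image back
together. A cartoon of the idea, not a recovery tool."""
from typing import Dict, Tuple

PAGE_SIZE = 4096  # bytes per flash page in this toy example

def rebuild_image(raw_pages: Dict[Tuple[int, int], bytes],
                  l2p_map: Dict[int, Tuple[int, int]],
                  total_pages: int) -> bytes:
    """l2p_map: logical page number -> (physical block, page) as recorded by the FTL."""
    blank = b"\x00" * PAGE_SIZE
    image = bytearray()
    for lpn in range(total_pages):
        phys = l2p_map.get(lpn)
        # Unmapped or unreadable pages come back as zero-filled gaps.
        image.extend(raw_pages.get(phys, blank) if phys is not None else blank)
    return bytes(image)
```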

  • With a spinning disk, you'll usually get an indication of a problem with a plethora of S.M.A.R.T errors.

    It's been my experience that when an SSD dies... you just suddenly appear to have an empty drive cage. It's a really ugly binary failure.

    I've taken to building my boxes with mirrored SSD's combined with taking and validating my backups.
    • I can see what you mean, but I think I won't really understand it until it happens to me (and I hope it never happens to me). I'm on my third SSD and none has ever failed; my previous one was showing some age and was SATA so I upgraded to M.2 NVMe on Cyber Monday. Perhaps they haven't failed on me because I keep most of my data on a HDD RAID array and use the SSDs only for OS, program files, and very limited caching.
  • by bill_mcgonigle ( 4333 ) * on Tuesday December 11, 2018 @11:36AM (#57786380) Homepage Journal

    It's bad firmware. Some of the drives can supposedly be resuscitated by the factory or people who have reversed the private ATA commands.

    I mean, at a minimum, unless it's a PHY failure (and there's no reason to suspect those), the firmware could at least report missing storage (I've actually seen a 0MB drive failure once or twice), but their usual failure mode is to halt and catch fire, as the author notes.

    With the recent reports about the inexcusable security problems on Samsung and Crucial drives this is starting to feel like the old BIOS problems with Taiwanese mobo companies outsourcing to the lowest bidder and shipping bug-laden BIOS with reckless abandon. It's OK, all the world's servers only depend on this technology.

    To be fair, I have a batch of 20GB Intel SLC SSDs that have never done this, but those are notable exceptions. At this point only low-end laptops like Chromebooks don't get at least a mirror drive here.

  • Why does it matter? (Score:5, Informative)

    by CaptainDork ( 3678879 ) on Tuesday December 11, 2018 @11:46AM (#57786464)

    I'm a retired IT guy and there's no kind of something that didn't fucking break. I'm not a goddam engineer. My job was to locate the problem at a black-box level and get the shit running again. Contemplating the "why" of a hardware failure is wheel-spinning instead of pulling the stuff out of the ditch.

    For new purchases under warranty, I exchanged them and sent the dead one back to the vendor. Let them hook it up and do diagnostics over a cup of coffee.

    I had work to do.

    • Maybe it helps the author to develop a narrative, but the long and short of it is, the author's non-volatile storage unit died, he needs to replace it to get the system back and he can send it back to where he bought it from because it died under warranty. Or, he might want to have it destroyed locally if it contains proprietary information.

      If you're in IT, I'm sure you'll see everything eventually break (including things like cases which don't make any sense at all) so why sweat it?

    • by dcw3 ( 649211 )

      Then you also know that if you've been seeing an unusual trend in some items breaking, it's probably cost effective for you to look for a root cause, and fix the problem, or find a suitable substitute to break the cycle. This is why we keep metrics on outages. It's not so much your job as the "IT guy", but whoever is managing the program/IT should be interested because it's costing them money.

  • Yes, you can listen for mechanical issues, yes you can (sometimes) read bad block and other SMART data. But, ultimately, without millions in equipment and skills, you just do not know. It is a cheap data storage brick. Choose one appropriate for your capacity and I/O needs, have a good backup plan in place, and quit whining.
  • ..and the more complex a machine is, the more that can go wrong with it.
    The controller PCB on a brand-new modern HDD can fail, rendering the entire device useless; any piece of silicon on a modern SSD can fail also, rendering the entire device useless. The only difference here is that with a HDD, if you happen to have another working drive of the exact same model and revision level, you could theoretically swap the controller PCB and be able to access the data on the platters again (I've done this). With a
  • Despite what others have said, this comes down to the brick wall nature of error correction codes. Every time you erase and rewrite a flash cell, you add wear to the transistors that make up the memory cell. Eventually (and probably immediately too) some of the bits won't read correctly. To compensate for this, the controller runs a mathematical function on your data, allowing it to recover from a certain percentage of bad bits. This is good, as that combined with wear leveling allows it to run a long time.
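A toy demonstration of that brick-wall behaviour, with a crude repetition code standing in for the BCH/LDPC codes a real SSD controller uses; everything is correctable right up to the limit, then the decode fails outright:

```python
"""Toy demo: a 5x repetition code corrects up to 2 flipped copies by majority
vote, then fails completely at 3. Real SSD codes are far stronger, but the
cliff-edge shape of the failure is the same."""

def encode(bit: int) -> list:
    return [bit] * 5                      # store 5 copies of the bit

def decode(copies: list) -> int:
    return 1 if sum(copies) >= 3 else 0   # majority vote

for flipped in range(6):
    word = encode(1)
    for i in range(flipped):
        word[i] ^= 1                      # simulate 'flipped' worn-out cells
    ok = decode(word) == 1
    print(f"{flipped} bad copies -> decoded {'correctly' if ok else 'WRONG'}")
```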

  • Reports from the ISS are that 9 out of 24 SSD drives failed in an HP supercomputer they'd brought up there. Quite scary how fragile those things are from radiation.

  • That happened to me three or four times already. They die without warning. No SMART indication, nothing. It really pisses us off. Someone needs to give us some kind of advance warning. Maybe SMART just isn't meant to work well with SSDs after all.
  • The spin is in! (Score:5, Insightful)

    by theendlessnow ( 516149 ) * on Tuesday December 11, 2018 @12:06PM (#57786644)
    One thing I like about spinning disks is that a lot of times the failure is gradual. Bad sectors and such and you have the opportunity to grab data off the drive (noting, you really should have backups).

    With SSD, whatever the issue, it's more like losing a controller board on the drive, everything dies and ceases to operate.

    So... I'll go along and say SSD is "better" and more "reliable", but when it dies, it dies hard. Just the way it is. (not talking about performance degradation... speaking about failure)
  • Improper handling of ungrounded components really can mess them up. They work but are defective. Take a look at some micrographs of ESD damage sometime. ESD does not always kill a part; it maims it -- sometimes only slightly. Anti-static mats and wrist straps are no laughing matter. Okay, they are. But use them anyway.
    • by dcw3 ( 649211 )

      "Static Zap makes Crap" - One of my favorite sayings from Computer Tech training in the USAF back in the 70s.

  • Most of the time heat kills electronics. Either they get too hot and something fries, or they suffer thermal fatigue. [wikipedia.org]
    • by dcw3 ( 649211 )

      Heat, static, condensation, unstable power, radiation, magnetic fields, vibration...pick your poison. It all depends on the environment you're working in and how well the equipment was designed.

  • I had a Sandisk USB stick recently go read only. I had been using it as a hypervisor boot drive and the boot was crashing. When I inspected it, it was read only and any attempts to format it, diskpart it, fdisk it failed with some kind of error. I looked it up and apparently this is the designed failure route for these USB drives. When the controller detects an inconsistency or uncorrectable error, the drive is locked from writing so you can get data off of it.
  • by GameboyRMH ( 1153867 ) <gameboyrmh@@@gmail...com> on Tuesday December 11, 2018 @12:16PM (#57786730) Journal

    SSDs really are unpredictable timebombs, so act appropriately - take frequent backups and use RAID if the downtime from a sudden SSD failure with zero warning is unacceptable. Any IT department that hasn't been prepared for the nature of SSD failures since long before they were available off the shelf was doing it wrong anyway.

    I'm most worried about what SSDs mean for the Average Joe, whose data is largely protected by the predictability and recoverability of most hard drive failures. SSDs throw all of that out the window and lure them in with the warm glow of performance like moths to a flame. Average Joes need a real wake-up call on the importance of backups with the switch to SSDs.

  • So you can have peace of mind:

    If it dies suddenly, without warning, it's 1) buggy firmware (I think this is by far the biggest culprit), or 2) bad components/soldering/cleaning on the PCB, or 3) a really dumb controller that isn't doing wear leveling on every single thing (think the master index), so when a critical flash cell dies the entire thing is dead even though there's plenty of good flash left (this was common with crappy little 'SSDs' that were just Compact Flash), or 4) a badly designed cont

    • by krray ( 605395 )

      I've had multiple OWC branded SSD's die on me. I usually like OWC branded items, but the SSD failure has me pulling any / all such branded ones out of service.

      It was my understanding that a failing SSD (can't write anymore properly) should flip itself over to READ ONLY mode. At least this would give you a chance to pull the existing data off the drive.

      The OWC failures were catastrophic (thankfully I had working backups :). When these SSDs failed they were just GONE. Nothing. The system wouldn't see them even con

  • SSDs have a bunch of tiny wires. When you push electricity through wires they heat up, they're not perfect super-conductors. If you heat it up too much, it will of course burn, but they avoid that. Still, heating up a wire over and over will have some wear and tear. For big thick power-lines in houses, this doesn't have too much effect, but for tiny precision electronics, it builds up. And SSD's have a LOT of those wires with a little bit of manufacturing variance which makes some parts fail sooner.

    They bur

  • I'm going to disagree with the people saying that spinning disks don't give you a warning of imminent death. A bad spindle will start whirring, and steadily get louder, and my experience has been that most drives go that way. Hence, the old trick of sticking the drive in a freezer to get a few minutes more life out of it (because you didn't keep your backups updated... again :-( ).

    This is a phenomenon that should always be kept in mind when switching from mechanical to electronic systems. The electronic ar

  • On new SSDs, the failure could be a die bond failure, a defect that can sometimes pass inspection and then fail later. Or a ball bond to PC failure that can be intermittent as the package, solder ball, and PC change dimensions due to different thermal expansion coefficients. The tiny contacts on the PC versus relatively huge contacts on the mechanical hard drive make these happen more often on SSDs.

    On older SSDs there could be degradation of the ability to hold or modify the stored charge that represents a bit. Not
  • I had a 4 tb spinning drive fail, after only 2 years. It was 75% full. That is what is scary to me. The only narrative I came up with to explain it was that it was in my system, but powered on, 24x7. Now my backup drives are external and I power them on when I need them.

    As drives get bigger, that is when I get nervous. I know, there's options to mitigate that, but I'm on a budget. I just migrated my OS to an SSD a couple of months ago, and still have spinning drives holding everything else.

  • by foxalopex ( 522681 ) on Tuesday December 11, 2018 @01:42PM (#57787366)

    I'm guessing the author never lived through the era when there were a lot more companies making mechanical HDs than there are now. HDs can spontaneously die from a failed motor, an electronics failure, or a catastrophic crash. Some small companies went completely under and were swallowed up by larger manufacturers due to massive defects. SSDs have gone through the same era as well with buggy firmware. Generally speaking though, if you stick to the big manufacturers like Samsung and Intel, the chances of fatal issues go down a lot. That said, an SSD is not a guarantee of safe data. They're far more reliable, but circuit failure or static electricity can kill SSDs. Besides, SSDs won't save you from an accidental erase all.
