Backblaze Finds SSDs Are More Reliable Than HDDs

williamyf writes: The fine folks at Backblaze have published their first-ever report that includes their SSD fleet. To the surprise of no one, SSDs are more reliable (0.98% AFR) than HDDs (1.64% AFR). The surprising thing was how small the difference is (0.66% AFR).

A TL;DR article by well-regarded storage reporter Chris Mellor is here. Also worthy of note: S.M.A.R.T. attribute usage among SSD makers is neither standardized nor very smart:

"Klein notes that the SMART (Self-Monitoring, Analysis, and Reporting Technology) used for drive state reporting is applied inconsistently by manufacturers. "Terms like wear leveling, endurance, lifetime used, life used, LBAs [Logical Block Address] written, LBAs read, and so on are used inconsistently between manufacturers, often using different SMART attributes, and sometimes they are not recorded at all."

That means you can't use such SMART statistics to make valid comparisons between the drives. "Come on, manufacturers. Standardize your SMART numbers."
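
For the curious, smartmontools will dump whatever attributes a given drive chooses to expose; comparing the output from two different brands of SSD shows the inconsistency immediately (the device path below is just an example):

$ sudo smartctl -A /dev/sda
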
  • No Samsung? (Score:3, Informative)

    by Joce640k ( 829181 ) on Friday March 10, 2023 @01:20PM (#63359189) Homepage

    We're missing a major manufacturer... Samsung.

    • It would be interesting to know why. Did Samsung just not have a large amount for Backblaze to buy? Or were they retail channel only?
      • Re:No Samsung? (Score:4, Interesting)

        by tlhIngan ( 30335 ) <slashdot.worf@net> on Friday March 10, 2023 @02:08PM (#63359343)

        Samsungs aren't cheap. From what I can tell about Backblaze from these reports, they are pretty much buying the cheapest drives on the market in as much bulk as they can, to get as much storage for the dollar as possible.

        This is fine because all their data is protected against failure - if your data is stored in a sufficiently redundant RAID array, failure of individual drives isn't a big deal, and it comes down to getting as much storage as cheaply as possible.

        And I would qualify an SSD as failed when errors crop up through the interface - even if SMART is completely fake, once you start getting sector read errors or write errors, those are real, tangible failures for which you can decommission the drive and report it as failed.

        So a non-failed SSD could be working because it uses really good flash chips, or has a kickass error correction algorithm that corrects minor errors (they all employ ECC algorithms because errors are common, they're normal, and they're expected in operation). If an error bubbles up through the NVMe or SATA interface then it's critical and the SSD has effectively failed.

    • I imagine Samsung have priced themselves out of this market. If drives costing 2/3 of a Samsung do the job perfectly well, then at scale there are massive savings using e.g. Crucials and Seagates.

    • I don't know about other Samsung drives, but I can tell you for a fact about the EVO 870s.

      Namely, that I found myself racing to replace all the ones I had installed after multiple spectacularly premature failures in a set of only eight. We're talking about 1TB drives failing with unrecoverable media errors after total writes of tens of TB.
      • by ksw_92 ( 5249207 )

        Same, with 1TB EVO 860s. They did not last 6 months in a simple NVR environment. The Kingstons that replaced them are still going after 2 years.

        • I checked my VM box yesterday and saw that my EVO 860 RAID was degraded, so I can report the same.

          And I have most assuredly not hammered the drives, I use that machine to test things maybe once a month or so.

        • I've found that lower-cost SSDs are often just as reliable as much pricier SSDs. That's just my anecdotal claim (whatever that's worth).

          The performance may be a bit slower by specifications, but often a fast drive is bottle-necked by some other component in the system and those higher speeds never get to make a difference.

          As for brands, Crucial, Kingston, Silicon Power, and SK Hynix have all worked very well for me for the last 6 ~ 8 years. No failures so far.

          I don't use Samsung much because they too expens

  • by whoever57 ( 658626 ) on Friday March 10, 2023 @01:23PM (#63359199) Journal

    "Terms like wear leveling, endurance, lifetime used, life used, LBAs [Logical Block Address] written, LBAs read, and so on are used inconsistently between manufacturers, often using different SMART attributes, and sometimes they are not recorded at all."

    The situation appears to be even worse with NVMe drives.
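
    For what it's worth, nvme-cli will at least dump the spec-defined health log; how much of it a vendor actually populates, and what lives in its vendor-specific log pages, still varies a lot. A quick look, assuming nvme-cli is installed and with the device path as an example:

    $ sudo nvme smart-log /dev/nvme0n1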

  • I don't see how these statistics mean anything at all, without reference to some kind of duty cycle or bytes written/bytes read. When comparing HDDs to each other, one can assume the same usage cases so just time between failures makes sense. But SSDs are generally not used the same way as HDDs.

    • by Zak3056 ( 69287 )

      After reading the fine article, I can't help but agree with the above, and the duty cycle question is answered: these are BOOT drives, as opposed to drives being used to store customer data. If you compare to the spinning rust statistics, you'll see that the sample population consisted of 4,299 boot drives and 231,309 data drives, which are not really broken out (unless you try to deduce exactly which drives are being used as boot media).

      The submitter should be ashamed of themselves if they wrote either the

    • I don't see how these statistics mean anything at all, without reference to some kind of duty cycle or bytes written/bytes read. When comparing HDDs to each other, one can assume the same usage cases so just time between failures makes sense. But SSDs are generally not used the same way as HDDs.

      There's another huge consideration: how failure rates vary with age. Even a decade ago, SSDs were already more reliable than HDDs, at least in the first year, after discounting early-life failures. The problem with SSDs is that there was a cliff after so many years, and the cliff was more abrupt than for HDDs.

      The other caveat to the Backblaze numbers is that even though they buy a lot of drives, their numbers are still vulnerable to small sample size effects. Many failures, including early life but

  • You know how with HDDs, when you needed a secure data erase, you would overwrite the sectors with random bits 3 times? My one major problem with SSDs -- there is no sure, reliable way to securely erase the data.

    • My one major problem with SSDs -- there is no sure reliable way to securely erase the data.

      It's called a hammer. Anyone that paranoid about someone recovering their files would be quite happy to eat the cost.

      • by Anonymous Coward
        A hammer is not enough. The truly paranoid, like the NSA, first put their drives through a degauss machine so that - working or not - the media get erased and then they put them through a shredder that produces particles no larger than 2mm^2. I would argue that if they're that paranoid they might as well have a single step process and put them through a blast furnace to melt them down.
    • by ctilsie242 ( 4841247 ) on Friday March 10, 2023 @02:20PM (#63359369)

      In theory, OPAL-compliant SSDs (which, AFAIK, is all of them) can be "erased" by a secure erase command, which marks all blocks as free and generates a new master key. A device-wide TRIM would, in theory, do it as well.

      Realistically, who knows. There were issues a while back with drives' built-in encryption not being implemented securely (using the same key no matter what), and a device-wide TRIM may not actually erase the written pages for a long while.

      The best mitigation is to always use FDE. This way, a simple "blkdiscard -v -f /dev/sdwhatever" or the "nvme format" command will be good enough to ensure the data is not going to be recoverable, barring a major break in AES.
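
      Concretely, something like the following should do it on a Linux box; the device paths are just examples, and --ses=1 asks the drive for a user-data erase (--ses=2 requests a crypto erase on drives that support it), so check what your drive actually implements:

      $ sudo blkdiscard -v -f /dev/sda
      $ sudo nvme format /dev/nvme0n1 --ses=1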

    • Re: (Score:2, Interesting)

      by Anonymous Coward

      3 times? No one has ever demonstrated they can recover anything from even one overwrite.

      If you overwrite a whole SSD, none of the blocks will be marked as empty, so it will definitely wipe them all. A full-disk discard/trim will not necessarily erase all the blocks, though.

      With that said, both HDD's and SSD's have spare sectors and if some private data gets reallocated to the "failed" space then there is no way to wipe that unless the drive supports such a feature. There are some drives that say they can wip

    • by kenh ( 9056 ) on Friday March 10, 2023 @02:41PM (#63359433) Homepage Journal

      Yes there is.

      The issue with spinning hard drives is the magnetic residue left after data is "erased".

      There is no such "residue" to read in an SSD. To "erase" an SSD, simply re-partition the drive, rewrite EVERY sector with static data (all ones?), then blow away the partition table.

      Greater security requires physical destruction - chipper or sledge hammer, for example. If you are worried about data protection, maybe don't sell your used drives?

      • The issue with spinning hard drives is the magnetic residue left after data is "erased".

        There's no such residue, at least not at a meaningful enough magnitude to recover anything from even a single zeroing-out of the drive. The idea of reading zeroed data is a remnant from the 80s, when bits were almost big enough to see.

    • by jon3k ( 691256 )

      My one major problem with SSDs -- there is no sure reliable way to securely erase the data.

      Yes, there is [kernel.org].

      But really, all you have to do is write the capacity of the drive once and you've "erased" everything. Pretty easy, e.g. for a 512GB drive:

      $ dd if=/dev/zero of=/dev/nvme0n1 bs=1M count=512000

      • But really all you have to do

        In the time it takes to perform a secure erase by manually writing the entire drive, you could code your own utility from scratch to issue the ATA secure erase command, a command that takes 2 minutes to execute, ... compared to about 3-5 hours for zeroing.
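
        You don't even need to code it; hdparm already wraps the ATA command. A rough sketch, assuming the drive isn't security-frozen by the BIOS, with "p" as a throwaway password and /dev/sdX as a placeholder:

        $ sudo hdparm --user-master u --security-set-pass p /dev/sdX
        $ sudo hdparm --user-master u --security-erase p /dev/sdX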

      • I really like the succinct command line info, but really, why can't it include, instead of writing /dev/zero, some ImageMagick magic for different sizes of the goatse image and/or two girls, one cup, until the drive is full? First, the command line is the one right tool for dealing with such imagery, and second, should anyone actually recover anything, they will most certainly regret it. Alternatively, as a less cruel proposal, one could put Rick Astley's Never Gonna Give You Up all over the drive. Or, alt
    • To securely erase an SSD: use full-disk encryption with a master key on the SSD. Then have modified firmware that allows erasing, not just overwriting, the master key.
  • I feel like this is a pretty useless analysis. If the SMART numbers are not standardized or even tracked, what counts as a failure? Especially since these are being used as boot drives, are they really getting enough random write time to provide meaningful data? After all, power-on hours aren't as meaningful for an SSD since it's not sitting there spinning the whole time.

    The 31,000 Dell SSDs with a zero failure rate tells me they are just good at internal error correction. I wonder how much data corruption

    • Re:Look closer (Score:5, Interesting)

      by blahabl ( 7651114 ) on Friday March 10, 2023 @01:54PM (#63359309)

      The 31,000 Dell SSDs with a zero failure rate tells me they are just good at internal error correction. I wonder how much data corruption is just washed over by its internal wear leveling...

      Yeah, the thing is, no one cares. Either the data is safe or it isn't, and I don't care if it's because of super-resilient chips, or because of awesome wear-leveling algorithms, or magic pixie dust.

    • The 31,000 Dell SSDs with a zero failure rate tells me they are just good at internal error correction. I wonder how much data corruption is just washed over by its internal wear leveling...

      Is a corrected error really a failure?

      If zero incorrect/corrupted data is passed out of the drive, that sounds like "zero failures" to me.

  • Had a consistent problem with corruption on disk on my laptop (yes it was running linux).

    The stock HDD (not SSD) was a Seagate.

    SMART said everything was fine.

    Swapped out for a new WD HDD and haven't had a problem since.

    SMART helped me diagnose the problem not at all. Unfortunately, Linux did, by telling me fsck had failed :-(

    • by korgitser ( 1809018 ) on Friday March 10, 2023 @02:39PM (#63359423)

      First of all, for a bit of sarcasm: Seagate has been known for something like fifteen years to be the lowest-quality manufacturer. Is it really S.M.A.R.T. to buy a drive from them?

      But more importantly, SMART helps, but it's not perfect and cannot be. As with the halting problem, there are always going to be problems it cannot catch - not just in the sense that it is logically impossible to catch them all, but also in the sense that even trying to would be economically unwise. SMART was never supposed to be more than picking the low-hanging fruit.

      Going more into the specifics, SMART has no idea about the health of the drive motor or its bearings, nor about the actual electronics on the board, other than the limited info we now have about the health of flash cells.

      I haven't kept statistics, but I would say that for roughly every 9 drives that SMART has warned me were going bad, there has been one drive that died without warning.

      Since everybody is obviously running RAID and backups for every system where they care about uptime and data safety, a drive dying without warning is not a big problem. But it sure is nice to get to plan for and replace most of them on your own schedule instead of as a forced move.

  • In a DC environment at less than 40C, this really does not tell me anything without telling me the drive's role, and whether something else is truly caching the epic amount of I/O bullshit in the modern data center. The role of the drive is important: logging server, telemetry archive database server vs. webserver/fileserver/email server. I have drives that have sat doing less than 10k writes per day for a decade with less wear than the packet broker witness disks have after a week. Role is very important, developin
  • All my SSD failures have involved the dreaded "Bad Context" problem where your drive is suddenly only 32MB, the disk drive becomes named "Bad Context" (instead of the model of the drive), and you can't read or write the drive.

    • by antdude ( 79039 )

      My very first SSD (Corsair Force Series F115 (115 GB; CSSD-F115GB2-BRKT-A)) suddenly died last month without warning. I tried powering off and on, resetting CMOS, and swapping SATA and power cables between a working Seagate 320 GB HDD and the SSD. The cables were fine. My 14-year-old PC's BIOS just doesn't see the old SSD, which was from 11/24/2011. I guess its controller or some other hardware part died. :(

  • You give me $1000, and I will give you back $500, okay?
    • You give me $1000, and I will give you back $500, okay?

      Amazing! I even gave you a TL;DR, but noooo commenting without reading:

      From TF-TL;DR:

      "Based on its SSD and HDD AFR percentages, the difference is 1.64 – 0.98 = 0.66, not even one in 100 drives. In a 1,000-HDD population, we would expect 16.4 to fail while with 1,000 SSDs we expect 9.8 to fail – a difference of 6.6 drives. The reliability difference is much less than we would have expected."

  • They talk about drive failures not normalized to capacity. While an interesting factoid, more interesting might be failure rates per TB and cost per TB.

    "Error rate per TB" should be their question, not failures per drive. If HDDs have a 1.66% failure rate per drive, is that for 8, 12, or 16TB?

    If the failure rate for SSDs is .66%, how many SSDs would it take to make the same size storage as one 8TB HDD? Using 1TB SSDs, the error rate would go up as you approach the same capacity as the HDD. I.e. it's not 6.6% for 10 SSDs, but 1 - 0.9934^#drives to get
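
    As a rough sanity check of that kind of arithmetic, using the article's published SSD AFR of 0.98%, the chance that at least one of ten 1TB SSDs fails in a year is 1 - (1 - 0.0098)^10, or roughly 9.4%, versus 1.64% for a single 8TB HDD. A one-liner to check it:

    $ python3 -c "print(1 - (1 - 0.0098)**10)"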

  • HDDs are just as inconsistent. About the only reading that means anything on them is the reallocated sector count, and that's mostly an indicator that the failure has already occurred and you'd better get moving on a replacement. It's not that hard to figure out if you've made a habit of paying attention to those readings.
    • Yeah, they are MUCH better, actually. The thing is, HDDs are also reporting mostly physical attributes, not performing a calculation of useful life. Quite a lot of HDD attributes are very standard across manufacturers and models.

      It also means that SMART data varies a lot in usefulness; largely, few people give a shit about HDD metrics other than what the clicking sound is already telling them.

  • What makes you think manufacturers want detailed comparison data between their products and other vendors?

  • For most any hard drive manufactured from 1995 on, you can pull it out of the computer, put it on a shelf, and the data is still readable 15 years later. SSDs, in my experience, scramble after 6 months without being powered up. That's a concern separate from reliability in service, but a consideration in some situations.

    [yes, I know: tape is better for this purpose. But that doesn't help you when you are in the back room looking for the only PC on Earth with the software to reprogram the 1990 vintage machine to

  • Then afaik, it's less likely to fail than a new one, to a statistically significant degree.
    Buying high-quality recertified ones can be a good deal, especially if you buy one with a warranty. Long-format it and deep-scan it for errors, and it's off to the races, IMO. The freeware DiskGenius can be useful for that, and it was a lifesaver the one time my hard drive dock made a drive unreadable to Windows. I recovered everything using DiskGenius, with no limit on how many files it allowed me to recover.

    But any

  • Your SSDs are only as good as the UPS and surge protector they're plugged into. Ask me how I know.
