
One Way To Save Digital Archives From File Corruption 257

storagedude points out this article about one of the perils of digital storage, the author of which "says massive digital archives are threatened by simple bit errors that can render whole files useless. The article notes that analog pictures and film can degrade and still be usable; why can't the same be true of digital files? The solution proposed by the author: two headers and error correction code (ECC) in every file."
This discussion has been archived. No new comments can be posted.


  • Too much reinvention (Score:5, Interesting)

    by DarkOx ( 621550 ) on Friday December 04, 2009 @08:56AM (#30322692) Journal

    If this type of thing is implemented at the file level, every application is going to have to do its own thing. That means too many implementations, most of which won't be very good or well tested. It also means application developers will be busy slogging through error-correction data in their files rather than the data they actually wanted to persist. I think the article offers a number of good ideas, but it would be better to do most of them at the filesystem level and perhaps some at the storage layer.
        Also, if we can present the same logical file to the application on read, even if every 9th byte on disk is parity, that is a plus, because it means legacy apps get the enhanced protection as well (a rough sketch of the idea follows below).
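
    A minimal sketch of that "every 9th byte is parity" idea, assuming a hypothetical storage layer below the application: one XOR parity byte is interleaved after every 8 data bytes on write, and the reader verifies each group but hands back only the logical bytes. Plain XOR parity only detects a bad group (repair needs an erasure location or a stronger code such as Reed-Solomon).

```python
# Sketch only: interleave one XOR parity byte after every 8 data bytes.
# XOR parity detects a corrupted group; it can only repair a byte whose
# position is already known, so real storage uses stronger codes (ECC/RS).

GROUP = 8  # data bytes per parity byte (an arbitrary choice for illustration)

def xor_all(chunk: bytes) -> int:
    p = 0
    for b in chunk:
        p ^= b
    return p

def encode(data: bytes) -> bytes:
    """What would be written to disk: data with parity bytes interleaved."""
    out = bytearray()
    for i in range(0, len(data), GROUP):
        chunk = data[i:i + GROUP]
        out += chunk + bytes([xor_all(chunk)])
    return bytes(out)

def decode(stored: bytes):
    """Return (logical_data, indices_of_groups_that_failed_parity)."""
    data, suspect = bytearray(), []
    for g, i in enumerate(range(0, len(stored), GROUP + 1)):
        block = stored[i:i + GROUP + 1]
        chunk, parity = block[:-1], block[-1]
        if xor_all(chunk) != parity:
            suspect.append(g)
        data += chunk
    return bytes(data), suspect

stored = encode(b"legacy apps only ever see the logical bytes")
logical, bad = decode(stored)
assert logical == b"legacy apps only ever see the logical bytes" and not bad
```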

  • by commodore64_love ( 1445365 ) on Friday December 04, 2009 @09:00AM (#30322724) Journal

    >>>"...analog pictures and film can degrade and still be usable; why can't the same be true of digital files?"

    The ear-eye-brain connection has ~500 million years of development behind it and has learned to filter out noise. If, for example, I'm listening to a radio, the hiss is mentally filtered out; if I'm watching a VHS tape that has wrinkles, my brain can focus on the undamaged areas. In contrast, when a computer encounters noise or errors, it panics, says "I give up," and the digital radio or digital television goes blank.

    What we need is a smarter computer that says, "I don't know what this is supposed to be, but here's my best guess," and displays the noisy result. Let the brain then take over and mentally remove the noise from the audio or image (a toy sketch of that idea follows).
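
    A toy illustration of that "best guess" behaviour, assuming the decoder has already flagged which run of audio samples failed its checksum: instead of muting, bridge the damaged span and let the listener's brain do the rest.

```python
# Sketch: conceal a known-bad run of audio samples [bad_start, bad_end) by
# ramping linearly from the last good sample to the next good sample,
# instead of dropping out. Real decoders use fancier concealment.

def conceal(samples, bad_start, bad_end):
    out = list(samples)
    left = samples[bad_start - 1] if bad_start > 0 else 0.0
    right = samples[bad_end] if bad_end < len(samples) else 0.0
    span = bad_end - bad_start
    for k in range(bad_start, bad_end):
        t = (k - bad_start + 1) / (span + 1)
        out[k] = left + t * (right - left)
    return out

# Samples 3 and 4 were garbage; the gap is bridged with ~0.3 and ~0.4.
print(conceal([0.0, 0.1, 0.2, 9e9, -9e9, 0.5, 0.6], 3, 5))
```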

  • by Anonymous Coward on Friday December 04, 2009 @09:02AM (#30322734)

    The PNG image format divides the image data into "chunks", typically 8 KB each, each carrying a CRC checksum. You'd archive two copies of each image, presumably in two places and on different media. Years later you check both files for CRC errors. If there are just a few errors, they probably won't fall in the same chunk, so you can splice the good chunks from each stored file into a new, good file (a sketch of that splice is below).
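
    A rough sketch of that splice, assuming both copies were written by the same encoder (so the chunk boundaries line up) and that the length fields themselves survived; the file names are hypothetical.

```python
# Walk the chunks of two copies of one PNG, check each chunk's CRC-32
# (computed over the type and data fields, as the PNG spec defines), and
# write out whichever copy's version of each chunk is intact.

import struct, zlib

PNG_SIG = b"\x89PNG\r\n\x1a\n"

def chunks(path):
    """Yield (raw_chunk_bytes, crc_ok) for each chunk in the file."""
    with open(path, "rb") as f:
        assert f.read(8) == PNG_SIG, "not a PNG (or the signature is damaged)"
        while True:
            head = f.read(8)                      # 4-byte length + 4-byte type
            if len(head) < 8:
                break
            (length,) = struct.unpack(">I", head[:4])
            data, crc = f.read(length), f.read(4)
            ok = zlib.crc32(head[4:8] + data) == struct.unpack(">I", crc)[0]
            yield head + data + crc, ok

def splice(copy_a, copy_b, out_path):
    with open(out_path, "wb") as out:
        out.write(PNG_SIG)
        for (raw_a, ok_a), (raw_b, ok_b) in zip(chunks(copy_a), chunks(copy_b)):
            if ok_a:
                out.write(raw_a)
            elif ok_b:
                out.write(raw_b)
            else:
                raise ValueError("the same chunk is damaged in both copies")

# splice("photo_primary.png", "photo_backup.png", "photo_repaired.png")
```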

  • by commodore64_love ( 1445365 ) on Friday December 04, 2009 @09:09AM (#30322794) Journal

    P.S.

    When I was looking for a digital-to-analog converter for my TV, I returned all the ones that displayed blank screens when the signal became weak. The one I eventually chose (x5) was the Channel Master unit. When the signal is weak it keeps displaying a noisy image rather than going blank, or it falls back to "audio only" mode rather than going silent. It lets me continue watching programs rather than being completely cut off.

  • About time (Score:3, Interesting)

    by trydk ( 930014 ) on Friday December 04, 2009 @09:10AM (#30322802)
    It is about time that somebody (hopefully some of the commercial vendors AND the open source community too) gets wise to the problems of digital storage.

    I always create files with unique headers and consistent version numbering to allow for minor as well as major file-format changes. For storage/exchange purposes, I make the format expandable: each subfield/record has an individual header with a field type and a length indicator. Each field is terminated with a unique marker (two NULL bytes) to make the format resilient to errors in the headers, with possible resynchronisation through the markers. The format is in most situations backward compatible to a certain extent, as an old program can always ignore fields/subfields it does not understand in a newer format file. If that is not an option, the major version number is incremented. This means that a version 2.11 program can read a version 2.34 file with only minor problems. It will not be able to write to that format, though. The same version 2.11 program would not be able to correctly read a version 3.01 file either.

    I have not implemented ECC in the formats yet, but maybe the next time I do an overhaul ... I will have to ponder that. Maybe not; my programs seem too ephemeral for that ... Then again, that is what people thought about their 1960s COBOL programs. (A rough sketch of the field format is below.)
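
    Something like the format described above, sketched with made-up field types; note the two-NUL terminator only works as a resync marker if payloads are guaranteed not to contain consecutive NUL bytes.

```python
# Sketch of a tagged, length-prefixed field format with a resync marker:
# each field is <type:2 bytes><length:4 bytes><payload><0x00 0x00>.
# A reader skips field types it does not know (forward compatibility) and,
# if a header looks damaged, scans ahead to the next terminator to resync.

import struct

END = b"\x00\x00"
KNOWN = {1: "title", 2: "author"}      # what an older reader understands

def write_field(buf: bytearray, ftype: int, payload: bytes) -> None:
    buf += struct.pack(">HI", ftype, len(payload)) + payload + END

def read_fields(blob: bytes):
    fields, pos = [], 0
    while pos + 6 <= len(blob):
        ftype, length = struct.unpack(">HI", blob[pos:pos + 6])
        start, end = pos + 6, pos + 6 + length
        if blob[end:end + 2] == END:               # header appears intact
            if ftype in KNOWN:                     # silently skip newer fields
                fields.append((KNOWN[ftype], blob[start:end]))
            pos = end + 2
        else:                                      # damaged header: resynchronise
            nxt = blob.find(END, pos + 1)
            if nxt == -1:
                break
            pos = nxt + 2
    return fields

buf = bytearray()
write_field(buf, 1, b"Digital archives")
write_field(buf, 7, b"a field type this old reader has never heard of")
write_field(buf, 2, b"trydk")
print(read_fields(bytes(buf)))   # [('title', ...), ('author', ...)]
```
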
  • Do not compress! (Score:2, Interesting)

    by irp ( 260932 ) on Friday December 04, 2009 @09:13AM (#30322832)

    ... Efficiency is the enemy of redundancy!

    Old documents saved in "almost ASCII" are still "readable". I once salvaged a document from some obscure, ancient word processor by opening it in a text editor. I also found some "images" (more like icons) on the same disk (a copy of a floppy), and even these I could "read", by changing the page width of my text editor to fit the width of the uncompressed image (a code version of that width trick is sketched below).

    As long as the storage space keeps growing...
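
    The same trick in code, hedged: this assumes the salvaged blob really is a raw 8-bit grayscale bitmap and that Pillow is installed; the file name and candidate widths are made up. Render the bytes at several guessed row widths and eyeball which one "locks in", just like stretching the text-editor window.

```python
# Interpret an unknown blob as a raw 8-bit grayscale image at several
# candidate row widths; whichever width makes the picture coherent wins.

from PIL import Image

def render_at_width(raw: bytes, width: int, out_path: str) -> None:
    height = len(raw) // width
    if height == 0:
        return
    Image.frombytes("L", (width, height), raw[:width * height]).save(out_path)

raw = open("mystery_icon.dat", "rb").read()      # hypothetical salvaged blob
for w in (16, 32, 64, 128, 320, 640):
    render_at_width(raw, w, f"guess_{w}px.png")
```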

  • by Anonymous Coward on Friday December 04, 2009 @09:27AM (#30322902)

    I think the problem is more around silent (passive) data corruption and loss.

    It does become an interesting exercise when you are dealing with "off-line" media like tape, DVD and rocks, though: the greater the data density, the greater the impact of media damage (entropy).

    So I guess there are two parts to this problem: how often do you validate your data, and how do you mitigate large-scale errors? There are good solutions at the on-line media level (ZFS, RAID, etc.), but relatively weak ones at the off-line level. Anyone know of the equivalent of the RAID model for things like tape? (A simple scrub sketch follows this comment.)

    Cheers,

    -I.
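
    For the "how often do you validate" half, one low-tech answer is a checksum manifest written at archive time and re-run on a schedule (a scrub), so rot is caught while the other copy is still good; the file and manifest names here are only illustrative.

```python
# Build a SHA-256 manifest of an archive tree, then periodically re-verify it
# and report any files whose contents have silently changed.

import hashlib, json, os

def file_hash(path, bufsize=1 << 20):
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while chunk := f.read(bufsize):
            h.update(chunk)
    return h.hexdigest()

def build_manifest(root, manifest="manifest.json"):
    digests = {}
    for dirpath, _, names in os.walk(root):
        for name in names:
            full = os.path.join(dirpath, name)
            digests[os.path.relpath(full, root)] = file_hash(full)
    with open(manifest, "w") as f:
        json.dump(digests, f, indent=2)

def scrub(root, manifest="manifest.json"):
    with open(manifest) as f:
        digests = json.load(f)
    return [rel for rel, digest in digests.items()
            if file_hash(os.path.join(root, rel)) != digest]

# build_manifest("/archive/photos")     # once, when the archive is written
# print(scrub("/archive/photos"))       # every few months: files needing restore
```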

  • by khundeck ( 265426 ) on Friday December 04, 2009 @09:32AM (#30322938)
    Parchive: Parity Archive Volume Set

    It basically allows you to create an archive that is a selectable amount larger, containing enough parity that you can suffer XX% corruption and still 'unzip' (a toy parity sketch follows this comment).

    "The original idea behind this project was to provide a tool to apply the data-recovery capability concepts of RAID-like systems to the posting and recovery of multi-part archives on Usenet. We accomplished that goal." [http://parchive.sourceforge.net/]

    KPH
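
    Parchive itself uses Reed-Solomon coding, so it can survive as many lost blocks as you made recovery blocks; the toy below shows only the simplest RAID-like case, a single XOR parity block that can rebuild any one missing data block.

```python
# One XOR parity block over equal-sized data blocks: lose any single data
# block and it can be rebuilt by XOR-ing the parity with the survivors.

def xor_blocks(blocks):
    acc = bytearray(len(blocks[0]))
    for blk in blocks:
        for i, b in enumerate(blk):
            acc[i] ^= b
    return bytes(acc)

def make_parity(blocks):
    return xor_blocks(blocks)

def rebuild_missing(blocks_with_one_none, parity):
    survivors = [blk for blk in blocks_with_one_none if blk is not None]
    return xor_blocks(survivors + [parity])

parts = [b"aaaa", b"bbbb", b"cccc", b"dddd"]
parity = make_parity(parts)
assert rebuild_missing([b"aaaa", None, b"cccc", b"dddd"], parity) == b"bbbb"
```
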
  • Film and digital (Score:3, Interesting)

    by CXI ( 46706 ) on Friday December 04, 2009 @09:38AM (#30322976) Homepage
    Ten years ago my old company used to advocate that individuals who wanted to convert paper to digital first put the documents on microfilm and then scan the film. That way, when their digital media got damaged or lost, they could always recreate it. Film lasts a long, long time when stored correctly. Unfortunately that still seems to be the best advice, at least if you are starting from an analog original.
  • by Phreakiture ( 547094 ) on Friday December 04, 2009 @10:26AM (#30323388) Homepage

    What we need is a smarter computer that says, "I don't know what this is supposed to be, but here's my best guess," and displays noise. Let the brain then takeover and mentally remove the noise from the audio or image.

    Audio CDs have always done this. Audio CDs are also uncompressed*.

    The problem, I suspect, is that we have come to rely on a lot of data compression, particularly where video is concerned. I'm not saying this is the wrong choice, necessarily, because video becomes ungodly huge without it (uncompressed NTSC SD video -- 720 x 480 at 29.97 frames per second, 4:2:2 colour, 8 bits per sample -- consumes about 69.5 GiB an hour; the arithmetic is sketched after this comment), but maybe we didn't give enough thought to stream corruption.

    Mini DV video tape, when run in SD, uses no compression on the audio, and the video is only lightly compressed, using a DCT-based codec with no delta coding. In practical terms this means one corrupted frame of video doesn't cascade into future frames: if my camcorder gets a wrinkle in the tape, it affects the frames recorded on the wrinkle and no others, and it makes a best-guess effort to reconstruct each damaged frame. That kind of reconstruction may not be impossible with denser codecs that do use delta coding and motion compensation (MPEG, DivX, etc.), but it is certainly made far more difficult.

    Incidentally, even digital cinemas use compression. There is no delta coding, but the individual frames are compressed in a manner akin to JPEG, and in most cinemas the audio is compressed using DTS or AC3 or one of their variants. The difference, of course, is that the cinemas must provide a good presentation. If they fail to do so, people will stop coming: if the presentation isn't better than watching TV/DVD/Blu-ray at home, then why pay the $11?

    (* I refer here to data compression, not dynamic range compression. Dynamic range compression is applied way too much in most audio media)
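
    The uncompressed figure above, worked out (4:2:2 with 8-bit samples averages 2 bytes per pixel: one luma sample per pixel, plus chroma samples shared between pairs of pixels):

```python
width, height = 720, 480
fps = 30000 / 1001                      # "29.97"
bytes_per_pixel = 2                     # 4:2:2, 8 bits per sample
bytes_per_hour = width * height * bytes_per_pixel * fps * 3600
print(bytes_per_hour / 2**30)           # ~69.5 GiB per hour, as claimed
```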

  • by Hatta ( 162192 ) on Friday December 04, 2009 @10:52AM (#30323676) Journal

    Don't forget PAR2 [wikipedia.org]. I never burn a DVD without 10%-20% redundancy as par2 files. Even if the filesystem gets too damaged to read, I can usually dd the whole disc and let par2 recover the files (a sketch of the workflow is below).
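
    A hedged sketch of that workflow, assuming the par2cmdline tool is on the PATH; the file names are placeholders and the redundancy level is just an example.

```python
# Create ~15% PAR2 recovery data before burning; years later, after dd-ing
# the ailing disc back to files, verify and (if needed) repair the payload.

import subprocess

payload = ["photos-2009.tar"]                     # what actually gets archived
subprocess.run(["par2", "create", "-r15", "recovery.par2", *payload], check=True)

# ... later, once payload + *.par2 files have been copied off the old disc:
if subprocess.run(["par2", "verify", "recovery.par2"]).returncode != 0:
    subprocess.run(["par2", "repair", "recovery.par2"], check=True)
```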

  • by Rockoon ( 1252108 ) on Friday December 04, 2009 @01:05PM (#30325452)
    File systems are in the software domain. If you aren't getting good data (what was written) off the drive, the file system ideally shouldn't be able to do any better than the hardware did with that data. Of course, in reality the hardware uses a fixed redundancy model that offers less reliability than some people would like. The danger of software-based solutions is that they allow hardware manufacturers to offer even less redundancy, or even NO redundancy at all, creating a need for even MORE software-based redundancy.

    The ideal solution is to make sure the data is good at every step, rather than allow the device (or transmission medium) to treat bad data as good data. With ZFS or any other file-system solution, the device won't know that the data is bad... and that's bad. (The end-to-end idea is sketched below.)
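
    This is not how ZFS is implemented, just the end-to-end idea the thread is arguing about, with two Python dicts standing in for mirrored devices: the layer that wrote the data keeps its own checksums, verifies every read, and when one copy disagrees it serves the good copy and rewrites the bad one.

```python
import hashlib

mirror_a, mirror_b, checksums = {}, {}, {}

def write_block(addr, data):
    mirror_a[addr] = data
    mirror_b[addr] = data
    checksums[addr] = hashlib.sha256(data).hexdigest()

def read_block(addr):
    expected = checksums[addr]
    for good, other in ((mirror_a, mirror_b), (mirror_b, mirror_a)):
        data = good[addr]
        if hashlib.sha256(data).hexdigest() == expected:
            if hashlib.sha256(other[addr]).hexdigest() != expected:
                other[addr] = data                  # "self-heal" the bad copy
            return data
    raise IOError(f"block {addr}: both copies fail their checksum")

write_block(7, b"archival payload")
mirror_b[7] = b"archival pAyload"                   # simulate silent bit rot
assert read_block(7) == b"archival payload" and mirror_b[7] == mirror_a[7]
```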
