One Way To Save Digital Archives From File Corruption 257
storagedude points out this article about one of the perils of digital storage, the author of which "says massive digital archives are threatened by simple bit errors that can render whole files useless. The article notes that analog pictures and film can degrade and still be usable; why can't the same be true of digital files? The solution proposed by the author: two headers and error correction code (ECC) in every file."
Too much reinvention (Score:5, Interesting)
If this type of thing is implemented at the file level, every application is going to have to do its own thing. That means too many implementations, most of which won't be very good or well tested. It also means application developers will be busy slogging through error-correction data in their files rather than the data they actually wanted to persist. I think the article offers a number of good ideas, but it would be better to do most of them at the filesystem level and perhaps some at the storage layer.
Also, if we can present the same logical file to the application on read, even if every 9th byte on disk is parity, that is a plus, because it means legacy apps can get the enhanced protection as well.
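A minimal sketch of the "every 9th byte is parity" idea, in Python (the function names and stripe width are my own, not from the article): plain XOR parity can rebuild one byte per stripe, but only when the storage layer can already tell you which byte is bad, as a failed sector read does.
```python
from functools import reduce

STRIPE = 8  # hypothetical stripe width: 8 data bytes followed by 1 parity byte

def add_parity(data: bytes) -> bytes:
    """Append one XOR parity byte after every 8 data bytes."""
    out = bytearray()
    for i in range(0, len(data), STRIPE):
        chunk = data[i:i + STRIPE]
        out += chunk
        out.append(reduce(lambda a, b: a ^ b, chunk, 0))
    return bytes(out)

def recover_byte(stripe: bytes, bad_index: int) -> int:
    """Rebuild the byte at bad_index from the other bytes of a 9-byte stripe.
    Only works when the bad position is known (an erasure), e.g. a failed sector read."""
    return reduce(lambda a, b: a ^ b,
                  (b for i, b in enumerate(stripe) if i != bad_index), 0)

encoded = add_parity(b"ARCHIVE!")            # one full stripe: 8 data bytes + 1 parity byte
assert recover_byte(encoded, 3) == ord("H")  # byte 3 rebuilt from the survivors
```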
Re: (Score:2)
Re:Too much reinvention (Score:5, Insightful)
Ahem. RAID anyone? ZFS? Btrfs? Hello?
Isn't this what filesystem devs have been concentrating on for about 5 years now?
Re:Too much reinvention (Score:4, Informative)
"says massive digital archives are threatened by simple bit errors that can render whole files useless.
Isn't this what filesystem devs have been concentrating on for about 5 years now?
Not just 5 years. ZFS's checksum on every data block and RAID-Z (which closes the RAID write hole) are innovative and obviously the next step in filesystem evolution. But attempts at redundancy aren't new. I'm surprised the article is discussing relatively low-tech, old-hat ideas such as two filesystem headers. Even DOS's FAT used this kind of brute-force redundancy by keeping two copies of the FAT. The Commodore Amiga's filesystem (OFS) did this better than Microsoft back in 1985 by having forward and backward links in every block, which made it possible to repair block-pointer damage by searching for a reference to the bad block in the preceding and following blocks.
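A toy illustration of the forward/backward-link idea (not the real Amiga on-disk layout; the class and function names are invented): if a block's forward pointer is damaged, the block that names it as a predecessor gives the pointer back.
```python
class Block:
    """Toy block with forward and backward links; illustration only, not the real OFS format."""
    def __init__(self, key, prev_key, next_key, payload=b""):
        self.key, self.prev_key, self.next_key, self.payload = key, prev_key, next_key, payload

def repair_next_pointer(blocks, damaged_key):
    """Recover the damaged block's forward link by finding whichever block
    names it as a predecessor."""
    for blk in blocks.values():
        if blk.prev_key == damaged_key:
            return blk.key
    return None  # it was the last block in the chain, or its successor is gone too

# Chain 10 -> 11 -> 12 whose middle forward link got corrupted on disk.
chain = {k: Block(k, p, n) for k, p, n in [(10, None, 11), (11, 10, 0xBAD), (12, 11, None)]}
chain[11].next_key = repair_next_pointer(chain, 11)
assert chain[11].next_key == 12
```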
And I suppose if ZFS doesn't catch on, 25 or 30 years from now Apple or Microsoft will finally come up with it and say, "Hey look what we invented!"
Re: (Score:2)
Not all RAID implementations check for errors. Some cheaper hardware will duplicate data or compute parity but never check for corruption; it only uses the duplicated data for recovery. Nothing says fun like rebuilding a RAID drive only to find your parity data was corrupt, which you won't discover until you try to use the new drive and weird things happen.
Will the corruption affect your FS, or will it affect your data? 8-ball says........
Re: (Score:2)
RAID is not backup or archive. If bit errors get written to a RAID-1 system, you now have them on both disks.
Using more complex RAID configs does not necessarily solve the problem.
With archiving, the problem becomes apparent after you pull out your media after 7 years of not using it. Is that parallel ATA hard drive you stored it on still good? Do you have a connection for it? What about those Zip and Jaz drives you used? Do you have a method of reading them? Are those DVDs and CDs still goo
Re: (Score:3, Interesting)
Re: (Score:3, Insightful)
I agree that filesystem-level error correction is a good idea. Having the option to specify ECC options for a given file or folder would be great functionality to have. The idea presented in this article, however, is that certain compressed formats don't need ECC for the entire file. Instead, as long as the headers are intact, a few bit errors here or there will result in only some distortion; not a big deal if it's just vacation photos/movies.
By only having ECC in the headers, you would save a good deal of storage
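A rough sketch of header-only protection, using a checksummed duplicate of the first kilobyte as a stand-in for real ECC; the header size, magic layout, and function names are invented for illustration.
```python
import struct, zlib

HDR_LEN = 1024   # hypothetical: protect only the first 1 KiB, where most format headers live

def wrap(payload: bytes) -> bytes:
    """Prefix two checksummed copies of the payload's first HDR_LEN bytes; the body stays unprotected."""
    header = payload[:HDR_LEN].ljust(HDR_LEN, b"\x00")   # pad so both copies have a fixed size
    copy = struct.pack("<I", zlib.crc32(header)) + header
    return copy + copy + payload

def unwrap(blob: bytes) -> bytes:
    """Return the payload, splicing in whichever stored header copy still checks out."""
    rec = 4 + HDR_LEN
    body = blob[2 * rec:]                                # the original file, header possibly rotted
    for off in (0, rec):                                 # primary copy first, then the backup
        crc = struct.unpack("<I", blob[off:off + 4])[0]
        header = blob[off + 4:off + rec]
        if zlib.crc32(header) == crc:
            return header[:len(body)] + body[HDR_LEN:]
    raise ValueError("both header copies are corrupt")

original = b"\xff\xd8\xff\xe0" + b"\x00" * 5000          # a fake JPEG-ish payload
blob = bytearray(wrap(original))
blob[10] ^= 0xFF                                         # rot a bit inside the first header copy
assert unwrap(bytes(blob)) == original                   # the backup copy saves the file
```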
Brave New World (Score:2)
Consumers should welcome file corruption; it's a chance to throw away those old files and buy some brand new ones instead
Actually, I would not be surprised if the media companies were busily trying to invent a self-corrupting DRM format to replace DVDs and suchlike.
Re: (Score:2)
> Actually, I would not be surprised if the media companies were
> busily trying to invent a self-corrupting DRM format to replace
> DVDs and suchlike.
Any physical medium is automatically self-corrupting. Leave a DVD on the back shelf of a car for a summer afternoon if you don't believe that.
Re:Too much reinvention (Score:4, Insightful)
If this type of thing is implemented at the file level, every application is going to have to do its own thing. That means too many implementations, most of which won't be very good or well tested. It also means application developers will be busy slogging through error-correction data in their files rather than the data they actually wanted to persist. I think the article offers a number of good ideas, but it would be better to do most of them at the filesystem level and perhaps some at the storage layer.
Also, if we can present the same logical file to the application on read, even if every 9th byte on disk is parity, that is a plus, because it means legacy apps can get the enhanced protection as well.
Precisely. This is what things like torrents, RAR files with recovery blocks, and filesystems like ZFS are for: so every app developer doesn't have to roll their own, badly.
Re:Too much reinvention (Score:5, Interesting)
Don't forget PAR2 [wikipedia.org]. I never burn a DVD without 10%-20% redundancy as par2 files. Even if the filesystem gets too damaged to read, I can usually dd the whole disk and let par2 recover the files.
Re: (Score:2)
A simple solution would be to add a file full of redundancy data alongside the original on the archival media. A simple application could be used to repair the file if it becomes damaged, or test it for damage before you go to use it, but the original format of the file remains unchanged, and your recovery system is file system agnostic.
Re: (Score:2)
How about an archive format, essentially a zip file, which contains additional headers for all of its contents? Something like a manifest would work well. It could easily be an XML file with the header information and other metadata about the contents of the archive. This way you get a good compromise between having to store an entire filesystem and the overhead of putting this information in each file.
Add another layer by using a striped data format with parity and you have the ability to reconstruct any
Re: (Score:2)
Also, if we can present the same logical file to the application on read, even if every 9th byte on disk is parity, that is a plus, because it means legacy apps can get the enhanced protection as well.
Not to nitpick, but I'm going to nitpick...
Parity bits don't allow you to correct an error, only detect that an error is present. This is useful for transfer protocols, where the system knows to ask for the data again, but the final result for an archived file is the same: file corrupt. With parity, though, you have an extra bit that can itself be corrupted.
In other words, parity is an error-check system; what we need is ECC (Error Check and Correct). A single parity bit is a 1-bit check, 0-bit correct. we need
Re: (Score:2)
Sorry, messed up my abbreviations.
We want Error Detect and Correct.
ECC is an Error Correcting Code.
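For the concrete difference, here is a textbook Hamming(7,4) code sketched in Python (nothing specific to any file format; the function names are mine): three parity bits over four data bits are enough to locate a single flipped bit, and therefore to flip it back.
```python
def hamming74_encode(d):
    """Encode 4 data bits [d1, d2, d3, d4] into a 7-bit codeword (positions 1..7)."""
    d1, d2, d3, d4 = d
    p1 = d1 ^ d2 ^ d4          # covers positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4          # covers positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4          # covers positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]

def hamming74_correct(c):
    """Recompute the parity checks; the syndrome is the 1-based position of a single bad bit."""
    c = list(c)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 + 2 * s2 + 4 * s3
    if syndrome:
        c[syndrome - 1] ^= 1   # correct the single flipped bit
    return [c[2], c[4], c[5], c[6]]

word = hamming74_encode([1, 0, 1, 1])
word[5] ^= 1                                   # simulate one bit of rot
assert hamming74_correct(word) == [1, 0, 1, 1]
```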
Re: (Score:2)
>>>>> If this type of thing is implemented at the file level every application is going to have to do its own thing.
>>
>>Great. So add it to the system level.
Somebody has ADD and didn't bother to finish reading the *whole* paragraph. Quote: "It would be better to do most of them at the filesystem..."
Re: (Score:2)
Re:Too much reinvention (Score:5, Insightful)
Re: (Score:3, Funny)
Re: (Score:2)
You have just taken the hideous word 'boxen' to new heights!
True, though; ZFS is a bleedin' obvious implementation of this kind of thing
Re: (Score:2)
Not only does it contain ECC for each block, but the ECC for a block is not contained within the block itself, so you don't run into problems with self-consistent but still bad blocks.
Re: (Score:2)
I agree with you about ZFS. Getting Sun to dual-license it under the GPL would mean it becoming a de facto standard everywhere but Macs and Windows. However, we should also have ECC in the file format, so that regardless of what type of machine a file is stored on, there is a good chance of repairing it.
On an ISO basis, one can use a utility like DVDDisaster to append ECC info to the end of an ISO file before burning to CD or DVD. Since this is transparent to user action, the use of the added ECC only really
Re: (Score:2)
Actually, ZFS does not use ECC unless you are using RAID-Z. It does use checksums for all blocks, but those can only detect errors; if you're not using RAID-Z or a mirror, or don't have copies=(2|3), the error will be fatal, but at least you'll know about it.
Re: (Score:2)
Having ECC in metadata is fine, but in a perfect world applications wouldn't ignore ECC either, especially when a file is being used and modified. Instead, the app should update the ECC records either when any write is performed, or at least when the file is closed.
par files (Score:5, Informative)
include par2 files
Re: (Score:2)
I'm glad I didn't have to scroll down too far to see this. They're practically magic.
Re: (Score:2)
Or just use PAR for your archives (Score:2)
Done. +1 to the poster who said there is some round transportation implement being reinvented here.
It's that computer called the brain. (Score:5, Interesting)
>>>"...analog pictures and film can degrade and still be usable; why can't the same be true of digital files?"
The ear-eye-brain connection has ~500 million years of development behind it and has learned to filter out noise. If, for example, I'm listening to a radio, the hiss is mentally filtered out, or if I'm watching a VHS tape that has wrinkles, my brain can focus on the undamaged areas. In contrast, when a computer encounters noise or errors it panics and says, "I give up," and the digital radio or digital television goes blank.
What we need is a smarter computer that says, "I don't know what this is supposed to be, but here's my best guess," and displays the noise. Let the brain then take over and mentally remove the noise from the audio or image.
Re:It's that computer called the brain. (Score:5, Interesting)
P.S.
When I was looking for a digital-to-analog converter for my TV, I returned all the ones that displayed blank screens when the signal became weak. The one I eventually chose (x5) was the Channel Master unit. When the signal is weak it continues displaying a noisy image rather than going blank, or it reverts to "audio only" mode rather than going silent. It lets me continue watching programs rather than being completely cut off.
Re: (Score:3, Insightful)
Re: (Score:2)
>>>And how well did that work for your last corrupted text file?
Extremely well. I read all 7 Harry Potter books using a corrupted OCR scan. From time to time the words might read "Harry Rotter pinted his wand at the eneny," but I still understood what it was supposed to be. My brain filtered-out the errors. That corrupted version is better than the computer saying, "I give up" and displaying nothing at all.
.
>>>Say a random bit is missed and the whole file ends up shifted one to the lef
Re: (Score:2)
Sure, but you'd need to not compress anything and introduce redundancy. Most people prefer using efficient formats and doing error checks and corrections elsewhere (in the filesystem, by external correction codes - par2 springs to mind, etc).
Here's a simple image format that will survive bit flip style corruption:
Every pixel is stored with 96 bits: the first 32 are the width of the image, the next 32 are the height, the next 8 bits are the R, the next 8 bits are the G, the next 8 bits are the B, and the last
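A rough sketch of that kind of self-describing pixel record; the comment's final field is cut off, so a simple (and weak) per-pixel check byte stands in for it here, and all names are invented. Each 12-byte record can be decoded on its own, so a flipped bit ruins at most one pixel.
```python
import struct

def encode(width, height, pixels):
    """pixels: list of (r, g, b). Each pixel becomes a self-contained 12-byte record."""
    out = bytearray()
    for r, g, b in pixels:
        check = (width ^ height ^ r ^ g ^ b) & 0xFF   # stand-in for the truncated last field
        out += struct.pack(">IIBBBB", width, height, r, g, b, check)
    return bytes(out)

def decode(blob):
    """Decode whatever records survive; skip any 12-byte record whose check byte fails."""
    good, size = [], struct.calcsize(">IIBBBB")
    for off in range(0, len(blob) - size + 1, size):
        w, h, r, g, b, check = struct.unpack_from(">IIBBBB", blob, off)
        if (w ^ h ^ r ^ g ^ b) & 0xFF == check:
            good.append((r, g, b))
    return good

blob = bytearray(encode(2, 1, [(255, 0, 0), (0, 255, 0)]))
blob[20] ^= 0x40                              # flip a bit in the second pixel's R byte
assert decode(bytes(blob)) == [(255, 0, 0)]   # the damaged pixel is dropped, the rest survives
```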
Re:It's that computer called the brain. (Score:5, Interesting)
Audio CDs have always done this. Audio CDs are also uncompressed*.
The problem, I suspect, is that we have come to rely on a lot of data compression, particularly where video is concerned. I'm not saying this is the wrong choice, necessarily, because video can become ungodly huge without it (NTSC SD video -- 720 x 480 x 29.97 -- in the 4:2:2 colour space, 8 bits per pixel per plane, will consume 69.5 GiB an hour without compression), but maybe we didn't give enough thought to stream corruption.
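The figure checks out: at 8 bits per sample, 4:2:2 averages 2 bytes per pixel (full-resolution luma plus half-resolution Cb and Cr).
```python
# Back-of-the-envelope check of the ~69.5 GiB/hour figure for uncompressed 4:2:2 SD video.
width, height, fps = 720, 480, 29.97
bytes_per_pixel = 2                               # 1 byte luma + 1 byte of shared chroma per pixel
bytes_per_hour = width * height * bytes_per_pixel * fps * 3600
print(bytes_per_hour / 2**30)                     # ~69.45 GiB
```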
MiniDV video tape, when run in SD, uses no compression on the audio, and the video is only lightly compressed, using a DCT-based codec with no delta coding. In practical terms, this means that one corrupted frame of video doesn't cascade into future frames. If my camcorder gets a wrinkle in the tape, it will affect the frames recorded on the wrinkle and no others. It also makes a best-guess effort to reconstruct the frame. Reconstruction may not be impossible with denser codecs that do use delta coding and motion compensation (MPEG, DivX, etc.), but it is certainly made far more difficult.
Incidentally, even digital cinemas use compression. There is no delta coding, but the individual frames are compressed in a manner akin to JPEGs, and in most cinemas the audio is compressed using DTS or AC3 or one of their variants. The difference, of course, is that cinemas must provide a good presentation. If they fail to do so, people will stop coming. If the presentation isn't better than watching TV/DVD/BluRay at home, then why pay the $11?
(* I refer here to data compression, not dynamic range compression. Dynamic range compression is applied way too much in most audio media)
Re: (Score:2)
Re: (Score:2)
The theatres in our area don't have sticky floors! I feel cheated!
Re: (Score:2)
You're right about CDs always having contained this type of ECC mechanism. For that matter, you will see this type of ECC in radio-based communications infrastructure and in the data that gets written to hard disks too! In other words, all modern data storage devices (except maybe flash) contain ECC mechanisms that allow burst-error detection and correction.
So now we're talking about being doubly redundant. Put ECC on the ECC? I'm not sure that helps.
Consider - if a bit twiddles on a magnetic domain. It wi
Re: (Score:2)
Having dealt with corrupted DVDs, I do.
You could always put the redundancy data someplace else entirely, too. Realistically, this is an extension of the RAID concept.
Re: (Score:2)
Re: (Score:2)
If the presentation isn't better than watching TV/DVD/BluRay at home, then why pay the $11?
To get away from the kids? To make out in the back row?
Seriously, people mostly don't go to the movie theatre for the movies.
Also, VLC tends to make a best guess at the frame when there is corruption; it's usually a pretty bad guess, but it still tries.
Re: (Score:2)
This was extremely useful in the 1980s when certain television channels were available from Sweden and Holland but only in a scrambled form.
Re: (Score:2)
but how much of our ability to "read through" noise is because we know what data to expect in the first place?
I know when listening to a radio station that's barely coming in, many times it will sound like random noise, but at some point I will hear a certain note or something and suddenly I know what song it is. Now that I know what song it is, I have no problems actually "hearing" the song. I know what to expect from the song or have picked up on certain cues and I'm guessing that pattern recognition goes
Re: (Score:2)
That would be trivial to do if we were still using BMP and WAV files, where one bit = one speck of noise. But at the file/network level we use a ton of compression, and the result is that a bit error isn't a bit error in the human sense. Bits are part of a block, and other blocks depend on that block. One change and everything that comes after it, until the next key frame, changes. That means two vastly different results are suddenly only a bit apart.
Of course, no actual medium is that stable which is why we use erro
Re: (Score:2)
Hmmm. My DVD player just displays a blue screen when it encounters a corrupt MPEG-2 stream
Sun Microsystems..... zfs..... (Score:3, Insightful)
ZFS.
Next topic....
Re: (Score:2)
Once the files are written to CD or tape you lose the advantage that hardware or filesystem protection gave you
PAR2. Better?
Re: (Score:2)
Why? There's no reason a filesystem like ZFS can't be used on CD or tape and a lot of people do use them.
Even if you didn't want to do that, data CDs already carry sector-level error correction: ISO 9660 data is normally burned as CD-ROM Mode 1 sectors, which add roughly 288 bytes of EDC/ECC for every 2048-byte data block.
Re: (Score:2)
> The Linux powers-that-be are the ones that chose to distribute their product under a license that can't be mixed with others.
That was like... 25 years ago.
That excuse doesn't really work well for anything released recently.
No. It's the authors of newer works that choose not to "play nice".
What files does a single bit error destroy? (Score:3, Insightful)
Re:What files does a single bit error destroy? (Score:4, Informative)
Re: (Score:2, Funny)
Perhaps that is what the poster meant by "bad spot". If "Hitler" were altered to read as "Hatler", I'm pretty sure the meaning would still be clear from the context.
Godvin.
Re: (Score:2)
I actually checked to see if 'v' and 'w' were different by only a single bit. * facepalm *
Re: (Score:2)
People like you are worse than Hatter!
Re: (Score:2)
People like you are worse than matter!
Re: (Score:2)
Hrm.. You may have just explained the Nostradamus prediction of "Hister" instead of "Hitler". A single bit corruption in the "From the Future Media Stream" he was watching.
Yes indeed... I blame the Large Hadron Collider for this.
Re:What files does a single bit error destroy? (Score:5, Insightful)
Pretty much only prefix-code style compression schemes (Huffman, for one) will isolate errors to short segments, and then only if the compressor is not of the adaptive variety.
Re: (Score:2)
Re: (Score:2)
With that in mind, what program would be best to use? I take it WinRAR (which uses RAR and zip compression), or 7-Zip would be of no use...?
Re: (Score:2)
I'd venture to say TrueCrypt containers, when the corruption occurs at the place where they store the encrypted symmetric key. Depending on the size of said container, it could be the whole hard disk. :)
Re: (Score:2)
Fortunately, newer versions of TC have two headers, so if the main one is scrozzled, you can check the "use backup header embedded in the volume if available" under Mount Options and have a second chance of mounting the volume.
Re: (Score:2)
I've got a 10 gig .tar.bz2 file that I've only been partially able to recover due to a couple of bad blocks on a hard drive. I ran bzip2recover on it, which broke it into many, many pieces, and then put them back together into a partially recoverable tar file. Now I just can't figure out how to get past the corrupt pieces. :(
I've lost jpgs that way (Score:2)
Re: (Score:2)
With modern compression and encryption algorithms, a single flipped bit can mean a *lot* of downstream corruption, especially in video that uses deltas, or in chained encryption modes where earlier bits affect how later data is decrypted. A single flipped bit can render an encrypted file completely unusable.
Easy... (Score:3, Funny)
Don't save anything.
What about the "block errors"? (Score:5, Informative)
Of course this can be fixed by block redundancy (like RAID does), block-recovery checksums, or old-fashioned backups.
Re: (Score:2)
Re: (Score:2)
Have you ever wondered why the numbers in RAID levels are what they are? RAID-1 is level 1 because it's like the identity property; everything's the same (kind of like multiplying by 1). RAID-0 is level 0 because it is not redundant; i.e., there's zero redundancy (as in multiplying by 0). It's an array of inexpensive disks, sure, but calling RAID-0 a RAID is a misnomer at best. In fact, RAID-0 was not in the original RAID paper.
No one talking about data protection uses RAID-0, and it's therefore irrelevant
Re: (Score:2)
It means we need error correction at every level: error correction at the physical device (already in place, more or less) and error correction at the filesystem level (so even if a few blocks of a file are missing, the filesystem auto-corrects itself and still functions, up to some point of course).
Re:What about the "block errors"? (Score:4, Informative)
anyone know of the equivalent RAID model for things like tape?
Four tapes data, one tape PAR2.
About time (Score:3, Interesting)
I always create files with unique headers and consistent version numbering to allow for minor as well as major file-format changes. For storage/exchange purposes, I make the format expandable: each subfield/record has an individual header with a field type and a length indicator. Each field is terminated with a unique marker (two NULL bytes) to make the format resilient to errors in the headers, with possible resynchronisation through the markers. The format is in most situations backward compatible to a certain extent, as an old program can always ignore fields/subfields it does not understand in a newer-format file. If that is not an option, the major version number is incremented. This means that a version 2.11 program can read a version 2.34 file with only minor problems. It will not be able to write to that format, though. The same version 2.11 program would not be able to correctly read a version 3.01 file either.
I have not implemented ECC in the formats yet, but maybe the next time I do an overhaul
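A minimal sketch of that kind of tagged, length-prefixed layout; the magic string, version numbers, and field codes below are invented for illustration, and the resynchronisation scan is only hinted at in a comment.
```python
import struct

MAGIC, END = b"MYF1", b"\x00\x00"   # hypothetical magic string and per-field end marker

def write_file(fields, major=2, minor=34):
    """fields: list of (field_type, payload) pairs, each written as type + length + payload + END."""
    out = bytearray(MAGIC + bytes([major, minor]))
    for ftype, payload in fields:
        out += struct.pack(">HI", ftype, len(payload)) + payload + END
    return bytes(out)

def read_file(blob, known_types, supported_major=2):
    """Skip unknown field types; refuse only files with a newer major version."""
    if blob[:4] != MAGIC or blob[4] > supported_major:
        raise ValueError("unreadable file or newer major version")
    fields, off = [], 6
    while off < len(blob):
        ftype, length = struct.unpack_from(">HI", blob, off)
        payload = blob[off + 6:off + 6 + length]
        if ftype in known_types:             # an old reader just ignores fields it doesn't understand
            fields.append((ftype, payload))
        off += 6 + length + len(END)
        # if a length field were damaged, a reader could scan forward for END to resynchronise (not shown)
    return fields

blob = write_file([(1, b"title: holiday"), (2, b"some pixels"), (99, b"field a 2.11 reader ignores")])
assert read_file(blob, known_types={1, 2}) == [(1, b"title: holiday"), (2, b"some pixels")]
```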
Lossy (Score:2, Insightful)
Because all of those are compressed, and take up a tiny fraction of the space that a faithful digital recording of the information on a film reel would take up. If you want lossless-level data integrity, use lossless formats for your masters.
Do not compress! (Score:2, Interesting)
... Efficiency is the enemy of redundancy!
Old documents saved in "almost like ASCII" are still "readable". I once salvaged a document from some obscure ancient word processor by opening it in a text editor. I also found some "images" (more like icons) on the same disk (a copy of a floppy); even these I could "read" (by changing the page width of my text editor to fit the width of the uncompressed image).
As long as the storage space keeps growing...
Re: (Score:2)
Better than not compressing, is compressing and using the space you save for parity data.
Very, very old news.... (Score:3, Informative)
It has been done like that for decades. Look at what archival tape does, or DVDisaster, or modern HDDs.
Also, this does not solve the problem, it just defers it. Why is this news?
Also, Bittorrent (Score:5, Informative)
I know it's not a perfect example, but just one way of looking at it.
Re: (Score:2)
Re: (Score:2)
Wait, I know! We'll make a torrent for the
Re: (Score:2)
Re: (Score:2)
Quickpar... (Score:2)
Quite frankly, data is so duplicated today that bit rot is not really an issue if you know what tools to use, especially if you use tools like QuickPar, which can handle bad blocks, on important data.
Much data is easily duplicated; the data you want to save, if it is important, should be backed up with care.
Even though much of the data I download is easily downloaded again, the stuff I want to keep I QuickPar the archives and burn to disc, and for really important data that is irreplaceable I make multiple copies.
http://ww [quickpar.co.uk]
Parchive: Parity Archive Volume Set (Score:5, Interesting)
It basically allows you to create an archive that's somewhat larger, but contains enough parity that you can have XX% corruption and still 'unzip.'
"The original idea behind this project was to provide a tool to apply the data-recovery capability concepts of RAID-like systems to the posting and recovery of multi-part archives on Usenet. We accomplished that goal." [http://parchive.sourceforge.net/]
KPH
Solution: (Score:3, Insightful)
Re: (Score:3, Funny)
II ccaann ssuuggeesstt eevveenn bbeetteerr iiddeeaa.
Re: (Score:2)
New versions of ISO, ZIP and Truecrypt for this? (Score:2)
If done in an open-standards way, I could be somewhat confident of support in many years' time, when I may need to read the archives. Obviously, backwards compatibility with earlier ISO/file formats would be a plus.
Re: (Score:2)
Film and digital (Score:3, Interesting)
ZFS (Score:2)
Just use ZFS; it's already been done.
kthxbai.
Cloud computing provides an opportunity (Score:4, Funny)
As we're on the cusp of moving much of our data to the cloud, we've got the perfect opportunity to improve the resilience of information storage for a lot of people at the same time.
Forward-error correction instead (Score:2)
I believe that forward error correction (FEC) is an even better model. It's already used for error-free transmission of data over error-prone radio links, and on Usenet via the PAR format; what better way to preserve data than with FEC?
Save your really precious files as Parchive files (PAR and PAR2). You can spread them over several discs, or put several of the files on just one disc.
It's one thing to detect errors, but it's a wholly different universe when you can also correct them.
http://en.wikipedia.org/wiki [wikipedia.org]
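PAR2 itself uses Reed-Solomon coding over many recovery blocks; as a much simpler stand-in for the same idea, here is the one-parity-block version (function names are mine): split the data into equal-sized chunks, keep one extra XOR chunk, and any single lost chunk can be rebuilt.
```python
from functools import reduce

def make_parity(chunks):
    """XOR all equally sized chunks together into one recovery chunk."""
    return bytes(reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), chunks))

def rebuild(chunks, parity, missing_index):
    """Recover the one chunk that was lost, given every other chunk plus the parity chunk."""
    survivors = [c for i, c in enumerate(chunks) if i != missing_index] + [parity]
    return make_parity(survivors)

data = [b"four", b"data", b"tape", b"sets"]   # "four tapes data, one tape parity"
par = make_parity(data)
assert rebuild(data, par, 2) == b"tape"       # the lost third chunk comes back
```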
Linearity is the real problem (Score:2, Insightful)
If I wanted to make a fail-proof image, I would split it into squares of, say, 9 (3x3) pixels, and then put only the central pixel (every 5th px) values in the byte stream. Once that is done, repeat that for s
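The 3x3-square scheme is essentially interleaving; here is a one-dimensional sketch of the same trick (not the poster's exact layout, and the names are mine): spread neighbouring samples far apart in the stream so a burst of damage lands on samples that are far apart in the image and can be interpolated from their neighbours.
```python
def interleave(pixels, stride=9):
    """Emit every 9th pixel first, then every 9th starting at offset 1, and so on."""
    return [pixels[i] for phase in range(stride) for i in range(phase, len(pixels), stride)]

def deinterleave(stream, stride=9):
    """Invert interleave() by writing the stream back in the same visiting order."""
    out, pos = [None] * len(stream), 0
    for phase in range(stride):
        for i in range(phase, len(stream), stride):
            out[i] = stream[pos]
            pos += 1
    return out

pixels = list(range(27))
assert deinterleave(interleave(pixels)) == pixels
```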
Re: (Score:2)
I know I could still read the book even if every fifth letter was replaced by an incorrect one.
That would depend on the book. Your brain could probably error-correct a novel easily enough under those conditions (especially if it was every fifth character, and not random characters at a rate of 20%). But I doubt anyone could follow, say, a math textbook with that many errors.
zfec, Tahoe-LAFS (Score:2)
zfec is much, much faster than par2: http://allmydata.org/trac/zfec
Tahoe-LAFS uses zfec, encryption, integrity checking based on SHA-256, digital signatures based on RSA, and peer-to-peer networking to take a bunch of hard disks and make them into a single virtual hard disk which is extremely robust: http://allmydata.org/trac/tahoe
Clay Tablets (Score:2)
Just resign yourself to the fact that the Code of Hammurabi [wikipedia.org] will outlive your pr0n.
Three Headers, Not Two (Score:2)
> The solution proposed by the author: two headers and error correction code (ECC) in every file."
When there are two possibilities, which one do you choose? Three allows the software to hold a vote among the headers and ignore or correct the loser (assuming that there IS one, of course).
Also, keeping the headers in text, rather than using complicated encoding schemes to save space where it doesn't much matter, is probably a good idea as well. Semantic sugar is your friend here.
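A trivial sketch of the voting idea, assuming the three stored header copies are the same length and that a three-way disagreement falls back to the first copy.
```python
def vote_headers(h1: bytes, h2: bytes, h3: bytes) -> bytes:
    """Byte-wise majority vote across three stored header copies."""
    out = bytearray()
    for a, b, c in zip(h1, h2, h3):
        out.append(a if a == b or a == c else (b if b == c else a))
    return bytes(out)

good = b"FORMAT v1.2 width=640 height=480"
assert vote_headers(good,
                    b"FxRMAT v1.2 width=640 height=480",   # copy with one rotted byte
                    b"FORMAT v1.2 width=640 heigXt=480") == good
```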
Re: (Score:3, Insightful)
Because no one has yet managed to pull things from this theoretical "historical" layer without at least something like an electron microscope costing tens or hundreds of thousands, thousands of hours of skilled *manual* work, and having to crack the damn hard drive open and destroy it (if at all). I believe there is still a challenge going around with a hard drive that was "zeroed" quite simply, and if anyone can recover the password in the single file that was on it before it was zeroed, they can get a
Re: (Score:2, Funny)
Re: (Score:2)
That would, like, totally be possible! Except of course that you would probably find there are a _LOT_ of files of 20GB or less that would give you those exact same hashes. Some of those files will be a valid Blu-ray file. But apart from that, way to go!
It'd be interesting if someone were to do the math on this one; I think you'd be disappointed.
Re: (Score:3, Informative)
Asking for a definition of ecc [google.co.uk] turns it up, so it's obviously not that uncommon. And as we're talking about data corruption, it's the obvious one.
Most IT techs would recognise the term from "ECC RAM", which is RAM capable of correcting bit errors and is often required by server motherboards.