A Look at Data Compression
With the new year fast approaching, many of us face the unenviable task of backing up last year's data to make room for more of the same. With that in mind, rojakpot has taken a look at some of the data compression programs available and has a few insights that may help when looking for the best fit. From the article: "The best compressor of the aggregated fileset was, unsurprisingly, WinRK. It saved over 54MB more than its nearest competitor - Squeez. But both Squeez and SBC Archiver did very well, compared to the other compressors. The worst compressors were gzip and WinZip. Both compressors failed to save even 200MB of space in the aggregated results."
Speed (Score:3, Insightful)
Re:Speed (Score:5, Informative)
Re:Speed (Score:5, Insightful)
Re:Speed (Score:3, Interesting)
If your file starts out as 250MB, it might be worth it. However, if you start with a 2.5GB file, then it's almost certainly not -- especially once you take the closed-source and undocumented nature of the compression algorithm into account.
Re:Speed (Score:5, Interesting)
Another case is if you only have 100 megabytes to work with, and only a zzzxxxyyy archiver can compress your data into that 100MB while gzip -9 leaves you with 102MB.
So it really depends on whether you need it or not. Sometimes you need it; mostly you don't.
But bashing the issue as if "nobody ever needs it" is certainly wrong.
Re:Speed (Score:2)
When you're talking about data files on the order of 2.5 GB, someone is going to find ANY solution other than GPRS. When you're talking about GPRS, even transatlantic sneakernet would be faster (and cheaper).
Plus many providers offer unlimited plans at higher monthly costs. (I know every US-based provider has unlimited data plans for under $10
Re:Speed (Score:3, Funny)
Re:Speed (Score:3, Funny)
"Never underestimate the bandwidth of a stationwagon full of tapes."
or the updated "Never underestimate the bandwidth of a 747 filled with DVDs".
Or the even more updated "Never underestimate the bandwidth of a 747 filled with 500GB HDDs".
Re:Speed (Score:3, Informative)
gzip -9. (My fastest computer is a 900MHz AMD Duron.)
For quick backups, gzip; or gzip -6.
For REALLY quick stuff, gzip -1.
When I want the most space saved, I (rarely) use bzip2 because rar, while useful for splitting files and retaining recovery metadata, is far too slow for my taste 99% of the time.
Really, disk space is so cheap these days that Getting the Backup Done is more important than sav
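As a rough illustration of the speed/size trade-off described above - a minimal sketch assuming GNU gzip and bzip2 are installed and a sample tarball named backup.tar stands in for real data; the numbers will vary with content:

    # Time each gzip level against the same input.
    for level in 1 6 9; do
        echo "gzip -$level:"
        time gzip -$level -c backup.tar > backup.$level.tar.gz
    done
    ls -l backup.tar backup.*.tar.gz
    # bzip2 usually shrinks the file further, but takes noticeably longer:
    time bzip2 -9 -c backup.tar > backup.tar.bz2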
Re:Speed (Score:2)
Re:Speed (Score:3, Insightful)
Exactly! We compress -terabytes- here at wr0k, and we use gzip for -nearly- everything (some of the older scripts use "compress",
Why? 'Cause it's fast. An extra 20% of space just isn't worth the time needed to compress/uncompress the data. I tried to be modern (and cool) by using bzip2; yes, it's great, saves lots of space, etc., but the time required to compress/uncompress is just not worth it. I.e., if you need to compress/decompress 15-20 gigs per day, bzip
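For concreteness, a minimal sketch of that kind of daily job, assuming GNU tar and gzip; the /data and /backup paths are made up for illustration:

    # Favor throughput over ratio for the daily dump:
    day=$(date +%F)
    tar cf - /data/$day | gzip -1 > /backup/$day.tar.gz
    # The bzip2 equivalent saves more space but can take several times longer:
    # tar cf - /data/$day | bzip2 -9 > /backup/$day.tar.bz2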
Re:Speed (Score:2)
Re:Speed (Score:3, Insightful)
Re:Speed (Score:2)
That said, in such situations you're probably better off figuring out a way to span multiple DVDs. Even if squeezing out more compression is enough for you today, chances are you're going to exceed the capacity of that single DVD soon no matter what compression technique you use.
Re:Speed (Score:2)
Secondly, why do they have to put everything on 15 different pages? Does it make it more organized? I think not. It's easier to read the article when everything is together.
Re:Speed (Score:5, Informative)
Re:Speed (Score:3, Funny)
Re:Speed (Score:5, Insightful)
The pages are shamefully loaded with ads! I could barely find the next-page links at the bottom of the window! At first, I thought a "Google Ad" link labeled "compression" might be the next page, and clicked on it! And the true link is oddly hidden in small print, in a corner beneath a large table of PriceGrabber comparison results.
The article is basically unreadable, I'd say, due to the ads.
Re:Why compress in the first place? (Score:5, Insightful)
Sometimes, people have to download things.
Re:Why compress in the first place? (Score:2)
Re:Why compress in the first place? (Score:2)
They usually come as a bunch of RAR files because that's the best way to distribute files over Usenet (where one big file usually doesn't work as well as a whole pile of smaller ones, for a bunch of reasons). The RAR format is just a convenient way to split the files up. Then when people make the torrents or whatever out of the files, the RAR compression sometime
Re:Why compress in the first place? (Score:2)
That's kinda irritating, since BitTorrent doesn't really care how big files are, and file integrity is handled by the client and the
Re:Why compress in the first place? (Score:3, Insightful)
Compressing files with a good compression program does not increase the chance of them being corrupted.
And the majority of files people send to each other aren't simply ASCII files (even if yours are).
The other advantage of using a compression program is that most of them create archives, letting you consolidate all the related files.
A good archive/compression program will add a couple of percent of redundancy data, which can substantially
Re:Why compress in the first place? (Score:5, Interesting)
The solution to this issue is popular on Usenet, since it's common for large files to be damaged there. There's a utility called par2 that allows recovery information to be sent along, and it's extremely effective. It's format-neutral, but most large binaries are sent as multi-part RAR archives. par2 can handle just about any damage that occurs, up to and including missing files.
Most of the time, however, when it's simply someone downloading something, it is only necessary to detect damage so they can download it again. All the formats I have experience with can detect damage, and it's common for MD5 and SHA1 sums to be sent separately anyway for security reasons.
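For anyone who hasn't run into it, a hedged sketch of the par2 workflow described above, using the par2cmdline tool (file names are placeholders):

    # Sender: create recovery data with roughly 10% redundancy for the archive parts.
    par2 create -r10 archive.par2 archive.part*.rar
    # Receiver: check the downloaded parts for damage.
    par2 verify archive.par2
    # If anything is damaged or missing, rebuild it from the recovery blocks.
    par2 repair archive.par2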
Re:Why compress in the first place? (Score:2)
DAR sounds like a cool utility [linux.free.fr]. Do you know of something similar for my Windows boxes?
--Lance
Re:Why compress in the first place? (Score:2)
Re:Why compress in the first place? (Score:5, Interesting)
Because when you are storing petabytes of information, it makes a difference in cost.
Besides, all the problems you mention with data corruption can be solved by backing up the information more than once. Any place that puts a high value on its info is going to have multiple backups in multiple places anyway. The most useful application of compression is in archiving old customer records. Being mostly text, they easily compress by more than 50%. They are also going to be backed up to tape (not disk), and being able to reduce the volume of tapes being stored by 50% can save a lot of money for a large organization.
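A minimal sketch of what that looks like in practice, assuming a Unix host, a tape drive at /dev/st0, and a records directory; the device name and paths are placeholders:

    # Stream the mostly-text records to tape through gzip; plain text routinely
    # compresses by more than half, so roughly twice as much fits per tape.
    tar cf - /archive/customer_records | gzip -9 > /dev/st0
    # Restoring later reverses the pipeline:
    # gzip -dc < /dev/st0 | tar xf -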
What about speed? (Score:2)
Outside of the pure speed issue, what about media swapping? Once you exceed the media capacity (I'm talking removable media), the media needs to be swapped, which not only takes t
Re:What about speed? (Score:2)
Re:What about speed? (Score:2)
On tape, this is not an issue. Serious tape libraries are automated: a mechanical arm loads and extracts the tapes used in a backup.
This is all fine, until you need one more tape than your library holds :).
Mind you, I'm also assuming that anyone really worrying about this is going to be "serious". LTO tapes (great for long-term backup) hold 400GB (LTO3). Transfer speed is about 20MB/s (yes, megabytes).
So, to improve your overall throughput, your compression needs to get anything slightly better than a
Re:Why compress in the first place? (Score:2)
You are better off not compressing and then storing ECCs. Although you are better off still by compressing and *still* storing ECCs.
Re:Why compress in the first place? (Score:2)
That sounds reasonable. A server farm of 1,000 could knock that out in just under two and a half years.
You'd probably use gzip to save time, though. Of course, you'll still need 1,000 CPUs to get it done in under two weeks, or triple that if you're only working in off-peak hours because you don't have a thousand servers sitting around doing nothing.
But I'm guessing anyone who has PBs of data to store is not working on a shoestring budget, and not pa
Re:Why compress in the first place? (Score:2)
Re:Why compress in the first place? (Score:4, Interesting)
Not all data is stored in ASCII and/or ANSI. Compressing the data can make it more secure, not less.
1. It takes up fewer sectors of a drive, so it is less likely to get corrupted.
2. Can contain extra data to recover from bad bits.
3. Allows you to make redundant copies without using any more storage space.
Let's say that you have some ASCII files you want to store. Using almost any compression method, you can probably store 3 copies of each file in the same amount of disk space as the original.
You are far more likely to recover a full data set from three copies of compressed file than from one copy of an uncompressed file.
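A hedged sketch of the three-copies idea, assuming a text file that compresses to well under a third of its original size (all names are placeholders):

    # One compressed master plus two spares, in roughly the space of the original:
    gzip -9 -c records.txt > records.1.gz
    cp records.1.gz records.2.gz
    cp records.1.gz records.3.gz
    # Keep checksums so a corrupted copy can be identified and discarded:
    md5sum records.1.gz records.2.gz records.3.gz > records.md5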
Also, we do not have unlimited bandwidth and unlimited storage everywhere. Lossless video, image, and audio files take up a lot of space, and for some applications MP3, Ogg, MPG, and JPEG just don't cut it.
So yes, compression is still important.
Re:Why compress in the first place? (Score:2)
compression can have more uses than simply saving space.
Re:Why compress in the first place? (Score:3, Informative)
Parity reconstruction
Think of it like the year 2805, where scientists can regrow someone's arm if they happen to lose it.
Because it makes a hell of a lot of sense. (Score:5, Insightful)
Let's say you have a 200MB file to send. You could just send the 200MB file, with no guarantee that it will reach the destination uncorrupted. Or you could use a compression program and bring it down to 100MB; in that case, even if the first transfer is lost, you could send it a second time for the same total bandwidth. Then we look at PAR. You compress the 200MB file into ten 10MB files, then include 10% parity: if any one of your files is bad, you'd be able to reconstruct it from the parity file, with only 110MB of transfer. PAR2 goes even further by breaking each file down into smaller units.
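In command-line terms, the split-plus-parity scheme just described looks roughly like this (a sketch only; split and par2 stand in for whatever splitter and parity tools are actually used, and the names are placeholders):

    # Split the compressed file into ten 10MB pieces:
    split -b 10M bigfile.gz piece_
    # Add about 10% recovery data; any single damaged or missing piece can be rebuilt:
    par2 create -r10 bigfile.par2 piece_*
    # The recipient repairs (if needed) and reassembles:
    par2 repair bigfile.par2
    cat piece_* > bigfile.gz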
Besides transfer times and error correction for network transfers, compression can also increase transfer speeds to media. If you have an LTO tape drive that can only write to tape at 20MB/sec, you'll only ever get 20MB/sec of raw throughput. Add compression, and you could theoretically push 40MB/sec of data to tape at 2:1 compression. That means faster backups and faster restores. On-board compression in the drive takes all the load off the CPU, but even if you use the CPU for it, modern CPUs are fast enough to handle it.
Not to mention, it takes a lot less tape to make compressed backups. I don't know what world you live in, but in mine, I don't have unlimited slots in the library and I don't want to swap tapes twice a day. Handling tapes is detrimental to their lives; you really want to touch them as little as possible.
Data corruption isn't caused by compression; if it's going to happen, it'll happen regardless. It's true that it MAY be more difficult to recover from a corrupt compressed file, but the answer to that isn't avoiding compression: if your backups are that valuable, you make multiple copies - plain and simple.
I can't fathom why a responsible and well informed admin would avoid compression.
Re:Because it makes a hell of a lot of sense. (Score:3, Insightful)
The only argument for compression is not the cost of media - in fact
Re:Why compress in the first place? (Score:2)
More time = More compression (Score:5, Insightful)
The one surprising thing I found in the article was that two virtually unknown contenders - WinRK and Squeez - did so well. An obvious follow-up question, disappointingly left unanswered, is how more well-known applications such as WinZip or WinRAR (which have a more mass-appeal audience) stack up against them with their higher-compression options enabled.
Re:More time = More compression (Score:3, Funny)
I had a good laugh at that one when I figured out how it worked, way back in the BBS days.
Re:More time = More compression (Score:2, Interesting)
Not only time, but also how much memory the algorithm uses, though the author did not mention how much memory each algorithm needs. gzip, for instance, does not use much, but others, like rzip (http://rzip.samba.org/ [samba.org]), use a lot; rzip may use up to 900MB during compression.
I did a test with compressing a 4GB tar archive with
Re:More time = More compression (Score:2)
Re:More time = More compression (Score:5, Interesting)
So I would consider gzip the best performer by this criterion. After all, if I cared most about space savings, I'd have picked the best mode, not the fast mode. All this article suggests is that a few archivers are REALLY lousy at doing FAST compression.
If my requirements were realtime compression (maybe for streaming multimedia), then I wouldn't bother with some mega-compression algorithm that takes 2 minutes per MB to pack the data.
Might I suggest a better test? If interested in best compression, run each program in the mode that optimizes purely for compression ratio (a rough sketch of that approach is below). On the other hand, if interested in realtime compression, tweak each algorithm's parameters so that they all run in the same (relatively fast) time, and then compare compression ratios.
With the huge compression of multimedia files, I'd also want the reviewers to state explicitly that the compression was verified to be lossless. I've never heard of some of these proprietary apps, but if they're getting significant ratios out of
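Something like the best-compression half of that proposal fits in a few lines of shell; a sketch assuming gzip, bzip2 and p7zip are installed, with testset.tar standing in for the review's file set:

    # Run each compressor flat out on the same input and compare time and size.
    f=testset.tar
    time gzip -9 -c $f > $f.gz
    time bzip2 -9 -c $f > $f.bz2
    time 7z a -mx=9 $f.7z $f > /dev/null
    ls -l $f $f.gz $f.bz2 $f.7z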
Compressia (Score:2)
Re:Compressia (Score:2, Informative)
Re:Compressia (Score:2)
WinRK is excellent (Score:5, Interesting)
Nice Comparison... (Score:5, Insightful)
I personally use 7-Zip. It doesn't perform the best, but it is free software and it includes a command-line component that is nice for shell scripts.
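For example, a nightly backup line using the command-line build (7z from p7zip on Unix, 7z.exe on Windows); the paths and names here are made up:

    # Add the project directory to a dated archive at maximum compression:
    7z a -t7z -mx=9 backups/project-$(date +%F).7z ~/projects/myproject
    # List the archive contents to confirm it was written:
    7z l backups/project-$(date +%F).7z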
Re:Nice Comparison... (Score:2)
There are better compressors out there; in particular, PPM codecs can achieve spectacular ratios, but as they're very slow to both compress and decompress, they're usefu
Re:Nice Comparison... (Score:2, Interesting)
Speed is better than bzip2 and compression is top class, beaten only by 7-Zip and LZMA compressors (which require much more time and memory). The problem is that decompression runs at the same speed as compression, unlike bzip2/gzip/zip, where decompression is much faster.
The review quoted above is totally useless because 7-Zip, for example, uses a 32Kb dictionary. Given a 200Mb dictionary it really start
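For what it's worth, the dictionary size can be raised from the command line; a hedged example (switch syntax may differ between 7-Zip versions, and a large dictionary needs correspondingly more RAM):

    # Maximum preset with a 64MB LZMA dictionary:
    7z a -mx=9 -md=64m archive.7z large_input.tar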
Re:Nice Comparison... (Score:2)
My box has always been fairly stable, but even more so under SP2.
Re:Nice Comparison... (Score:2)
Windows only (Score:3, Interesting)
Actually (Score:5, Interesting)
This is a surprisingly big subject (Score:5, Informative)
Re:This is a surprisingly big subject (Score:2)
Still, worth remembering, especially as these algorithms are being improved all the time.
Open formats and long-term accessibility (Score:5, Insightful)
The same can't be said for WinRK. Therefore, if you expect to need access to your data for a long period of time, you should carefully consider whether the format will remain accessible.
Unix compressors (Score:5, Interesting)
Re:Unix compressors (Score:2)
Re:Unix compressors (Score:2)
Re:Unix compressors (Score:2)
Re:Unix compressors (Score:2)
Size doesn't depend on the OS it was compressed on (generally - perhaps a small bit, at most). So he compressed it for size on Windows (or an OS with an ACE compressor).
Speed, however, does depend on the OS it was compressed on. Much more than size, at any rate. So the results would have been skewed in one direction or the other, due to the OS.
Re:Unix compressors (Score:2)
Sorry for nitpicking, but come on, how can you use WinACE on Windows to do the size compressions and then use all the other compressors
Re:Unix compressors (Score:2)
Re:rzip? (Score:2)
Re:Unix compressors (Score:2)
Just use DiskDoubler (Score:5, Funny)
It's running perfectly fine on my Mac IIci.
Re:Just use DiskDoubler (Score:3, Funny)
Re:Just use DiskDoubler (Score:3, Insightful)
Re:Just use DiskDoubler (Score:2)
Input type? (Score:3, Interesting)
Also, do any of you know any lossless algorithms for media (movies, images, music, etc)? Most algorithms perform poorly in this area, but I thought that perhaps there were some specifically designed for this.
Re:Input type? (Score:3, Interesting)
Why compress in weird formats? (Score:5, Insightful)
I generally prefer gzip/7-Zip.
The reasoning is simple: I can use the results cross-platform without special, costly software. A few extra bytes of space are secondary.
For many files, I also find buying a larger disk a cheaper option than spending hours compressing/uncompressing them. So I generally only compress files that are very compressible and that I don't think I will need.
Re:Why compress in weird formats? (Score:3, Insightful)
Re:Why compress in weird formats? (Score:2)
Has anyone heard of WinUHA yet? That is supposed to be pretty good, and I'd not mind testing out other archivers, as long as the time savings on transferring smaller files aren't overtaken by the compression/decompression time. Though, again, all these things are useless if no one can uncompress them.
Re:Why compress in weird formats? (Score:3, Interesting)
small mistake (Score:5, Interesting)
Since WinZip does not handle .7z, .ace or .rar files, it has lost much of its appeal for me. With my old serial no longer working, I now have absolutely no reason to use it. Now when I need a compressor for Windows I choose WinAce & 7-Zip. Between those two programs, I can de-/compress just about any format you're likely to encounter online.
Compress to 0K (Score:2, Funny)
I carry all the data from my entire server farm like that on a 128MB USB stick.
Nothing to see here (Score:5, Informative)
I can't believe TFA made /. The only thing more defective than the benchmark data set (Hint: who cares how much a generic compressor can save on JPEGs?) is the absolutely hilarious part where the author just took "fastest" for each compressor and then tried to compare the compression. Indeed, StuffIt did what I consider the only sensible thing for "fastest" in an archiver, which is to just not even try to compress content that is unlikely to get significant savings. Oddly, the list for fastest compression is almost exactly the reverse of the list for best compression on every test. The "efficiency" is a metric that illuminates nothing. An ROC plot of rate vs compression for each test would have been a good idea; better would be to build ROC curves for each compressor, but I don't see that happening anytime soon.
I wouldn't try to draw any conclusions from this "study". Given the methodology, I wouldn't wait with bated breath for parts two and three of the study, where the author actually promises to try to set up the compressors for reasonable compression, either.
Ouch.
Re:Nothing to see here (Score:3, Informative)
That's sort of the point of this test though, to see which of the general-purpose compressors (GPC) is going to give you the best overall results. Yes, you should use FLAC [sourceforge.net] for WAVs, and probably StuffIt [stuffit.com] for JPEGs, but what is your best choice if you're going to have just one, or just a few? I don't want 200 different compressors for 200 different content typ
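To make the FLAC example concrete, assuming the flac command-line encoder and placeholder filenames:

    # Lossless audio compression; --best trades encode time for a smaller file:
    flac --best -o song.flac song.wav
    # Decoding restores the original samples bit-for-bit:
    flac -d -o song_restored.wav song.flac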
Maximum Compression has efficiency comparisons (Score:5, Informative)
Related Links Broken (Score:3, Funny)
I've searched the FAQ, but I can't figure out how to contact slashdot admins. Does anyone know an email address or telephone number I can use to contact them about this serious problem? I'm sure they'll want to fix it as quickly as possible.
No one ever looks at rzip (Score:4, Interesting)
Decompression Speed (Score:4, Interesting)
Unicode support? (Score:3, Informative)
I've been using 7-Zip for this reason, and also because it compresses well while also working on Windows and Linux.
accuracy test missing (Score:2, Insightful)
There's an article in there somewhere? (Score:5, Insightful)
JPG compression (Score:5, Interesting)
I'd heard the makers of StuffIt were claiming this, but I was sceptical; it's good to see independent confirmation.
Completely out of context (Score:5, Informative)
Why does ANYBODY Bother with WinZip? (Score:4, Interesting)
Proprietary, costs money...
I use ZipGenius - it handles 20 compression formats, including RAR, ACE, JAR, TAR, GZ, BZ, ARJ, CAB, LHA, LZH, RPM, 7-Zip, OpenOffice/StarOffice Zip files, UPX, etc.
You can encrypt files with one of four algorithms (CZIP, Blowfish, Twofish, Rijndael AES).
If you set an antivirus path in ZipGenius's options, the program will prompt you to perform an AV scan before running the selected file.
It has an FTP client, TWAIN device image importing, file splitting, RAR-to-SFX conversion, conversion of any Zip archive into an ISO image file, and so on.
And it's totally free.
Re:Why does ANYBODY Bother with WinZip? (Score:2)
This test is worthless (Score:4, Informative)
Re:This test is worthless (Score:2)
Not to mention that some programs differ a LOT between fastest and slowest settings and some don't...
It's just bullshit.
Same for his example data: nearly everything there was already compressed inside the file container... who the fuck wants to save space by compressing video or JPEGs?
A realistic test would be stuff where compression actually saves something, like log files. A look at Maximum Compression tells me that there are programs that can compress Apache logs to less than half
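The log-file point is easy to check on any web server; a sketch assuming an access.log in the current directory:

    # Plain-text access logs typically shrink dramatically:
    ls -l access.log
    gzip -9 -c access.log > access.log.gz
    ls -l access.log.gz
    # And you can still search the compressed copy without unpacking it to disk:
    zcat access.log.gz | grep " 404 "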
Lest We Forget - Philip W. Katz (Score:4, Interesting)
Phillip W. Katz, better known as Phil Katz (November 3, 1962 - April 14, 2000), was a computer programmer best known as the author of PKZIP, a program for compressing files that ran under the PC operating system DOS.
http://en.wikipedia.org/wiki/Phil_Katz [wikipedia.org]
Embarassing ads - This is an ad cash-grab (Score:3, Insightful)
Please, editors, check the sites out first. If it's 90% ads and impossible to navigate without clicking ads accidentally, it's just some loser's cash-grab site.
Pile of ad-laden shit article (Score:3, Funny)
What a steaming pile of shit. Happy new year.
Re:Quite interesting (Score:2)
Re:Speaking of Comparisons (Score:3, Interesting)
I knew I had seen this story before, but it wasn't here. This article was up on Digg three days ago [digg.com], with only three Diggs to its name (at the time of this writing), but it's front-page news here? Interesting, to say the least...
I predict that this Digg [digg.com] will become front-page Slashdot news shortly. It was quite popular (914 diggs so far) and it's hit the three-day mark...
I know, this is all so OT, but it's no worse than whining about duplicate postings here.
Wrong, wrong, wrong, wrong, wrong. (Score:3, Informative)
Huffman coding and arithmetic coding are both entropy encoding algorithms. While perfectly fine compression algorithms in their own right, they're also commonly used to squeeze the last bits of entropy out of a data stream produced by another compression or transformation algorithm. Arithmetic coding suffers from chilling effects caused by IBM patents, and so isn't as commonly used as it might be. An unencumbered alternative is range encoding, which gives performance not too far off that of arithmetic coding.
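For reference, the standard result behind "squeezing out the last bits of entropy": for a source with entropy H(X) bits per symbol, the expected length L of a Huffman code satisfies

    H(X) <= L < H(X) + 1

per symbol, while arithmetic (and range) coding over long inputs can get arbitrarily close to H(X), which is why they are favored as the final entropy-coding stage when patents don't get in the way.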