How Mailinator Compresses Its Email Stream By 90%
An anonymous reader writes "Paul Tyma, creator of Mailinator, writes about a greedy algorithm that analyzes the huge amount of email Mailinator receives and finds ways to reduce its memory footprint by 90%. Quoting: 'I grabbed a few hundred megs of the Mailinator stream and ran it through several compressors. Mostly just stuff I had on hand: 7z, bzip, gzip, etc. Venerable zip reduced the file by 63%. Not bad. Then I tried the LZMA/2 algorithm (7z), which got it down by 85%! Well. OK! Article is over! Everyone out! 85% is good enough. Actually, there were two problems with that result. One was that LZMA, like many compression algorithms, builds its dictionary based on a fixed dataset. As it compresses, it builds a dictionary of common sequences, continually improving it and using it to compress everything thereafter. That works great on static files, but Mailinator is not a static file. It's a big, honking, several-gigabyte cache of ever-changing email. If I compressed a million emails and some user then wanted to read email #502,922, I'd have to "seek" through the preceding half-million or so to build the dictionary in order to decompress it. That's probably not feasible.'"
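As a rough illustration of the trade-off he describes, here is a toy Python sketch (not Mailinator's actual code; the messages are made up): one solid LZMA stream compresses best but reading message N means decompressing everything before it, while compressing each message independently allows random access at a worse ratio.

import lzma

emails = [("From: user%d@example.com\nSubject: hi\n\nbody %d\n" % (i, i)).encode()
          for i in range(1000)]

# One solid stream: best compression ratio, but no random access.
solid = lzma.compress(b"".join(emails))
# Reading email #502 means decompressing the entire preceding stream.
everything = lzma.decompress(solid)

# Per-message compression: each blob decompresses on its own.
per_message = [lzma.compress(m) for m in emails]
one_email = lzma.decompress(per_message[502])  # direct random access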
Re:Author confused? (Score:2, Interesting)
One was that LZMA, like many compression algorithms, builds its dictionary based on a fixed dataset. As it compresses, it builds a dictionary of common sequences, continually improving it and using it to compress everything thereafter.
What?! LZMA keeps a dictionary of recent data, not a "fixed dataset".
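To see that the dictionary is a bounded sliding window rather than a fixed dataset, here is a small illustrative sketch using Python's lzma filter chain, where the window is just the dict_size knob:

import lzma

# The "dictionary" is a sliding window over recent output; its size is
# just a parameter (here 16 MiB), not a dataset fixed in advance.
filters = [{"id": lzma.FILTER_LZMA2, "dict_size": 1 << 24}]
data = b"hello mailinator " * 10000
blob = lzma.compress(data, format=lzma.FORMAT_XZ, filters=filters)
assert lzma.decompress(blob) == data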
It's a big, honking, several-gigabyte cache of ever-changing email. If I compressed a million emails and some user then wanted to read email #502,922, I'd have to "seek" through the preceding half-million or so to build the dictionary in order to decompress it. That's probably not feasible.
This is called a solid archive; what the author wants is a non-solid archive.
Seriously.
7zip. LZMA2. Whatever speed/compression setting you want (I always roll 9). Non-solid mode or a solid block size limited to whatever size you want (or whatever number of files you want, or both).
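A minimal Python sketch of the bounded-solid-block idea (the block cap and the length-prefix framing are assumptions for illustration, not 7zip's actual container format): group messages into blocks, compress each block separately, and reading any one message costs at most one block's decompression.

import lzma

BLOCK_FILES = 1024  # cap on messages per "solid block"

def build_blocks(emails):
    # Length-prefix each message so a block can be split apart again.
    blocks = []
    for i in range(0, len(emails), BLOCK_FILES):
        group = emails[i:i + BLOCK_FILES]
        payload = b"".join(len(m).to_bytes(4, "big") + m for m in group)
        blocks.append(lzma.compress(payload))
    return blocks

def read_email(blocks, index):
    # Decompress only the one block that holds the wanted message.
    block = lzma.decompress(blocks[index // BLOCK_FILES])
    pos = 0
    for _ in range(index % BLOCK_FILES + 1):
        size = int.from_bytes(block[pos:pos + 4], "big")
        msg = block[pos + 4:pos + 4 + size]
        pos += 4 + size
    return msg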
LZMA2 automagically does its dictionary thing, and the non-solid nature does it per file; if you limit solid block size, it does it per group of n files, or per group of files that fit in size x, or both. If you have a lot of duplication across files so far apart that they won't share a dictionary under LZMA2, you can get some improvement by first building a master dictionary (across all files, ignoring non-solid mode or solid block limits) for those duplicated chunks, writing down all the pointer locations for them, and then sending the rest of the data to LZMA2 to be compressed.
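That "master dictionary" trick has an off-the-shelf analogue in preset dictionaries. Here is a minimal Python sketch using zlib's zdict support (the dictionary contents below are made-up stand-ins for sequences mined from the stream): every message still decompresses independently, but all of them share the savings on common sequences.

import zlib

# Stand-in for a dictionary mined from the whole stream (real candidates
# would be common headers, MIME boilerplate, repeated spam bodies).
shared_dict = b"Received: from Content-Type: text/plain; charset=UTF-8 MIME-Version: 1.0"

def compress_msg(msg):
    c = zlib.compressobj(level=9, zdict=shared_dict)
    return c.compress(msg) + c.flush()

def decompress_msg(blob):
    d = zlib.decompressobj(zdict=shared_dict)
    return d.decompress(blob) + d.flush()

msg = b"Content-Type: text/plain; charset=UTF-8\r\n\r\nHello!\r\n"
assert decompress_msg(compress_msg(msg)) == msg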
Of course, this sort of shit is already supported by 7zip. Use PPMd or whatever method/filter you think is good for text, then send that shit to LZMA2.
This story is basically "I did something needlessly complex because I didn't RTFM for 7zip".
7z a -t7z emails.7z emailDirectory\ -m0=PPMd:mem=30:o=12 -m1=LZMA2 -mt=4 -mx=9 -ms=1024f256m
Add all files from "emailDirectory\" to "emails.7z" using the PPMd compressor with 1 GB of memory (for both compressing and decompressing) and a model order of 12, then pass the result through LZMA2. Try to use 4 cores, use the "ultra" compression preset, and cap each solid block at 1024 files and 256 MB.
Re:You cut off at the good part. (Score:4, Interesting)
Deduplication is compression.
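In the sense that storing one copy plus pointers removes redundancy the same way a compressor's dictionary does. A minimal Python sketch of the content-addressed version (names illustrative, not Mailinator's):

import hashlib

store = {}   # sha256 hex -> unique message body
inbox = []   # a mailbox is just an ordered list of hashes

def put(body):
    key = hashlib.sha256(body).hexdigest()
    store.setdefault(key, body)  # the millionth identical spam costs one pointer
    inbox.append(key)

def get(i):
    return store[inbox[i]]

put(b"BUY NOW!!!")
put(b"BUY NOW!!!")  # deduplicated: still only one body in store
assert get(0) == get(1)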