How Mailinator Compresses Its Email Stream By 90%

How Mailinator Compresses Its Email Stream By 90% 75

Posted by Soulskill on Tuesday February 21, 2012 @06:10PM from the ever-faster-ever-smaller dept.

An anonymous reader writes "Paul Tyma, creator of Mailinator, writes about a greedy algorithm to analyze the huge amount of email Mailinator receives and finds ways to reduce its memory footprint by 90%. Quoting: 'I grabbed a few hundred megs of the Mailinator stream and ran it through several compressors. Mostly just stuff I had on hand 7z, bzip, gzip, etc. Venerable zip reduced the file by 63%. Not bad. Then I tried the LZMA/2 algorithm (7z) which got it down by 85%! Well. OK! Article is over! Everyone out! 85% is good enough. Actually — there were two problems with that result. One was that, LZMA, like many compression algorithms build their dictionary based on a fixed dataset. As it compresses it builds a dictionary of common sequences and improves and uses that dictionary to compress everything thereafter. That works great on static files — but Mailinator is not a static file. Its a big, honking, several gigabyte cache of ever changing email. If I compressed a million emails, and then some user wanted to read email #502,922 — I'd have to "seek" through the preceding half-million or so to build the dictionary in order to decompress it. That's probably not feasible.'"

How Mailinator Compresses Its Email Stream By 90%

This discussion has been archived. No new comments can be posted.

Search 75 Comments Log In/Create an Account

Comments Filter:

Because nobody RTFA (Score:5, Informative)

by Tyrannosaur ( 2485772 ) writes: on Tuesday February 21, 2012 @06:25PM (#39117139)

The end result is that he made his own compression-for-emails, where it scans strings in every email and stores the same strings in memory, with the emails storing only pointers to the strings.
For large emails (he says >20k as an estimate), he applies LZMA on top of that, with a sliding dictionary based on the emails from the last few hours or so.
All in all a very good read for someone (like me) who has an interest in data compression but knows little about it yet. I like to read other people's thought processes.
Reminds me of another good read I found in someone's ./ comment about "compressing" random data: http://www.patrickcraig.co.uk/other/compression.htm [patrickcraig.co.uk]

Re:Because nobody RTFA (Score:5, Informative)

by MagicM ( 85041 ) writes: on Tuesday February 21, 2012 @06:37PM (#39117297)

If you enjoyed reading that, you might also enjoy reading this [ejohn.org] and the follow-up [ejohn.org] about efficiently storing a dictionary of words and dealing with memory v.s processing trade-offs.

Mailinator Rocks (Score:4, Informative)

by Jah-Wren Ryel ( 80510 ) writes: on Tuesday February 21, 2012 @06:40PM (#39117309)

I use mailinator all the time, it is fantasticly useful. Sometimes I encounter a website that won't accept mailinator addresses, some even go to the effort of tracking the alternate domains he uses and blocking them too. I find mailinator so useful that when a website refuses mailionator addresses, I just won't use that website.
The Mailinator Man's blog is also pretty good, the guy is articulate and has a knack for talking about interesting architectural stuff. This latest entry is just another in a great series, if you like this sort of stuff and haven't read his previous entries you should take the time to read through them.

It's not the algorithm, it's the data (Score:5, Informative)

by Hentes ( 2461350 ) writes: on Tuesday February 21, 2012 @06:52PM (#39117429)

Mailinator can achieve high compression rates because most people use it for registration emails. Those mails differ from each other in only a few words, making the data set highly redundant, and easily compressible.

Re:You cut off at the good part. (Score:5, Informative)

by Nemesisghost ( 1720424 ) writes: on Tuesday February 21, 2012 @06:58PM (#39117481)
Actually there is even more to it than that.
1. 1. He broke each email up by lines(or multiple small lines, until the smallest unit was greater than 64 bytes).
2. 2. He compared each line to the set of line from previous emails, building a dictionary of lines that each has in common.
3. 3. Finally, if a "line" in the email is large enough he will LZMA compress it against a sliding dictionary of only the most recent emails
Re:It's not the algorithm, it's the data (Score:3, Informative)

by firewrought ( 36952 ) writes: on Tuesday February 21, 2012 @07:41PM (#39117983)

Mailinator can achieve high compression rates because most people use it for registration emails. Those mails differ from each other in only a few words, making the data set highly redundant, and easily compressible.
The accomplishment here is that he determined a very tactical set of strategies for solving a real world problem of large scale. No, it didn't take a math PhD with some deep understanding of Fourier analysis to invent this algorithm, but it most certainly took a software developer who was knowledgeable, creative, and passionate for his task. So yeah... it's not the 90% compression that's impressive, it's the real-time performance that's cool.

There may be more comments in this discussion. Without JavaScript enabled, you might want to turn on Classic Discussion System in your preferences instead.

How Mailinator Compresses Its Email Stream By 90% 75

How Mailinator Compresses Its Email Stream By 90% More Login

How Mailinator Compresses Its Email Stream By 90%

Because nobody RTFA (Score:5, Informative)

Re:Because nobody RTFA (Score:5, Informative)

Mailinator Rocks (Score:4, Informative)

It's not the algorithm, it's the data (Score:5, Informative)

Re:You cut off at the good part. (Score:5, Informative)

Re:It's not the algorithm, it's the data (Score:3, Informative)

Related Links Top of the: day, week, month.

Slashdot Top Deals

Slashdot