Forgot your password?
typodupeerror
Communications Programming Spam IT Technology

How Mailinator Compresses Its Email Stream By 90% 75

An anonymous reader writes "Paul Tyma, creator of Mailinator, writes about a greedy algorithm to analyze the huge amount of email Mailinator receives and finds ways to reduce its memory footprint by 90%. Quoting: 'I grabbed a few hundred megs of the Mailinator stream and ran it through several compressors. Mostly just stuff I had on hand 7z, bzip, gzip, etc. Venerable zip reduced the file by 63%. Not bad. Then I tried the LZMA/2 algorithm (7z) which got it down by 85%! Well. OK! Article is over! Everyone out! 85% is good enough. Actually — there were two problems with that result. One was that, LZMA, like many compression algorithms build their dictionary based on a fixed dataset. As it compresses it builds a dictionary of common sequences and improves and uses that dictionary to compress everything thereafter. That works great on static files — but Mailinator is not a static file. Its a big, honking, several gigabyte cache of ever changing email. If I compressed a million emails, and then some user wanted to read email #502,922 — I'd have to "seek" through the preceding half-million or so to build the dictionary in order to decompress it. That's probably not feasible.'"
This discussion has been archived. No new comments can be posted.

How Mailinator Compresses Its Email Stream By 90%

Comments Filter:
  • Really Dumb (Score:5, Funny)

    by Anonymous Coward on Tuesday February 21, 2012 @06:16PM (#39117053)

    Is how I feel after reading the referenced article.

    • Yep... Compression algorithm designed with knowledge of what it's compressing does better than compression algorithm without... Also, the Pope is Catholic.

  • Because nobody RTFA (Score:5, Informative)

    by Tyrannosaur ( 2485772 ) on Tuesday February 21, 2012 @06:25PM (#39117139)

    The end result is that he made his own compression-for-emails, where it scans strings in every email and stores the same strings in memory, with the emails storing only pointers to the strings.
    For large emails (he says >20k as an estimate), he applies LZMA on top of that, with a sliding dictionary based on the emails from the last few hours or so.

    All in all a very good read for someone (like me) who has an interest in data compression but knows little about it yet. I like to read other people's thought processes.

    Reminds me of another good read I found in someone's ./ comment about "compressing" random data: http://www.patrickcraig.co.uk/other/compression.htm [patrickcraig.co.uk]

    • by MagicM ( 85041 ) on Tuesday February 21, 2012 @06:37PM (#39117297)

      If you enjoyed reading that, you might also enjoy reading this [ejohn.org] and the follow-up [ejohn.org] about efficiently storing a dictionary of words and dealing with memory v.s processing trade-offs.

      • I did very much enjoy reading that, thank you. The genius of many people trying to solve their own specific problems never ceases to amaze me. (I was also regrettably unaware of what a "trie" was until now- learn something new every day)

      • [ TP neglected to mention the ejohn articles are covering compression using Javascript/Node.js ]

      • Tries and Bloomfilters are wonderful algorithms, because they are simple, if you want something a tad bit more complicated use Locality-Sensitive Hashing [stanford.edu] to find similar documents from a big set of documents.

    • Comment removed based on user account deletion
      • I don't think he was ever trying to get anyone to believe he's not wrong, he seems to just be having a bit of fun by making a work-around that *almost* looks like it works

      • Part of the problem is that Mike Goldman makes as if to outline precise technical constraints on the problem (data file of such size, you tell me this, I send you that, you send me those, they output so-and-so) but includes without being explicit the spirit of the bet. The challenge is about compression, yes, but if you start to give precise constraints on how the bet can be won, you start to imply that any activity within the constraints is fair game.

        The confusion here is about the nature of human communi

  • Just delete the emails that are not on the compressing dictionary.

  • Mailinator Rocks (Score:4, Informative)

    by Jah-Wren Ryel ( 80510 ) on Tuesday February 21, 2012 @06:40PM (#39117309)

    I use mailinator all the time, it is fantasticly useful. Sometimes I encounter a website that won't accept mailinator addresses, some even go to the effort of tracking the alternate domains he uses and blocking them too. I find mailinator so useful that when a website refuses mailionator addresses, I just won't use that website.

    The Mailinator Man's blog is also pretty good, the guy is articulate and has a knack for talking about interesting architectural stuff. This latest entry is just another in a great series, if you like this sort of stuff and haven't read his previous entries you should take the time to read through them.

  • If I compressed a million emails, and then some user wanted to read email #502,922 — I'd have to "seek" through the preceding half-million or so to build the dictionary in order to decompress it. That's probably not feasible.

    What the summary does not say was that, email number 502,922 is special cased and is stored in plain text at the head of the compression dictionary. So it will trivially fetch email number 502,922.

  • That service is pretty cool. Never realized there was something out there like that.

  • by Hentes ( 2461350 ) on Tuesday February 21, 2012 @06:52PM (#39117429)

    Mailinator can achieve high compression rates because most people use it for registration emails. Those mails differ from each other in only a few words, making the data set highly redundant, and easily compressible.

    • by Lehk228 ( 705449 )
      how unique is most email? registration emails, newsletters, h3rb4l v14g|2a, p3n15 3nl4|2g3m3n7, buy stock ____, etc
    • Re: (Score:3, Informative)

      by firewrought ( 36952 )

      Mailinator can achieve high compression rates because most people use it for registration emails. Those mails differ from each other in only a few words, making the data set highly redundant, and easily compressible.

      The accomplishment here is that he determined a very tactical set of strategies for solving a real world problem of large scale. No, it didn't take a math PhD with some deep understanding of Fourier analysis to invent this algorithm, but it most certainly took a software developer who was knowledgeable, creative, and passionate for his task. So yeah... it's not the 90% compression that's impressive, it's the real-time performance that's cool.

    • by Tablizer ( 95088 )

      A common email phrase dictionary based on typical spam....I mean content, would resemble:

      * Viagra
      * prince
      * you won!
      * sexy
      * Nigerian
      * Please send your check to
      * Free free free!
      * $9.95
      * Survive Armageddon

      Just use compressed tokens for those and most spam, I mean emails, would be just a few bytes.

    • by isorox ( 205688 )

      Mailinator can achieve high compression rates because most people use it for registration emails. Those mails differ from each other in only a few words, making the data set highly redundant, and easily compressible.

      Which reminds me, Facebook is backed up onto a single LTO-5

  • Paul Tyma, creator of Mailinator, writes about a greedy algorithm to analyze the huge amount of email Mailinator receives and finds ways to reduce its memory footprint by 90%. Quoting: 'I grabbed a few hundred megs of the Mailinator stream and ran it through several compressors. Mostly just stuff I had on hand 7z, bzip, gzip, etc. Venerable zip reduced the file by 63%. Not bad. Then I tried the LZMA/2 algorithm (7z) which got it down by 85%! Well. OK! Article is over!

    TFA right there.

  • I run a similar (though waaaay less popular) site - http://dudmail.com/ [dudmail.com]
    My mail is stored on disk in a mysql db so I don't have quite the same memory constraints as this.

    I had originally created this site naively stashing the uncompressed source straight into the db. For the ~100,000 mails I'd typically retain this would take up anywhere from 800mb to slightly over a gig.

    At a recent rails camp, I was in need of a mini project so decided that some sort of compression was in order. Not being quite so clever I

  • If anybody responsible for the site comes this way, thank you for the excellent (and free) service.

Suggest you just sit there and wait till life gets easier.

Working...