Spam The Internet Technology

Seven Spam Filters Compared

Goo.cc writes "Those wondering how their spam filtering software performs in comparison to others may want to read this article on Freshmeat, where Sam Holden performs comparative testing of various popular e-mail filters. The filters tested include Bayesian Mail Filter, Bogofilter, dbacl, Quick Spam Filter, SpamAssassin, SpamProbe, and SPASTIC."
  • by Anonymous Coward on Saturday August 23, 2003 @01:56PM (#6773891)
    Sounds great, but until I hear about software products like these in my morning mailbox, I don't really trust that they're any good.
    • Hmmm, perhaps we should send this story to everyone we know, everyone on usenet and everyone listed in all those online directories? I'm sure they'll all appreciate an article that'll help them cut down on the amount of spam that they receive.

      Yeah, I think I'll be a good samaritan and do that ASAP. Now, where's that open relay...
  • by TexTex ( 323298 ) * on Saturday August 23, 2003 @02:00PM (#6773930)
    The author makes a good attempt at comparing these products, but I don't think his samples are in-depth enough to come up with real-world results.

    For Bayes testing, he used 68 spam and 68 ham messages. Spamassassin for one won't even activate bayes until it's learned from 200 messages; it's not uncommon for those who regularly deal with spam management on the server side to use 5000-10,000 message corpuses to test new rule additions and to train spam.

    The low number might have a slight effect if most of your mail contains similar characteristics, but I'd much rather have seen bigger numbers of samples.

    • by cly ( 457948 ) <myspampot@@@yahoo...com> on Saturday August 23, 2003 @02:13PM (#6773980)
      I guess you wrote this after reading the first two experiments.

      In the third he used 1200.

      Nice way to jump the gun.
      • by arth1 ( 260657 ) on Saturday August 23, 2003 @03:03PM (#6774213) Homepage Journal
        I guess you wrote this after reading the first two experiments.

        In the third he used 1200.


        1273, out of which 1073 were spam. That leaves 200 non-spam messages, which isn't enough for Spamassassin's bayesian filtering to kick in, even if all messages were to be classified as ham or spam, and not just let through.

        To quote sa-learn's man page:
        Another thing to be aware of, is that typically you should
        aim to train with at least 1000 messages of spam, and 1000
        ham messages, if possible. More is better, but anything
        over about 5000 messages does not improve accuracy
        significantly in our tests.
        The low number of emails, combined with no apparent manual reading on part of the author, makes me want to disregard this whole survey as pure drivel.

        Regards,
        --
        *Art
        • by hamster foo ( 697718 ) on Saturday August 23, 2003 @04:03PM (#6774449)
          "Also, SpamAssassin has a Bayesian classifier built in, but it wasn't used in these tests, since having five was enough."

          While I'm sure the recommendations set forth in Spam Assassin's man page are probably a good idea for all Bayesian training sets, he wasn't using the Bayesian filtering included in Spam Assassin, so you can't really fault him for not reading a section of the man page for a feature he was choosing to leave out.

          It would have been nice to see him turn on Spam Assassin's Bayesian filtering at least in some of the tests. I don't think test results with a feature I would imagine the vast majority of users would use turned off are a very good comparison of the different packages' abilities.
        • Being so data-hungry is a potentially crippling disadvantage to the bayesian approach. Anything that requires 1000 messages of each type just to get started is useless to lots of people. It would take me half a year to prime the thing on my home email address.
        • Later on, he criticized sa-learn's manual for indicating that you should have an even ratio of spam to "ham", which was not borne out by his tests. Obviously, he did read the manual.
        • I quoted from a section a bit further down in that manpage - which indicates I just might have read the damn thing.

          That spamassassin has a limit that my sample data didn't reach isn't of real concern to me. I can't just magically create some emails, I only have the emails that I have received.

          A bayesian filter should work reasonably well with unbalanced training data. A Paul Graham style "let's ignore the huge amount of research in the field and make stuff up" filter will have problems because it igno
    • by Sanctuary ( 124701 ) on Saturday August 23, 2003 @02:18PM (#6774002)
      They didn't train Spamassassin to use the bayes filter once during the test, and they used it without all the other scoring tools for Spamassassin. This review really didn't completely test Spamassassin's full potential.
      • After training, SA does a good job, but really, it catches only around 90% of my spam. Every once in a while I do run sa-learn as spam on my spam folder and as ham on my mailing list/work folders, to keep it updated. I've also put the blacklists in the global prefs to keep it updated. Still, a 90% reduction is nice.

        But Thunderbird catches the rest; hardly any spam makes it through now.

      • by skookum ( 598945 ) on Saturday August 23, 2003 @03:53PM (#6774414)
        Agreed. The author made up the artificial constraint that "no program is allowed to contact the network" which means that SpamAssassin wasn't able to check the DNS blacklists for things like exploited open proxies/relays in the Received chain, or to check with distributed signature services like RAZOR/DCC, etc.

        If you're not going to let the program use its full capabilities, why test it?

        Analogously, what kind of hardware review site would do a review along the lines of "This motherboard supports this extra feature that will improve CPU speed noticeably, but we're going to disable it for our tests (even though most of you would want to use it.)"
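For readers unfamiliar with the DNS blacklist checks mentioned above, here is a minimal sketch of how such a lookup name is usually built: the IPv4 octets are reversed and the blacklist's zone is appended, and the result is then resolved like any other hostname. The zone `dnsbl.example.org` is a placeholder, not a real list, and the sketch deliberately stops short of making any network query.

```python
import ipaddress


def dnsbl_query_name(ip, zone):
    """Build the hostname a DNS blacklist lookup would resolve.

    The octets of the IPv4 address are reversed and the zone is appended.
    """
    addr = ipaddress.IPv4Address(ip)  # raises ValueError on malformed input
    reversed_octets = ".".join(reversed(str(addr).split(".")))
    return f"{reversed_octets}.{zone}"


# Placeholder zone and a documentation-range address, for illustration only.
print(dnsbl_query_name("192.0.2.99", "dnsbl.example.org"))
```

A filter would resolve that name and treat a successful answer as "listed"; because the answer comes through ordinary DNS, it is cached like any other record.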
        • Seems to me like it isn't an artificial constraint, but merely a practical one. It sounds like he scripted the programs to run through his data all at once, so querying the online resources a thousand times an hour would not be feasible. The Bayesian filters were at a similar disadvantage because of the automated testing: normally, each false negative gets added to the spam corpus, which would have improved their accuracy over time.
          • In the case of the RBL (realtime block-lists), they use the existing infrastructure of DNS and so the load is fully distributed and cached. The first time you make a query for the status of a given IP address, you'll probably end up getting a response from one of the authoritative nameservers, but all subsequent queries for the same name will be cached without any extra burden on the nameserver. Additionally, there are usually many slave/secondary nameservers for the main RBLs, so load is not too much of
        • The testing was done a month after the actual emails were received. Using such resources would allow the filter the benefit of hindsight. As in foo sends lots of spam and ends up on a blacklist, but I received a bunch of spam from foo before it got on the list.

          So it wasn't artificial. I mentioned in the article why I made that constraint.

          I also didn't retrain bayesian filters on false-negatives before giving them later emails, which isn't normal use of them either.
  • by ceswiedler ( 165311 ) * <chris@swiedler.org> on Saturday August 23, 2003 @02:02PM (#6773939)
    IMO, the best way to go with spam is to combine a heuristic filter with a text/Bayesian filter, in my case SpamAssassin and SpamProbe. I run them both, and it does a noticeably better job than either running alone.

    SpamProbe can be fooled by clever spammers who insert lots of common words in non-visible html. A Bayesian filter can't really catch that, but a heuristic filter can be written to notice the pattern.

    Also, set up your Bayesian filter to re-learn regularly from your spam folder. SpamProbe adds a unique ID to each message, so it won't process a message twice. Therefore, you can just manually move any false negative spams into the folder, and they'll be learned from.
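The "won't process a message twice" behaviour can be sketched as below. This is only an illustration of the idea, not SpamProbe's actual mechanism: the comment does not say how the ID is derived, so a SHA-256 digest of the raw message stands in for it, and `messages_to_learn` is a hypothetical helper.

```python
import hashlib


def message_id(raw_message):
    """Return a stable digest identifying a message by its exact bytes."""
    return hashlib.sha256(raw_message).hexdigest()


def messages_to_learn(folder, seen_ids):
    """Yield only messages not learned before, recording each digest."""
    for raw in folder:
        digest = message_id(raw)
        if digest in seen_ids:
            continue  # already learned on an earlier pass
        seen_ids.add(digest)
        yield raw


seen = set()
spam_folder = [b"Subject: cheap pills\n\nbuy now", b"Subject: hello\n\nwin cash"]
first_pass = list(messages_to_learn(spam_folder, seen))

# A false negative is moved into the folder by hand before the next pass.
spam_folder.append(b"Subject: new one\n\nfree offer")
second_pass = list(messages_to_learn(spam_folder, seen))
```

On the second pass only the newly added message is handed to the learner, which is why re-running the job over the whole folder is harmless.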
    • WRONG. (Score:5, Informative)

      by imsabbel ( 611519 ) on Saturday August 23, 2003 @02:38PM (#6774099)
      Of course your Bayesian filter will QUICKLY learn that HTML tags that create invisible text are VERY common in spam and nowhere else -> problem solved.
      Don't forget that the filter sees more than the eye...
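A toy illustration of the point above: if the tokenizer keeps HTML markup, the tags used to hide text become strong spam evidence in their own right, so padding a message with innocent words does not rescue it. This is a deliberately simplified naive Bayes sketch with add-one smoothing, not the algorithm of SpamProbe, SpamAssassin, or any filter in the article; the training messages are made up.

```python
import math
from collections import Counter


def tokenize(text):
    """Split on whitespace and keep markup, so HTML tags count as tokens."""
    return text.lower().split()


class TinyBayes:
    def __init__(self):
        self.counts = {"spam": Counter(), "ham": Counter()}
        self.totals = {"spam": 0, "ham": 0}  # messages seen per class

    def train(self, label, text):
        # Count each token once per message (presence, not frequency).
        self.counts[label].update(set(tokenize(text)))
        self.totals[label] += 1

    def spam_log_odds(self, text):
        """Positive means more spam-like, negative more ham-like."""
        score = math.log((self.totals["spam"] + 1) / (self.totals["ham"] + 1))
        for tok in set(tokenize(text)):
            p_spam = (self.counts["spam"][tok] + 1) / (self.totals["spam"] + 2)
            p_ham = (self.counts["ham"][tok] + 1) / (self.totals["ham"] + 2)
            score += math.log(p_spam / p_ham)
        return score


f = TinyBayes()
f.train("spam", 'buy pills <font color="#ffffff"> meeting lunch report </font>')
f.train("spam", 'cheap pills <font color="#ffffff"> project schedule </font>')
f.train("ham", "meeting about the project schedule")
f.train("ham", "lunch report attached")

padded_spam = 'pills <font color="#ffffff"> meeting project lunch </font>'
plain_ham = "meeting about the project schedule"
print(f.spam_log_odds(padded_spam) > 0, f.spam_log_odds(plain_ham) < 0)
```

The padding words appear in both classes and contribute nothing either way, while the invisible-text tags have only ever been seen in spam.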
    • Simply adding random text to a message is not enough to get it past SpamAssassin.

      I run SpamAssassin, I know that it catches that stuff.

      The reason it does catch it is because it uses a WEIGHTED system for classification. If the message has the characteristics of spam, but has random words in it, it will still be considered spam UNLESS those random words have been used previously in ham messages that it has learned.

      Now, the odds of the spammer hitting upon words that my version of SpamAssassin has learned
    • Comment removed based on user account deletion
    • SpamProbe can be fooled by clever spammers who insert lots of common words in non-visible html.

      Well it's not that clever, I've configured SA to mark obfuscated mail +20, so it's always caught immediately. The only people using this feeble trick are spammers, so there's no likelihood of a false positive...
  • Comment removed (Score:5, Insightful)

    by account_deleted ( 4530225 ) on Saturday August 23, 2003 @02:02PM (#6773942)
    Comment removed based on user account deletion
    • Re:Mozilla? (Score:2, Insightful)

      by bobintetley ( 643462 )
      Sensible people filter their email at the server and try to waste as little bandwidth as possible.

      Mozilla is no good for this, as you have to download the mail via POP3/IMAP to filter it.

      Don't get me wrong - Moz' spam filter is good at the user level, but you really would want to try and ditch the spam before then (particularly if you run a server for a number of users).
      • Re:Mozilla? (Score:2, Informative)

        by thinkninja ( 606538 )
        Very true. I downloaded 1600 messages with Thunderbird today (backlog) and only about 30 weren't spam. That's a huge waste of bandwidth.
      • Re:Mozilla? (Score:5, Insightful)

        by wilfie ( 622159 ) * <willm.avery@gmail. c o m> on Saturday August 23, 2003 @02:25PM (#6774033) Homepage
        The loss of bandwidth is not the main cost of spam these days. Certainly not internal bandwidth between our mail server and desktops. The excellent features of doing it on my desktop are that the filter is learning about what _I_ consider to be spam and ham, and that I have the stuff that's classified as spam to hand and can check it through once in a while. So far for me it's only thrown false positives when colleagues have sent stuff that was spammy in content. I have a presentiment that our CEO's habit of writing in red HTML (full of ff0000) will cause a false hit one day.
      • Re:Mozilla? (Score:5, Insightful)

        by hdw ( 564237 ) on Saturday August 23, 2003 @02:39PM (#6774101)
        Most people can't filter their email at the server, since most people don't have access to a server to filter at.

        So the majority has to filter locally, either in the client or with a local pop/imap proxy (like PopFile).

        // hdw
    • Mozilla's bayesian spam classification is a direct implementation of the original "Plan for Spam" algorithm. However, it is broken in a number of ways, both in terms of classification and problems with training. I have filed bug reports for all of the problems and some are being worked on. As it stands right now Mozilla performs worse than any other Bayesian filter and probably on par with the 'crippled' SpamAssassin used in this test.
      • Re:Mozilla? (Score:3, Informative)

        by Blain ( 264390 )

        I have been using POPFile for months now, with a fairly complex setup, one of the things I like about POPFile versus the others I've seen (which are two or three bucket systems). It's classifying more than 99% accurately every month for the past three or four months (I reset my statistics around the first of every month) and has never been less than 95% accurate in a month (including its training month). For an idea of what my loads and buckets are like, this list of my buckets and the number of messages

  • OT: Disturbing? (Score:4, Insightful)

    by Lead Butthead ( 321013 ) on Saturday August 23, 2003 @02:08PM (#6773958) Journal
    Does anyone find it disturbing that --

    a. Spam Filter software company is now a "viable business."
    b. Spam Filter is needed AT ALL?
  • Flawed Tests (Score:4, Informative)

    by Plix ( 204304 ) on Saturday August 23, 2003 @02:14PM (#6773984) Homepage
    As was noted earlier, the set of messages given to the filters for learning was terribly small. Furthermore, SpamAssassin wasn't tested in a way useful to most as the tests in this article didn't take into account SA's Bayesian filter nor its network-based tests (Razor, etc).
  • by menscher ( 597856 ) <[menscher+slashdot] [at] [uiuc.edu]> on Saturday August 23, 2003 @02:21PM (#6774016) Homepage Journal
    Spamassassin > v2.50 supports Bayes, right? But TFA seems to imply that it's just heuristic. I'd be interested in seeing how spamassassin improves with a good training set.

    Also, what's with keeping the spam threshold score secret?

    • He mentioned in the article that he disabled the Bayesian capabilities in SpamAssassin because there were already five other Bayesian based tools in the comparison.

      I think he should have at least included a "full powered" spam assassin into the testing.. Which technology is best is an interesting test to perform. But, I'm really only interested in which application to install to kill spam.

      I have been using Spam Assassin for a few months now, and find it to be excellent.

      For my corporate mail, where I
    • by numbski ( 515011 ) * <numbskiNO@SPAMhksilver.net> on Saturday August 23, 2003 @02:36PM (#6774092) Homepage Journal
      Yup. I use it all the time. Save up spam and ham in separate folders. Then do this:

      sa-learn --spam --mbox ~/mail/myspamfolder
      sa-learn --ham --mbox ~/mail/myhamfolder

      As I get more spam, I set it aside into a folder, and in tcsh I have this alias set:

      alias spamadd 'sa-learn --spam --mbox ~/mail/got-through && rm ~/mail/got-through && touch ~/mail/got-through'
      • As I get more spam, I set it aside into a folder, and in tcsh I have this alias set:

        alias spamadd 'sa-learn --spam --mbox ~/mail/got-through && rm ~/mail/got-through && touch ~/mail/got-through'

        In addition to the above, it might be smart to create three files called "ham", "spam" and "forget":

        #!/bin/sh
        # ham
        /usr/bin/sa-learn --ham --no-rebuild --single

        #!/bin/sh
        # spam
        /usr/bin/sa-learn --spam --no-rebuild --single

        #!/bin/sh
        # forget
        /usr/bin/sa-learn --forget --single

        Complement with a c

  • Active Spam Killer (Score:3, Informative)

    by Admiral Llama ( 2826 ) on Saturday August 23, 2003 @02:24PM (#6774031)
    How the heck could Active Spam Killer [paganini.net] be left out? I used to get about 150 spams a day and now I get ZERO. No false positives, no false negatives.
    It is an autoresponder that checks the sender against a whitelist and a blacklist. If a new e-mail is in neither, then it bounces back an e-mail asking for a confirmation that the sender is a human. Simple!
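The whitelist/blacklist/confirmation flow described above can be sketched roughly as follows. This is not Active Spam Killer's code; the function names are hypothetical, and dropping blacklisted mail is an assumption, since the comment only says what happens to senders on neither list.

```python
from enum import Enum


class Action(Enum):
    DELIVER = "deliver"
    DROP = "drop"            # assumption: blacklisted senders are discarded
    CHALLENGE = "challenge"  # hold the mail and ask the sender to confirm


def triage(sender, whitelist, blacklist):
    """Decide what to do with a message based only on its sender address."""
    sender = sender.strip().lower()
    if sender in whitelist:
        return Action.DELIVER
    if sender in blacklist:
        return Action.DROP
    return Action.CHALLENGE


def confirm(sender, whitelist):
    """Called when a challenged sender replies: trust them from now on."""
    whitelist.add(sender.strip().lower())


whitelist = {"friend@example.com"}
blacklist = {"bulk@example.net"}

first_contact = triage("Stranger@Example.org", whitelist, blacklist)
confirm("Stranger@Example.org", whitelist)
after_confirmation = triage("stranger@example.org", whitelist, blacklist)
```

The trade-off raised elsewhere in the thread applies here too: every unknown sender triggers an outgoing message, including forged senders who never wrote to you.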
  • by SuperBanana ( 662181 ) on Saturday August 23, 2003 @02:25PM (#6774038)

    I noticed immediately that the author turned off SpamAssassin's Bayesian filter, claiming "it already has 5 points, that's enough". WTF does that mean? The whole point of SpamAssassin is to do a lot of tests, add the scores together, and then set the threshold you want (something he also doesn't modify; I changed my threshold after looking at the scores spams were getting and such.)

    I trained SA's Bayesian filter off of about 3 years of spam and legitimate email sent directly to me. SA as a whole is working nearly flawlessly; the only messages it has tagged as spam were those from users with improperly configured email clients AND suspicious email addresses AND using only HTML. I.e., a message that would damn well look like spam. However, like I said, I lowered SA's threshold by 2 points because I was having too many false positives (that was before I had properly trained the Bayesian filter, so perhaps I'll kick it up a point now.)

    One important note: when you get a falsely classified message, it's REALLY important to tell SpamAssassin's Bayesian filter about it. It's as easy as cut+paste if you do sa-learn --spam/--ham --single, hit enter, paste the message, hit control D. Done!

    • by Anonymous Coward

      However, like I said, I lowered SA's threshold by 2 points because I was having too many false positives (that was before I had properly trained the Bayesian filter, so perhaps I'll kick it up a point now.)

      I use SpamAssassin with the flag threshold set at 5, the default. I have procmail send any message from 5-10 into a spam mailbox which I clean out occasionally, and messages at 10+ straight to /dev/null (after a couple of months of also keeping those in the spam mailbox).
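The additive scoring and the two-threshold routing described in this subthread can be sketched as below. The rule names and point values are invented for illustration and are not SpamAssassin's real rules; treating a score of exactly 10 as "discard" is also an assumption, since the comment only says "5-10" and "10+".

```python
def total_score(rule_hits):
    """Add up the points contributed by every rule that fired."""
    return sum(rule_hits.values())


def route(score, flag_at=5.0, discard_at=10.0):
    """Map a message score to a destination using two thresholds."""
    if score >= discard_at:
        return "discard"        # the /dev/null case in the comment
    if score >= flag_at:
        return "spam-mailbox"   # kept for occasional manual review
    return "inbox"


# Hypothetical rule hits for one message.
hits = {"HYPOTHETICAL_HTML_ONLY": 2.5, "HYPOTHETICAL_FORGED_SENDER": 3.0}
print(total_score(hits), route(total_score(hits)))
```

Moving `flag_at` up or down is the threshold adjustment discussed above: lowering it catches more spam at the cost of more false positives.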

      Having a properly trained Bay

    • by Sits ( 117492 ) on Saturday August 23, 2003 @04:09PM (#6774472) Homepage Journal
      Here's a quote from the article:
      Also, SpamAssassin has a Bayesian classifier built in, but it wasn't used in these tests, since having five was enough.


      If you reread the slightly ambiguous sentence in context you will realise he meant he had evaluated five Bayesian filters and felt that was enough. Nothing to do with SpamAssassin's point system...
    • In my case, SpamAssassin run at my University's CS department [ira.uka.de] has been working extremely well for me, even better since they updated to use Bayesian filtering. My statistics since 2003-03-05, i.e. for the last 174 days:

      • 3324 True positives
      • 88 False negatives
      • 0 False positives
      • Somewhere around 2500 "True negatives"; though some mailing lists I receive are effectively whitelisted since mails are sorted in their respective IMAP folders by their mailing list affiliation before being filtered by their Spam stat
  • What? No PopFile? (Score:4, Interesting)

    by MrEnigma ( 194020 ) on Saturday August 23, 2003 @02:28PM (#6774054) Homepage
    They started off by quoting John Graham-Cumming, yet they didn't include his brainchild PopFile.

    Check it out Here [net.com].
  • What About PopFile (Score:5, Informative)

    by MBCook ( 132727 ) <foobarsoft@foobarsoft.com> on Saturday August 23, 2003 @02:29PM (#6774061) Homepage
    What about PopFile? I've tried SpamAssassin and a few others, and I like PopFile the best. After a little training it's EXTREMELY accurate. It survived the deluge of mail I've gotten in the last few days (due to viruses) with flying colors.

    According to its internal statistics, it has classified 2821 messages as of the time I type this. It has made only 95 errors (often close calls, so I don't blame it). That puts it at an accuracy of 96.63%. For the record, of the e-mail I've gotten, it's 308 messages of ham, 2513 spam.

    I have only been using PopFile since June 7th of this year, but it's working fantastic. The only thing I've used that's this good was Cloudmark's SpamNet, who stabbed the community in the back, so I switched to something else. I'm glad I've found PopFile, and I suggest you try it too if you're looking for something good.
    • by jedrek ( 79264 )
      I use PopFile as well and am equally satisfied. I make sure to reclassify all false negatives and positives. Accuracy is at 97.65%, I've gotten 2,802 spams for 5,432 mails I've gotten since I installed it.

      When me and my friend had a site featured on Yahoo, USA Today, NYT, etc. the spam just went THROUGH THE ROOF. But, thanks to PopFile I didn't have to see any of it.
    • Messages classified: 6,116
      Classification errors: 88

      ---

      Accuracy: 98.56%

      And THAT is with 8, yes EIGHT, different buckets for sorting my mail. Of course 79% of my mail is spam so :)
    • Messages classified: 3,545
      Classification errors: 110

      Accuracy: 96.89%

      This is with 4 buckets. My spam bucket received 2,561 ( 72.24%) of those e-mails, with 7 false positives and 9 false negatives.

      Oh yeah, POPFile is cross-platform...Windows, Linux, anything that will run Perl (Windows users, don't be afraid. The installer installs an interpreter for you - you'll never know it's there!)
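The accuracy figures quoted in these POPFile reports follow directly from the two numbers each poster gives, messages classified and classification errors. A small sketch of the arithmetic, using only the counts posted above (the dictionary keys are just labels for this example):

```python
def accuracy_percent(classified, errors):
    """Percentage of messages that were classified correctly."""
    return 100.0 * (classified - errors) / classified


# (messages classified, classification errors) as posted in the thread.
reports = {
    "2821 messages, 95 errors": (2821, 95),
    "6116 messages, 88 errors": (6116, 88),
    "3545 messages, 110 errors": (3545, 110),
}

for label, (classified, errors) in reports.items():
    print(f"{label}: {accuracy_percent(classified, errors):.3f}%")
```

The results agree with the posted 96.63%, 98.56% and 96.89% to within 0.01 of a percentage point. Note that with roughly three quarters of the mail being spam, a single accuracy number says little about the false positive rate on its own.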
  • PSAM (Score:5, Informative)

    by po8 ( 187055 ) on Saturday August 23, 2003 @02:29PM (#6774063)

    See our PSAM [pdx.edu] project site for a refereed paper evaluating several machine learning spam filtering techniques (although not specific filters). This site also contains large standardized corpora for evaluation. The paper contains a number of tips on evaluating ML spam filters.

    The /.-referenced article has some good ideas about evaluation. I particularly liked the explicit discussion of the false positives. The recommendations at the end are excellent. On the other hand, the evaluation isn't across a broad or obviously representative corpus, many of the tests are a bit odd, the ROC tradeoffs are not discussed. In particular, the evaluation set for the tests did not include enough ham to be able to accurately estimate the false positive rate: consider what would happen to the precision estimates if 0.5 were added to each of the numbers in the false positive table.

    Overall, though, this was an interesting evaluation, and I'm glad that the author published it.
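The thought experiment in the comment above (adding 0.5 to each false positive count) can be tried with a few lines of arithmetic. The 200 ham messages is the figure arth1 derives earlier in the thread for the largest experiment; the true positive count and the false positive counts below are hypothetical placeholders, not numbers from the article.

```python
def false_positive_rate(false_positives, ham_total, pseudo_count=0.0):
    """Fraction of ham wrongly flagged, optionally with a pseudo-count added."""
    return (false_positives + pseudo_count) / ham_total


def precision(true_positives, false_positives, pseudo_count=0.0):
    """Fraction of flagged messages that really were spam."""
    adjusted_fp = false_positives + pseudo_count
    return true_positives / (true_positives + adjusted_fp)


HAM_TOTAL = 200        # ham in the evaluation set, per the thread
TRUE_POSITIVES = 1000  # hypothetical, for illustration only

for fp in (0, 1, 3):   # hypothetical observed false positive counts
    raw = false_positive_rate(fp, HAM_TOTAL)
    adjusted = false_positive_rate(fp, HAM_TOTAL, pseudo_count=0.5)
    print(f"fp={fp}: rate {raw:.4f} -> {adjusted:.4f}, "
          f"precision {precision(TRUE_POSITIVES, fp):.4f} -> "
          f"{precision(TRUE_POSITIVES, fp, 0.5):.4f}")
```

With so little ham, half a message moves an observed rate of zero to 0.25% and raises a rate based on one false positive by half, which is the sense in which the sample cannot pin the false positive rate down.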

  • by Tablizer ( 95088 ) on Saturday August 23, 2003 @02:37PM (#6774095) Journal
    That's right! Our company has found a high-tech way to use various anti-spam tools to enlarge your penis. My pennis is noww sso lrage that i Cannnot type curretcly. Itt gtes in teh way.

    Please visit www.spamfilters2enlarge.com

    Act before midnight and get a $30 discount.
  • by bigberk ( 547360 ) <bigberk@users.pc9.org> on Saturday August 23, 2003 @02:42PM (#6774118)
    If you decide to try out spamprobe or another bayesian filter, try this web interface [pc-utils.com] which lets you easily reclassify mail, even those marked as spam. I found that "training" the bayesian filters was the hardest part; this definitely simplifies the process.
    • Why not simply use POPFile [sourceforge.net]? It has a very nice web interface that makes it very easy to reclassify false positives or false negatives. It also supports multiple email categories; unlike most other Bayesian filters, you can filter email as more than just "spam" or "not-spam". I myself have five classification groups: spam, mail from my university, two mailing lists I'm on but don't read actively, and "real" email. It works very well in concert with Mozilla, and my email is automatically directed to the appro
  • Off topic but... (Score:2, Informative)

    by CGP314 ( 672613 )
    It wasn't mentioned in the article, but I really must plug popfile. [sourceforge.net] It filters out my spam yes, but it is also a general mail categorizer. It sorts ten yahoo groups for me, personal, work, and school related emails. I know you think you could do this with rules for the emails, but for example, I get several hundred emails a day from the Harry Potter for Grownups List. [yahoo.com] Popfile can sort them into 'probably interesting' and 'probably not' for me. Very nice.
    • I use POPFile, and I really like it, but it's not quite ideal. Its very best feature is its configurable buckets, which aren't just limited to ham and spam. I have buckets for "personal," "mailing list," "automatic," and "spam." One could get even more creative, with something like an "interesting" bucket.

      What I really want is something with a more generic interface. POPFile's POP3 proxy and webserver interface mostly limits it to email. I'm thinking of starting a project to make a generic text-clas

  • by pongo000 ( 97357 ) on Saturday August 23, 2003 @02:46PM (#6774137)
    An interesting thread here [libertine.org] about how TMDA [tmda.net], a C/R filter, used in conjunction with SpamAssassin, can provide the best of both worlds. While TMDA is by itself effective, there seem to be some humanistic issues involving the assumption that all e-mailers are spammers unless they prove otherwise. The thread explains how Bayesian filtering can be improved by using a decent C/R filter like TMDA without alienating people that send legitimate e-mail.

    Personally, I figure anyone thin-skinned enough to be insulted by my C/R filter probably isn't worth talking to anyways, but I digress...

  • The quickest way to stop spam in the U.S. would be to have a respected person such as the Surgeon General of the United States say that

    1) There is no way to increase the size of your body parts,

    2) The cheap Viagra is not Viagra,

    3) and so on.

    We can help by telling everyone we know not to buy anything from spam. Next time you are at a party or family gathering, make that point.

    Spam would disappear if there were no buyers. We need to make it culturally unacceptable to buy anything that is adv
    • Nice idea in theory. Unfortunately I suspect it would have even less effect on the spam situation than the "Cigarettes may damage your health" warnings on cigarette packs. Let's face it, given the rate of reduction in smoking when your health is at risk, perhaps even your life as a result of Surgeon General warnings, what effect do you think this is going to have on the typical male with adequacy issues?
    • > The quickest way to stop spam [...] say that [...] 1) There is no way to increase the size of your body parts, 2) The cheap Viagra is not Viagra,

      Unfortunately, you risk that people just remember "cheap viagra" and "increase the size", with the opposite effect as a result.

      In the Netherlands, there is, or maybe was, an urban legend that a big tea brand will donate a wheelchair to whoever gathers one million tea bag labels of that brand. Presumably, the tea brand tried informing the world through advertisemen

    • Never overestimate stupid people.

  • by Anonymous Coward on Saturday August 23, 2003 @02:58PM (#6774187)
    As a professional sender of UCE, I just want to tell you slashdotters to keep on playing with your spam filters. As long as you use spam filters on your e-mail, I can continue to reach my real intended targets, those non-slashdotters who do not know better and will buy my products or click through to my client's websites. Your filters really help cut down on the complaints to the internet service providers I do business with, and as long as not too many complaints come in their marketing people assure me we can do business. Of course, I still waste your bandwidth and mailbox capacity, but you no longer complain to uce@ftc.gov, my access providers, or anyone else who might cause me problems. My yahoo and hotmail and other accounts for replies are lasting much longer before getting shut down because someone complained to these service providers. And my clients are even reporting that they can start mailing out 800 numbers like 1-800-901-3719 again and they will not have you damn spammers set up their modems to keep autodialing them, since you spend your own time and effort to filter the e-mail and only clueless users who might actually call see the numbers.

    Please don't bother your Congressmen or Senators proposing legislation that might not work 100%. Just keep on filtering the spam I send you, I know you would have never bought from me anyway. That you can filter legitimizes my business and my waste of your bandwidth.

    P.S. To be sure of not getting a false positive , be sure to send all filtered mail to a special folder. Waste your storage space storing the mail until you manually go through every piece to be sure you didn't accidentally filter something important. Of course, this will take exactly as much effort as it would have to just check the e-mail when it first came in, not to mention the extra effort spent in setting up the filters and the extra space for storing your incoming spam folder, but what the heck. You geeks enjoy wasting time this way, and I certainly appreciate it. It makes the work of all us spammers much easier.

    • My time and the time of 100,000 users is not.

      And since the stuff like the spam filters are getting pretty generic, they can be configured and replicated to numpty users reducing spamming effectiveness by several orders of magnitude.

      Poor attempt at irony BTW.

    • This might be considered interesting, but I think it is really just a troll.

      However, one interesting point that trollboy makes, is that the 1-800 numbers end up in the spam, and we don't see them: why not modify the filter so it automagically pulls out all such numbers from the spam, so that they can be easily on hand for those people who want to set up autodialers? In a way this is poetic justice, being analogous to the way the scumbag spammers harvest email addresses from web pages. So yet again, the cl
      • why not modify the filter so it automagically pulls out all such numbers from the spam, so that they can be easily on hand for those people who want to set up autodialers

        Because this is unthinking vigilantism, real pitchforks and torches stuff, and spammers will just use your wrath to launch joe jobs against anti-spam companies and individuals.
    • So, you're saying that filtering spam helps spammers? I don't buy it. For one, it makes it much easier to complain, if you're so inclined. Filtering spam doesn't legitimize it any more than locking your house legitimizes stealing. Spam filtering also has the effect of minimizing the unbridled rage spam causes, which will cut down on reactionary legislation. I think that's a good thing, because governments in general don't understand the Internet but aren't afraid to meddle with it.
  • by BrookHarty ( 9119 ) on Saturday August 23, 2003 @02:59PM (#6774194) Journal
    Don't know why we didn't see Mozilla's filters (maybe that's covered under Bayesian filters?)

    I'm using the standalone Thunderbird and it catches everything that passes by Spamassassin. Spam is marked but never deleted, so I can go back and check. Some spam programs will delete email, which could delete a good email, unacceptable.

    Basically, I'm using a mandrake linux box, imap, procmail, fetchmail and spamassassin. Easy, and I can send/receive email from my linux box, and port 25 is blocked from the Net so nobody can use me as a bouncer.

    Only problem I had was, there was no complete document to set this up, I had to piece each part together.

    So for anyone who wants to know, here are the quick steps.

    1. I'm using mandrake, but had to update SA for the sa-learn utils. (Gotta train SpamAssassin)
    2. Set up fetchmail in your personal account.
    3. Set up .procmailrc in your home dir

    DROPPRIVS=YES
    VERBOSE=ON
    LOGFILE=/home/useraccount/procmail.log

    :0fw
    | /usr/bin/spamc
    4. Set up your user_prefs in your local directory for SA. (This is mine; I'm no SA expert, but it works.)
    required_hits 5
    rewrite_subject 0
    use_terse_report 1
    report_safe 1
    use_bayes 1
    auto_learn 1
    ok_locales en
    use_pyzor 1
    pyzor_max 9
    pyzor_add_header 1
    use_razor2 1
    always_add_headers 1
    always_add_report 1
    spam_level_stars 1
    skip_rbl_checks 0
    #timelog_path /home/useraccount/.spamassassin/timelog

    5. As root, make sure IMAP and SpamAssassin are running.
    6. Load Thunderbird, use IMAP, and set up filters on the X- headers.

  • Anyone care to point out a decent way to use SA's bayesian filter with this setup:

    I have a Linux box running as my web/mail server that has SpamAssassin on it for anyone who wants to use it (set up .forward and .procmailrc to do this). I'm currently deleting spam (score = 5)

    The problem is how to get spam and ham from Outlook back to the Linux box correctly. To my knowledge, Outlook doesn't export mail in any way that's readable by the sa-learn script. I'd like to set up a bayesian filter, but it seems like
  • by RNLockwood ( 224353 ) on Saturday August 23, 2003 @03:21PM (#6774286) Homepage
    I use SpamBayes (free) with Outlook on my W2K machine. I trained it with over 400 SPAM and over 1000 non-SPAM emails. I get about 45 SPAM each day and my ISP, attglobal, filters out about 40 of them. The SPAM that gets to my mailbox are the ones that pass through the attglobal filter and that filter has NEVER given me a false positive for more than 2000 SPAM. Those SPAM are put in a special folder on the server for inspection but I now just delete them en masse every week or so.

    That means that SpamBayes is filtering only the hardest emails to classify and so far it has only given me one false positive. I got one false negative after training it for the first time. SpamBayes also has a folder for messages that it is not sure of and so far they have all been SPAM. I seldom have to do more than inspect the sender and subject to confirm that they are SPAM.

    Each time a message is automatically moved to the SPAM folder (or moved back to the Incoming folder) the training set is adjusted for that email so I don't have to re-train.

    To sum up, I'm really impressed by well-designed Bayesian filters and this one in particular. I think it's worthwhile to take the time to build up a corpus of SPAM and "good" messages as I can then evaluate competing filters.

  • by herrd0kt0r ( 585718 ) on Saturday August 23, 2003 @03:38PM (#6774345)
    since the filters do better after being trained with lots of spam, anyone think of gathering up a huge collection of spam to give to other people? i mean exporting a corpus of spam from outlook, sticking it up for download somewhere, and letting other people import it into a spam folder. then other people could run their filter of choice and train it!

    you could even make it all official-like, and somehow guarantee that the spam that's up for downloading is "official" and "virus-free" and "safe for your computer." you know, do geek stuff like check hashes or whatever it takes to verify that the spam collection is legit. whatever it takes to ensure that someone else hasn't filled it with a ton of virus/trojan/etc. attachments. or whatever. i dunno. you know, somehow guarantee it's safe.

    imagine it! download spambayes, get spambayes to connect to the official spambayes spamcorpus server, and download the latest 2000 spams! instant training.
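
    The "check hashes" step could look roughly like the Python sketch below. It assumes the corpus maintainer publishes a SHA-256 hex digest alongside the download; the function names and the throwaway demo file are made up for illustration, and no such official corpus server is known to exist.

```python
import hashlib
import os
import tempfile


def sha256_of_file(path, chunk_size=65536):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def corpus_is_intact(path, published_hash):
    """True if the downloaded file matches the hash the maintainer published."""
    return sha256_of_file(path) == published_hash.strip().lower()


# Demo: a throwaway file stands in for a downloaded corpus archive.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"From: spammer@example.com\n\nBuy now!\n")
    demo_path = tmp.name

# In real use this value would come from the corpus maintainer, not from the file itself.
published = sha256_of_file(demo_path)
print(corpus_is_intact(demo_path, published))
os.remove(demo_path)
```

    A matching hash only shows the file is the one that was published; it says nothing about whether the published corpus itself is free of hostile attachments.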

    anyway. just an idea. mod me down as -1, herrd0kt0r. 8P

  • by Stavr0 ( 35032 ) on Saturday August 23, 2003 @03:48PM (#6774381) Homepage Journal
    Ratings - Spam-blocking software [consumerreports.org]

    SAProxy for Windows [bloomba.com] (Based on SpamAssassin) got the highest marks.

  • i use apple's mail.app with bayesian filtering. i have received maybe 4 or 5 true spam emails in over a year. i haven't yet missed any real emails either. i would have to say that's pretty good. otoh, our groupwise system at work is fscking horrible. i get tons of fscking spam. i have had to set dozens of rules, and it still doesn't matter.
  • Some comments (Score:3, Interesting)

    by zaad ( 255863 ) on Saturday August 23, 2003 @04:11PM (#6774489)
    I'm not disagreeing with the posters who said his sample size is low. That might be one reason he doesn't get a higher catch (recall) rate.

    The main problem I see with bayesian filters is that they are complicated and nontrivial to set up. I've been playing with Bogofilter for several months. And even with sub 1000 corpuses, I get a very high catch rate (greater than 90-some %, though I don't have exact numbers).

    The method that I've employed is to start with a small set of three hundred or so ham and spam messages, then to train on error over time. It's a pain in the ass because I still have to continually inspect the results and tweak the databases.
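
    For anyone unfamiliar with "train on error", here is a minimal Python sketch of the idea. ToyWordFilter is a made-up stand-in that just counts words; it is not how bogofilter scores mail, and the sample messages are invented. The point is only the loop: a message is fed back to the filter when, and only when, the filter got it wrong.

```python
from collections import Counter


class ToyWordFilter:
    """Hypothetical stand-in classifier: compares word counts seen in spam vs ham."""

    def __init__(self):
        self.counts = {"spam": Counter(), "ham": Counter()}

    def train(self, text, label):
        self.counts[label].update(text.lower().split())

    def classify(self, text):
        words = text.lower().split()
        spam_score = sum(self.counts["spam"][w] for w in words)
        ham_score = sum(self.counts["ham"][w] for w in words)
        # Ties fall through to ham, so unknown mail is never flagged.
        return "spam" if spam_score > ham_score else "ham"


def train_on_error(filt, labelled_messages):
    """Train only on messages the filter misclassifies; return how many that was."""
    corrections = 0
    for text, label in labelled_messages:
        if filt.classify(text) != label:
            filt.train(text, label)
            corrections += 1
    return corrections


# Small initial training set, then corrections as mail arrives.
filt = ToyWordFilter()
filt.train("cheap pills buy now", "spam")
filt.train("meeting notes attached", "ham")

incoming = [
    ("buy cheap watches now", "spam"),
    ("lunch meeting tomorrow", "ham"),
    ("free watches offer", "spam"),
]
print(train_on_error(filt, incoming))
```

    The labels in the incoming list are the part that costs time in practice: someone still has to look at the results and say which calls were wrong.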

    In addition to that, there are at least a half a dozen parameters that contribute to the success or error rates. So much so that bogofilter actually comes with bogotune to analyze the corpuses to suggest optimal parameters.

    So give the guy a break. I wouldn't say his results are robust enough for an academic publication, but it isn't worthless. It's interesting enough for a read. It's more work than many of us are willing to do.

    Also an interesting read is Comparing Bayes Chain Rule with Fisher's Method for Combining Probabilities [www.bgl.nu].
    • So much so that bogofilter actually comes with bogotune to analyze the corpuses to suggest optimal parameters.

      Correct me if I'm wrong but I think bogotune has nothing to do with success/error rates. It deals with the Berkeley DB backend for speed.
  • I've got a related question that doesn't rate "Ask Slashdot" status, so I'll ask it here...

    I use IMAP to read my mail, mostly because that makes it easy to read from both work and home, and occasionally when I'm on the road. Right now I'm using the bayesian filter in Mozilla. It's great, but since it's client-based that means I have three separate filters I need to train. Sometimes I'll run into weird problems where two of the filters think an email is good but the third thinks it's spam. If I accident

  • Surely this article should have been written by Spam Holden?
  • by Henry Stern ( 30869 ) <henry@stern.ca> on Saturday August 23, 2003 @04:59PM (#6774730) Homepage
    Sam's article was a very interesting read, but his results need to be taken with a grain of salt.

    To show that one piece of software outperforms another, you need to prove statistical significance. This can be done in two ways:

    The first method is called the pairwise t-test. What you need to do is to run k tests using different training and test data. For each of these tests, you find the accuracy of the classifier (#success/#trials). Then, you form the "t-statistic," t = d/sqrt(sigma_d^2 / k), where d is the difference of the means of the two classifiers, sigma_d^2 is the variance of the difference samples and k is the number of samples. Then, you compare your t-statistic to the Student's distribution with k-1 degrees of freedom. Typically, you want a confidence level of 90% or 95%, so you look up the critical value for that t-test (e.g. the 90% value with 9 degrees of freedom is 1.38). If your t-statistic is greater than the critical value, then the difference between the two classifiers is statistically significant with X% confidence. Read more about this in Witten and Frank's Data Mining book.
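
    As a rough illustration, the t-statistic above can be computed in a few lines of Python. The accuracy figures below are invented for the example (they are not from Sam's article), and the 1.38 threshold is the 9-degrees-of-freedom value quoted above; for other values of k you would look up the matching entry in a t-table.

```python
import math


def paired_t_statistic(acc_a, acc_b):
    """t = d / sqrt(sigma_d^2 / k) over k paired accuracy measurements."""
    if len(acc_a) != len(acc_b):
        raise ValueError("both classifiers need the same number of runs")
    k = len(acc_a)
    if k < 2:
        raise ValueError("need at least two paired runs")
    diffs = [a - b for a, b in zip(acc_a, acc_b)]
    d_mean = sum(diffs) / k
    # Sample variance of the per-run differences (k - 1 in the denominator).
    var_d = sum((d - d_mean) ** 2 for d in diffs) / (k - 1)
    if var_d == 0:
        raise ValueError("differences are identical; t is undefined")
    return d_mean / math.sqrt(var_d / k)


# Hypothetical accuracies for two filters over the same ten test sets.
filter_a = [0.95, 0.96, 0.94, 0.97, 0.95, 0.96, 0.93, 0.95, 0.96, 0.94]
filter_b = [0.93, 0.95, 0.94, 0.94, 0.92, 0.95, 0.93, 0.93, 0.94, 0.93]

CRITICAL_90_DF9 = 1.38  # value quoted in the text for 9 degrees of freedom
t = paired_t_statistic(filter_a, filter_b)
print("t =", round(t, 2), "significant:", t > CRITICAL_90_DF9)
```

    The pairing matters: both filters have to be scored on the same k test sets, so that each difference compares like with like.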

    The other method is called Analysis of Variance (ANOVA). I'm not familiar enough with this method to explain it here, but it allows you to choose from a set of experiments which ones really are above the average. Dig around in your statistics books or on the web for more information.

    Sam should have made use of either of these techniques when doing his analysis. Since he only ran one experiment per configuration of his classifier, you can draw no real conclusions from the data presented (it's a Student's distribution with 0 degrees of freedom... essentially flat!).

    Since most of us only have a small number of corpora kicking around (maybe even only one!), you can use a method called "cross validation" to give yourself a larger number of data sets than you actually have. When doing a cross validation, you divide your corpus up into k "folds" and then perform k experiments. In each experiment, you set aside one fold of your data for testing and train on the other k-1 folds. Since you're using different test data each time, each experiment can be considered to be different and then you can use a pairwise t-test to prove statistical significance. There are other methods that you can use such as "leave one out" where you have as many folds as you do pieces of training data and "bootstrapping" where you sample your training data with replacement and test with whatever wasn't sampled for training.
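
    A minimal sketch of the fold bookkeeping in Python, assuming the corpus is just a list of messages. The function name is made up, and a real run would normally shuffle the corpus first and then train and score a filter on each (train, test) pair.

```python
def k_fold_splits(messages, k):
    """Yield (train, test) pairs; every message lands in exactly one test fold."""
    if not 2 <= k <= len(messages):
        raise ValueError("k must be between 2 and the number of messages")
    # Deal messages round-robin into k folds of near-equal size.
    folds = [messages[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [m for j, fold in enumerate(folds) if j != i for m in fold]
        yield train, test


# Placeholder corpus of ten message IDs, split into five folds.
corpus = ["msg%d" % i for i in range(10)]
for train, test in k_fold_splits(corpus, 5):
    print(len(train), "training /", len(test), "test:", test)
```

    Setting k equal to the number of messages gives the "leave one out" variant mentioned above.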

    However, cross validation may not be appropriate for incremental learning algorithms if your data is on a timeline (such as e-mail). You can break your corpus up into pieces and do your evaluation on that.
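
    One way to read "break your corpus up into pieces" for time-ordered mail is sketched below in Python: cut the corpus into consecutive chunks, always train on the older chunks, and test on the next one. This is an assumption about what such an evaluation could look like, not a method from the article, and the function name is invented.

```python
def chronological_splits(messages, pieces):
    """Yield (train, test) pairs where training mail is always older than test mail.

    `messages` must already be sorted oldest-first.
    """
    if not 2 <= pieces <= len(messages):
        raise ValueError("pieces must be between 2 and the number of messages")
    size = len(messages) // pieces
    # Chunk boundaries; the last chunk absorbs any remainder.
    bounds = [i * size for i in range(pieces)] + [len(messages)]
    for i in range(1, pieces):
        train = messages[:bounds[i]]
        test = messages[bounds[i]:bounds[i + 1]]
        yield train, test


# Placeholder corpus: integers stand in for messages in arrival order.
corpus = list(range(10))
for train, test in chronological_splits(corpus, 5):
    print("train on", train, "-> test on", test)
```

    With this scheme the filter never sees mail "from the future" during training, which is closer to how an incrementally trained filter is used day to day.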

    Proving statistical significance is very easy and allows you to be confident in the conclusions that you make in your publications. It's the scientific method!

    Good luck!

    Henry
    • I did a ten fold cross validation.

      I even did some stats stuff and found that there was a significant performance difference between some of the filters - but I don't trust my stats knowledge enough to publish such things without getting them checked. Since I didn't get them checked, I didn't include them.

      If the article was meant for a machine learning journal then obviously it's a joke. But it wasn't; it was meant for freshmeat, where the requirements are much lower.
  • Does anyone know if it is possible to overtrain a bayesian spam-filter? It would seem that this could potentially be a problem...

    /joeyo

  • by Stinky Cheese Man ( 548499 ) on Saturday August 23, 2003 @05:40PM (#6774944)
    I use bogofilter, and it seems to me it would take far too much of my time to manually feed my own spam to it for training purposes. What I do instead is this:

    We have several spamtrap addresses on our sendmail server. They were not intentionally set up as spamtraps, but in looking at my mail logs I noticed that there were many email addresses receiving spam attempts that are not and never were valid addresses on our system. These invalid addresses somehow got into spammers' email databases and they receive nothing but spam.

    So I set up entries in my aliases file to automatically redirect all mail for these accounts to bogofilter's spam database. Here is a sample...

    nikola: "|/usr/local/bin/bogofilter -s "
    cal: "|/usr/local/bin/bogofilter -s "
    bwilson: "|/usr/local/bin/bogofilter -s "
    fayre: "|/usr/local/bin/bogofilter -s "

    (If you are also using sendmail's access.db to filter mail based on the source IP address, you may want to set up the spamtrap addresses as "spam friends" so that spam directed to them is not filtered out by your IP address filters.)

    To keep the spam database fresh and to keep it from growing to an excessive size, I use a daily cron job that automatically deletes spam entries older than 30 days...

    # remove records older than 30 days from spamlist.db
    /usr/local/bin/bogoutil -a30 -m /home/bogofilter/spamlist.db

    This gives me an 8 Megabyte spamlist.db with about 14,000 emails in it which is constantly refreshed to keep up with the latest spam trends.

    Maintaining the non-spam database isn't quite as easy. I use bogofilter's -u option on my own incoming email, which tells Bogofilter to update its databases with my incoming mail based on its classification of the message as spam or non-spam. I never get a false positive, but I do occasionally get a false negative which requires me to make a correcting entry in the database.
  • So the results aren't quite up to date. I've trained it on a couple months of spam and non-spam and it seems to significantly improve its classification.
