Spam

Working Bayesian Mail Filter 313

zonker writes "A real, working, honest-to-god Bayesian spam filter. I've been waiting for something like this for a while (since I first read Paul Graham's research paper on this very topic a few weeks ago). Well, here's POPFile, a small but extremely effective Perl script that runs on just about any system Perl does. After just a little training I was able to get very effective filtering out of it. From what I understand the new email client that comes with OS X Jaguar has a feature similar to this, but I don't know if it is true Bayesian. Hopefully this kind of feature will become more prevalent in client software, as I see the Google results for it are growing."

  • What's that? (Score:2, Interesting)

    by cos(0) ( 455098 ) <pmw+slashdot@qnan.org> on Sunday November 03, 2002 @02:08PM (#4589004) Homepage
    Would anyone care to explain what a "Bayesian" mail filter is?
  • by Quixote ( 154172 ) on Sunday November 03, 2002 @02:12PM (#4589046) Homepage Journal
    Any server-side solutions (MTA==qmail, MDA==procmail) using this (Naive-Bayesian) technique out there?
  • by AT ( 21754 ) on Sunday November 03, 2002 @02:14PM (#4589055)
    The mozilla mail client is getting a Bayesian mail filter, too. See http://bugzilla.mozilla.org/show_bug.cgi?id=163188 . Unfortunately, it probably won't show up until after version 1.2 is released.
  • by davids-world.com ( 551216 ) on Sunday November 03, 2002 @02:16PM (#4589062) Homepage
    A true Bayesian filter, wow. Let's face it, statistical classifiers based on Bayes' formula are not really state of the art. They make false assumptions about the data (independence of features).

    More intelligent classification algorithms can solve non-linear problems far better. Check out Kernel Machines [kernel-machines.org] and, somewhat older, Maximum Entropy models.

    Enough nerd talk for today :-)
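
For reference, the independence assumption mentioned above can be written out. This is the textbook naive Bayes form, with S standing for "the message is spam" and w_1 ... w_n for the words observed in it; it is not necessarily the exact formula POPFile or Paul Graham's filter uses.

```latex
P(S \mid w_1, \dots, w_n)
  = \frac{P(S)\,\prod_{i=1}^{n} P(w_i \mid S)}
         {P(S)\,\prod_{i=1}^{n} P(w_i \mid S) \;+\; P(\neg S)\,\prod_{i=1}^{n} P(w_i \mid \neg S)}
```

The products are where the "naive" part comes in: each word is treated as independent of every other word once the class is known, which is clearly false for natural language but keeps training down to simple counting.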

  • by hfastedge ( 542013 ) on Sunday November 03, 2002 @02:26PM (#4589133) Homepage Journal
    I don't know if it is true Bayesian


    You know, on this issue, you really depress me. You are clearly not of the academic nature, so your stance toward something that's probably way above your head really frustrates part of me.

    As long as you're not developing the idea, it shouldn't matter how it works as long as it works.

    I read the original article here as you did too. After all the mumbo jumbo about learning, I picked out one effective tip from the article on filtering my email: filter out HTML.

    With 1 line of regex I eliminate 95% of my spam:
    match and throw it out.
  • by bmwm3nut ( 556681 ) on Sunday November 03, 2002 @02:26PM (#4589138)
    that's the beauty of this approach. the filter learns all the time (or at least you can set it up that way). so if spammers get smart, it doesn't take long until the filter adjusts. what i'd love to see is this filter built into a mail client where you have two buttons for delete. one, just to delete the mail, the other to delete it and mark it as spam. when you press that button the filter would scan the email and update its rules.
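
A minimal sketch of the "learns all the time" idea, assuming a plain word-count naive Bayes model. The class name `WordCountFilter`, the tokenizer, and the add-one smoothing are all illustrative choices, not how POPFile is actually implemented. A "delete as spam" button would simply call `train(text, "spam")`.

```python
import math
import re
from collections import Counter


class WordCountFilter:
    """Toy naive Bayes filter that updates its counts one message at a time."""

    def __init__(self):
        # Per class: number of messages in which each word appeared
        self.words = {"spam": Counter(), "ham": Counter()}
        self.messages = {"spam": 0, "ham": 0}

    @staticmethod
    def tokenize(text):
        return re.findall(r"[a-z0-9$'-]+", text.lower())

    def train(self, text, label):
        """Record one message under label 'spam' or 'ham'."""
        self.messages[label] += 1
        self.words[label].update(set(self.tokenize(text)))

    def spam_probability(self, text):
        """Return P(spam | words) under the naive independence assumption."""
        # Prior log-odds, smoothed so an untrained filter starts at 0.5
        log_odds = math.log((self.messages["spam"] + 1) / (self.messages["ham"] + 1))
        for word in set(self.tokenize(text)):
            # Add-one smoothing keeps unseen words from producing log(0)
            p_spam = (self.words["spam"][word] + 1) / (self.messages["spam"] + 2)
            p_ham = (self.words["ham"][word] + 1) / (self.messages["ham"] + 2)
            log_odds += math.log(p_spam / p_ham)
        # Numerically stable logistic function
        if log_odds >= 0:
            return 1 / (1 + math.exp(-log_odds))
        odds = math.exp(log_odds)
        return odds / (1 + odds)
```

Because training only increments counters, marking a single message as spam takes effect on the very next message scored; there is no separate retraining step.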
  • by cmeans ( 81143 ) <chris.a.meansNO@SPAMgmail.com> on Sunday November 03, 2002 @02:33PM (#4589174) Journal
    James [slashdot.org] is a 100% Java Email server (SMTP, POP3, NNTP, and IMAP soon) that supports mail-server extensions via the Mailet API [apache.org]. I developed a Java implementation of the Bayesian rules discussed, so that they could be used in any configuration, but also provided a mailet-wrapped implementation so that the filtering (or flagging) could be done at the server side.

  • by Saint Aardvark ( 159009 ) on Sunday November 03, 2002 @02:34PM (#4589180) Homepage Journal
    A version of this for Outlook Express.

    I work on the helpdesk of a small ISP; I also take care of the spam filtering, and answer abuse@. We recently added SpamAssassin, and God does it rock [dowco.com]. (The big spike you see is me getting MRTG to graph what SA catches now; it's 6-10 times better than what we used to catch.)

    But I still get complaints from our customers about spam that gets through. Just the other day a crapload got through because it was relatively subdued spam (no webbugs, NO LINE OF YELLING, etc); unfortunately, it also advertised pictures of young boys having sex. It's hard to explain why it's very, very hard to filter for this sort of thing, especially when I'm going through the talk for the nth time this week. (I need a good analogy that non-geeks can understand; I'm still looking.)

    The good folks at DeerSoft [deersoft.com] have a version of SpamAssassin for Outlook, and are promising one for OE Real Soon Now. But I would loooooooooooooooooooooooove a good spam program -- this or SA or something else -- that I could point our customers to. Download, double-click, say yes, and bam it's installed. I can figure out how to install this on a Unix box; I could probably, eventually figure out how to do it on a Windows box; there's no way the customers could do it.

    Or am I missing good, free spam filtering for Windows? Can anyone point me in the right direction?

    Slightly OT: There has got to be a huge market for setting up spam filtering for small businesses. My idea: Tell 'em that if they provide the box -- an old Pentium or 486 will do -- I'll set up spam filtering and a firewall on it, set up some maintenance tools (whitelist this, firewall that). They get great mail service, I get $x00.

  • by koreth ( 409849 ) on Sunday November 03, 2002 @02:44PM (#4589239)
    I've been using SpamProbe [sourceforge.net] (which gets invoked from procmail) with excellent results.
  • Staged Categories (Score:2, Interesting)

    by irritating environme ( 529534 ) on Sunday November 03, 2002 @03:05PM (#4589348)
    An advertised false positive rate of 0% is nice, but why not do additional research into the spam, attempting to categorize it as blatant spam, probable spam, borderline, or non-spam, and see whether false positives can be pushed into the borderline category?

    Also, from what I saw in the article, there will already be a next level that spam can take: image-based messages, misspellings of key words (klik, Clic, Clik, etc), using 0xfe0000 for almost-bright-red.
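
The staged categories suggested above could be sketched as a simple thresholding step on whatever probability the filter produces. The function name `stage` and the three cutoff values are hypothetical; sensible cutoffs would have to be tuned on real mail.

```python
def stage(spam_probability, blatant=0.99, probable=0.9, borderline=0.5):
    """Map a spam probability in [0, 1] to one of four staged categories.

    The default thresholds are illustrative assumptions, not measured values.
    """
    if not 0.0 <= spam_probability <= 1.0:
        raise ValueError("spam_probability must be between 0 and 1")
    if spam_probability >= blatant:
        return "blatant spam"
    if spam_probability >= probable:
        return "probable spam"
    if spam_probability >= borderline:
        return "borderline"
    return "non-spam"
```

A mail client could then delete only the "blatant spam" bucket automatically and leave "borderline" in a folder the user glances at, so a false positive costs a skim rather than a lost message.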

  • by Anonymous Coward on Sunday November 03, 2002 @03:35PM (#4589523)
    A True Jedi Nerd would use compression based classification. Make two zip/gz/bz2/lzw/whatever archives, one containing known-not-spam and one containing known-spam. For each incoming mail, add to both archives, see which compresses better, bingo, that's the category it's supposed to be in. Obviously needs some tweaking (blocksizes etc.) but that's the gist of it.

    Apparently, it does work, though I can't whip out the references just now.

    Anyway, naive bayes is interesting mostly because it's so damn fast and only requires one pass through the data; and it works well, it just makes you feel stoopid because it's called "naive".
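
A rough sketch of the compression idea described above, using Python's standard `zlib` module. Instead of maintaining real archives it appends the message to each corpus in memory and compares how many extra compressed bytes that costs. The function names are made up for illustration, and ties are arbitrarily resolved as "ham".

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Size of the data after zlib compression at the highest level."""
    return len(zlib.compress(data, 9))


def extra_bytes(corpus: bytes, message: bytes) -> int:
    """Extra compressed bytes needed once the message is appended to the corpus."""
    return compressed_size(corpus + b"\n" + message) - compressed_size(corpus)


def classify(message: bytes, spam_corpus: bytes, ham_corpus: bytes) -> str:
    """Pick the corpus that absorbs the message with the smaller size increase."""
    spam_cost = extra_bytes(spam_corpus, message)
    ham_cost = extra_bytes(ham_corpus, message)
    # A tie is treated as ham, on the theory that a false positive is worse
    return "spam" if spam_cost < ham_cost else "ham"
```

Note that zlib only looks back 32 KB, so with large corpora this toy version would effectively compare the message against the tail of each corpus; that is one of the "tweaking" issues a serious implementation would have to deal with.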
  • by Lenbok ( 22992 ) on Sunday November 03, 2002 @03:41PM (#4589567)
    Actually, compression-based techniques don't work particularly well, mainly because they are very sensitive to the amount of training data. If you have a lot of non-spam mail, your non-spam compressor will compress better than your spam compressor.

    In the long view, all compression is machine learning anyway :-)
  • by archeopterix ( 594938 ) on Sunday November 03, 2002 @03:46PM (#4589598) Journal
    Hm... what about an anti-anti spam filter that mangles the message inserting random misspellings into the spam-identifying words? The bayesian filter would perceive this as a message consisting of many 'unclassified' words, just like a message in some unknown language. Sure, the short words probably haven't got many possible misspellings (cock, c0ck, coock, cokc - hm... starts to look undecipherable ), so they would probably get classified after some time. And this would hopefully lower the spam success ratio. But the possibility still remains...
  • Multi-purpose tool (Score:3, Interesting)

    by B'Trey ( 111263 ) on Sunday November 03, 2002 @04:10PM (#4589736)
    An interesting idea that I haven't seen discussed is using this concept for more general uses. If we can sort spam from non-spam, how about business from personal? Technical from administrative? All you'd need is multiple databases of word probabilities, the ability to assign emails to multiple categories and a hierarchical method of sorting.
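
The "multiple databases of word probabilities" idea can be sketched as a multi-category naive Bayes sorter with one word-count table per category. `CategorySorter` and its smoothing scheme are assumptions made for this example, not a description of any existing tool.

```python
import math
import re
from collections import Counter, defaultdict


class CategorySorter:
    """Toy multi-category naive Bayes: one word-count table per category."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # category -> word -> count
        self.message_counts = Counter()          # category -> messages seen
        self.vocabulary = set()

    @staticmethod
    def tokenize(text):
        return re.findall(r"[a-z0-9']+", text.lower())

    def train(self, text, category):
        """Add one message to the word table of the given category."""
        words = self.tokenize(text)
        self.word_counts[category].update(words)
        self.message_counts[category] += 1
        self.vocabulary.update(words)

    def classify(self, text):
        """Return the trained category with the highest log-probability, or None."""
        total_messages = sum(self.message_counts.values())
        best_category, best_score = None, -math.inf
        for category, seen in self.message_counts.items():
            counts = self.word_counts[category]
            # Add-one smoothing; the extra +1 reserves a slot for unknown words
            denominator = sum(counts.values()) + len(self.vocabulary) + 1
            score = math.log(seen / total_messages)
            for word in self.tokenize(text):
                score += math.log((counts[word] + 1) / denominator)
            if score > best_score:
                best_category, best_score = category, score
        return best_category
```

A hierarchy could be layered on top by chaining sorters: one instance trained on spam versus non-spam, and a second trained on business versus personal that only sees mail the first one let through.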
  • by PigleT ( 28894 ) on Sunday November 03, 2002 @04:21PM (#4589814) Homepage
    Dual feedback loops. Every mail that matches spam gets fed back into the system so both the is-spam wordlist AND the is-good wordlists become more "concentrated" over time.
    Ifile does this, bogofilter does this with some wangling in procmail, ...

    That way, if someone sends something that's still mostly spam (one or two words in common with spam, enough to tip the balance) then all the neutral words will be tarnished as well.
  • by archeopterix ( 594938 ) on Sunday November 03, 2002 @04:45PM (#4589952) Journal
    Dual feedback loops. Every mail that matches spam gets fed back into the system so both the is-spam wordlist AND the is-good wordlists become more "concentrated" over time. Ifile does this, bogofilter does this with some wangling in procmail, ... That way, if someone sends something that's still mostly spam (one or two words in common with spam, enough to tip the balance) then all the neutral words will be tarnished as well.
    This is clever, but might have some undesirable side effects. Suppose a spammer attaches a long list of neutral words to his e-mail in order to 'dilute' the bad words. This way some innocent words might get assigned positive spam probability thus resulting in false positives later.
  • by Tim Browse ( 9263 ) on Sunday November 03, 2002 @04:51PM (#4589991)
    One interesting fact that came out of these statistical analyses of spam was from one that was featured a while back on slashdot - the guy was doing word analysis, and was looking for good spam indicators/correlations, and expected "sex" or "teens" to be a good match, but the best word was, surprisingly, "ff0000". This was because so much spam uses HTML mail with red text.

    So if nothing else, it will force spammers to stop using red text - that has to be some kind of victory :-)

    Tim
  • by devphil ( 51341 ) on Sunday November 03, 2002 @07:30PM (#4590909) Homepage


    So, the graduate CS course I'm taking this quarter is Evolutionary Computing, which is all about the convoluted nonlinear multidimensional-search-space problems, and guess what our current homework is? That's right, taking statistics on spam data, and using genetic algorithms to evolve a working spam filter.

    Due to one typo and two thinkos in my fitness evaluation function, my algorithm evolves -- within only a few dozen generations -- a solution which looks like this:

    Ignore the actual contents of the message. 34% of the time, it's spam.

    And it's right.

  • by Anonymous Coward on Sunday November 03, 2002 @08:09PM (#4591135)
    Actually, in my experience, spam is written by very intelligent people to look a very specific way to reach a very specific audience.

    There is nothing accidental or slap-dash about the layout, or use of colour, or any of the factors involved in laying out an email that will generate sales. I know this because it's my job to know about - I'm in the porn business.

    You might hate spam - I know I do - but it works. It works very well. And the way the email looks makes it work best of all.
  • SpamAssassin (Score:3, Interesting)

    by fireboy1919 ( 257783 ) <rustyp AT freeshell DOT org> on Sunday November 03, 2002 @08:59PM (#4591345) Homepage Journal
    This seems to be about using strange approaches to spam filtering, but really... a Bayesian network seems to be a natural step for a system that until now was composed of a series of heuristics with no knowledge of which is more important.

    (Why hasn't it been done? Bayesian networks are only taught in AI and statistics classes).

    What really interests me is that SpamAssassin claims to use a genetic algorithm [spamassassin.org] to rate how likely an e-mail is to be spam.
