Working Bayesian Mail Filter

zonker writes "A real, working, honest-to-god Bayesian spam filter. I've been waiting for something like this for a while (since I first read Paul Graham's research paper on this very topic a few weeks ago). Well, here's POPFile, a small but extremely effective Perl script that runs on just about any system Perl does. After just a little training I was able to get very effective filtering out of it. From what I understand, the new email client that comes with OS X Jaguar has a feature similar to this, but I don't know if it is true Bayesian. Hopefully this kind of feature will become more prevalent in client software, as I see the Google results for it are growing."
This discussion has been archived. No new comments can be posted.

  • spambayes.sf.net (Score:5, Informative)

    by supton ( 90168 ) on Sunday November 03, 2002 @02:10PM (#4589019) Homepage
    Saw this a few weeks back... [sf.net] Spam filter in Python using Naive Bayes.
  • Re:Whas that? (Score:4, Informative)

    by DalTech ( 575476 ) on Sunday November 03, 2002 @02:16PM (#4589064)
    Bayesian statistics is a body of statistical theory and methods useful for solving theoretical and applied problems in science, industry and government. http://www.bayesian.org/
  • bogofilter (Score:4, Informative)

    by stype ( 179072 ) on Sunday November 03, 2002 @02:18PM (#4589084) Homepage
    This isn't exactly the first Bayesian mail filter out there. I've been using ESR's bogofilter [tuxedo.org] for weeks now, and I must say it works better than I could have ever imagined. Bogofilter, however, is simply for sorting out spam, while it appears this filter can sort out other things. But honestly, I can set up some simple filters to separate personal emails from work emails, so I'm not entirely sure the extra stuff is that useful.
  • Bayes Explained (Score:1, Informative)

    by brw215 ( 601732 ) on Sunday November 03, 2002 @02:18PM (#4589088) Homepage
    A naive Bayes classifier is an algorithm based on Bayes' theorem in mathematics. It rests on the following theorem:

    Pr(h|D) = Pr(D|h) * Pr(h)

    where Pr is probability, h is the hypothesis and D is the data. In this case it would be

    Pr("SPAM"|Email) = Pr(Email|"SPAM") * proportion of spam.

    The trick is how to estimate the second term. This is a very popular machine learning algorithm due to its simplicity and elegance. For more info, check out this link Bayes [cmu.edu]

  • Re:Whas that? (Score:5, Informative)

    by dvk ( 118711 ) on Sunday November 03, 2002 @02:19PM (#4589094) Homepage
    From what I understand, it is a mail filter which determines what to filter out based on a statistics-based machine learning system called "Bayesian Learning".

    A couple of URLs quickly found on Google:
    http://www.faqs.org/faqs/ai-faq/neural-nets/part3/section-7.html [faqs.org]
    http://www.csse.monash.edu.au/courseware/cse5230/assets/images/week09.pdf [monash.edu.au]

    Also, any decent AI/machine learning textbook ought to cover the topic.

    -DVK

  • by outlier ( 64928 ) on Sunday November 03, 2002 @02:26PM (#4589136)
    While spammers will undoubtedly continue to refine the content of their messages, one of the strengths of using a Bayesian filter like this is that it uses the user's own spam and non-spam (ham) as the basis for its calculations. This means that messages are categorized not only by whether they contain spammy words, but also by whether they contain the hammy words from your own messages. So, even if spammers could refrain from using words like "free", "mortgage", "sluts" and "spam", they probably wouldn't use words that discriminate your own ham from others (e.g., if you are a computer scientist, your mail may include hammy words like "algorithm", "compile", "project" or "stargate" that would help distinguish ham from spam). The challenge to the spammer would then be to target you with spam that looks like *your* ham (which is probably different from the ham of others).

    Future systems (assuming faster processors and more HD space) could include semantic analysis (e.g., Latent Semantic Analysis) to do an even better job and go beyond the word level.
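
The idea of scoring a message against word counts taken from one user's own spam and ham can be sketched as a very small naive Bayes classifier. This is only an illustration, not the code of POPFile, bogofilter, or any other filter named in this thread; the training messages, the add-one smoothing, and the names `train` and `spam_probability` are all assumptions made for the example.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())

def train(messages):
    """Count how often each token appears across a list of messages."""
    counts = Counter()
    for message in messages:
        counts.update(tokenize(message))
    return counts

def spam_probability(text, spam_counts, ham_counts, prior_spam=0.5):
    """Return Pr(spam | text) under the naive independence assumption.

    Add-one smoothing keeps an unseen word from zeroing out a class.
    """
    vocab = set(spam_counts) | set(ham_counts)
    spam_total = sum(spam_counts.values()) + len(vocab)
    ham_total = sum(ham_counts.values()) + len(vocab)
    # Work in log space so long messages do not underflow.
    log_spam = math.log(prior_spam)
    log_ham = math.log(1.0 - prior_spam)
    for token in tokenize(text):
        log_spam += math.log((spam_counts[token] + 1) / spam_total)
        log_ham += math.log((ham_counts[token] + 1) / ham_total)
    diff = log_ham - log_spam
    if diff > 700:  # exp() would overflow; the message is overwhelmingly ham
        return 0.0
    # Normalize the two class scores into a single probability.
    return 1.0 / (1.0 + math.exp(diff))

# Hypothetical training data standing in for one user's own mail.
spam_counts = train(["free mortgage offer", "free free prizes click now"])
ham_counts = train(["compile the project tonight", "new algorithm for the project"])

print(spam_probability("free mortgage", spam_counts, ham_counts))
print(spam_probability("project algorithm", spam_counts, ham_counts))
```

Because the ham counts come from the user's own mail, a word like "project" pulls the score toward ham for this user even though it might be neutral for someone else.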
  • Re:Bayes Explained (Score:5, Informative)

    by johnynek ( 36948 ) <boykin@pobox.com> on Sunday November 03, 2002 @02:36PM (#4589191) Homepage
    That's /. for you. You guys have modded up to 5 a post that is wrong in both of the equations it posts.

    It should be:

    Pr(h|D) = Pr(D|h) * Pr(h) / Pr(D)

    and:

    Pr("SPAM"|Email) = Pr(Email|"SPAM") * (proportion of spam) / (probability of getting this particular Email)
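
A short numeric sketch of the full form, with the denominator Pr(Email) expanded over the two hypotheses "spam" and "not spam" by the law of total probability. The input probabilities are invented for illustration and the function name `posterior_spam` is hypothetical.

```python
def posterior_spam(p_email_given_spam, p_email_given_ham, p_spam):
    """Bayes' theorem: Pr(spam | email) = Pr(email | spam) * Pr(spam) / Pr(email)."""
    p_ham = 1.0 - p_spam
    # Pr(email) by the law of total probability over the two hypotheses.
    p_email = p_email_given_spam * p_spam + p_email_given_ham * p_ham
    return p_email_given_spam * p_spam / p_email

# Hypothetical inputs: the content is 20 times likelier under "spam",
# and 40% of all incoming mail is spam.
result = posterior_spam(p_email_given_spam=0.02, p_email_given_ham=0.001, p_spam=0.4)
print(round(result, 3))
```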
  • by ptbarnett ( 159784 ) on Sunday November 03, 2002 @02:37PM (#4589200)
    Plugins - BayesSpam - Intelligent Spam Filter [squirrelmail.org]

    SquirrelMail [squirrelmail.org] is a WebMail client implemented in PHP. I use the client, but not the plugin (I use Razor [sourceforge.net]).

  • by Jamuraa ( 3055 ) on Sunday November 03, 2002 @02:42PM (#4589231) Homepage Journal
    Bogofilter [tuxedo.org] has been out since August, and does this Bayesian spam-stuff in C, which probably will run a bit faster than the Perl or Python versions just because of its compiled-ness. I've never run it myself, but people on Debian lists say it works better [debian.org] or not as well [debian.org] as spamassassin [spamassassin.org].
  • by Anonymous Coward on Sunday November 03, 2002 @02:44PM (#4589244)
    I've been doing research into email filtering using AI, and SVMs/kernel machines seem to work well (statistically, they're correct more than the other methods), but they require massive tuning.

    On the other hand, Naive Bayes is usually easier to implement, easier to tune, and only trails by a few percentage points.

    One of the more promising Bayes units is autoclass, offered by Cheeseman (et al.), a public-domain classifier that's been around for years and years and seems to perform quite nicely.

  • by rgmoore ( 133276 ) <glandauer@charter.net> on Sunday November 03, 2002 @02:47PM (#4589262) Homepage

    Another important point is that there are some things that they can't hide, at least not in their current working model. If they're trying to sell you something, they have to describe what that thing is and where you can get it, and those descriptions are unlikely to be in any legitimate email. If they want to advertise a web site, they have to include its URL in the message, and the filter can catch that. If they advertise a physical address or phone number, the system can catch those, too. If they don't repeat the message, it means that there's inherently less spam, because I'm only seeing each ad once.

    It's also not possible to disguise everything in their headers, so things like their posting host (either the one they pay for legitimately or any open relay they're taking advantage of) will wind up being a pointer to who they are. They certainly can't change anything about the headers that's added downstream of their posting host, so as long as they keep using the same one it's likely that there will be characteristic stamps there that the spammers absolutely can't change. I know that analysis of the headers is part of bogofilter [sourceforge.net], another Bayesian filter that I've been using to good effect.

  • by rgmoore ( 133276 ) <glandauer@charter.net> on Sunday November 03, 2002 @02:56PM (#4589313) Homepage

    Bogofilter [sourceforge.net] comes close to this. It has an operating mode where each file that it filters is automatically added to the appropriate corpus, either of spam or non-spam. Since it's correct the vast majority of the time, that means that there's very little for the user to do. When it is wrong, you just take the messages that it miscategorized and feed them back into the system with the notation that they were originally marked incorrectly, and it backs out the changes to the wrong category and adds them to the correct category.

    I'm using bogofilter with Evolution [ximian.com], and it works very well. I just have two extra folders, one for false negatives and one for false positives. When I notice mail that's been flagged incorrectly, I drag it into the appropriate folder and run a script that tells bogofilter to correct its mistake. Then I either flush the mail (if it was spam marked as non-spam) or process it normally (if it was non-spam marked as spam). I've only been using it for about two weeks and it already has a nearly zero false positive rate (i.e. incorrectly flagged as spam) and a usefully low false negative rate (i.e. incorrectly flagged as legitimate).
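
The correction step described above (back a message out of the wrong category, then add it to the right one) can be sketched with two token counters. This is an assumed toy model, not bogofilter's actual storage format or code; the class `Corpus` and its method names are hypothetical.

```python
from collections import Counter

class Corpus:
    """Token counts for the spam and non-spam categories."""

    def __init__(self):
        self.counts = {"spam": Counter(), "ham": Counter()}

    def register(self, tokens, category):
        """Add a message's tokens to one category."""
        self.counts[category].update(tokens)

    def correct(self, tokens, wrong, right):
        """Back a message out of the wrong category and add it to the right one."""
        self.counts[wrong].subtract(tokens)
        # In-place addition of an empty Counter drops zero and negative entries.
        self.counts[wrong] += Counter()
        self.counts[right].update(tokens)

corpus = Corpus()
message = ["cheap", "pills", "cheap"]
corpus.register(message, "ham")   # the filter guessed wrong and auto-registered it
corpus.correct(message, wrong="ham", right="spam")
print(corpus.counts)
```

Keeping the undo and redo in one operation means a misclassified message never leaves its tokens counted in both categories.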

  • by bstadil ( 7110 ) on Sunday November 03, 2002 @02:57PM (#4589316) Homepage
    You know what I'd kill for?

    It might be smarter to read the article, than killing someone.

    You could have installed the program for Outlook in the time it took you to type your rant, but then you would not get any mod points, would you?

  • Re:Whas that? (Score:5, Informative)

    by sfe_software ( 220870 ) on Sunday November 03, 2002 @03:02PM (#4589341) Homepage
    If you had just clicked the POPFile link, you would see the explanation.

    I also highly recommend this link [paulgraham.com], as it goes into quite a lot of detail on this filtering technique. After reading it, I am going to give the Perl variation a shot.
  • Re:Bayes Explained (Score:4, Informative)

    by B'Trey ( 111263 ) on Sunday November 03, 2002 @03:07PM (#4589356)
    Read the referenced article. The only way to avoid the filter is to make your email sound like a normal message. In essence, the filter recognizes the sales pitch. If you remove the sales pitch to get your spam past the filter, you've removed the whole point of sending the spam.
  • Where's the news? (Score:4, Informative)

    by Roadmaster ( 96317 ) on Sunday November 03, 2002 @03:09PM (#4589364) Homepage Journal
    Just because it's the first one that actually makes the slashdot frontpage it doesn't mean it's the only one.

    Do a freshmeat search for bayespam, bogofilter and spamprobe, they're all working and quite mature bayesian filters (or should we say "paulgrahamian" in order to appease the "true bayesian" crowd). Hell, even a search for "bayes" will turn out a few more hits, like ifilter, which aims to automatically classify mail in different folders, but could be easily tuned to filter out spam.

    Of these, I think spamprobe is becoming the true "swiss army knife" of "bayesian" filtering; I did find both bogofilter and bayespam spartan, but they work well. spamprobe, on the other hand, is very actively maintained, is under constant improvement by the author, Brian Burton, and has given me excellent results getting rid of over 90% of my spam.
  • by rgmoore ( 133276 ) <glandauer@charter.net> on Sunday November 03, 2002 @03:20PM (#4589405) Homepage

    But perl scripts are just as easy to run as .exe files, so long as you have the perl interpreter installed. So now it's just a two step process:

    1. Install perl.
    2. Install the perl script.

    This is not exactly brain surgery. Perl can be installed on essentially any system you choose to name, with no more trouble than installing any other executable. For those people running Windows, there's an excellent port available from Activestate [activestate.com]. As somebody else pointed out, this means that a perl script is actually available to more people than a .exe would be, because it's truly cross-platform.

  • by ceswiedler ( 165311 ) <chris@swiedler.org> on Sunday November 03, 2002 @03:34PM (#4589519)
    I don't think you're talking about the Skinner box [uni-wuerzburg.de], which is a device used in the psychology of learning, but rather the Chinese room [wustl.edu], which is John Searle's take on AI and the Turing test.
  • by dzym ( 544085 ) on Sunday November 03, 2002 @03:38PM (#4589550) Homepage Journal
    Yes, but remember, who runs the SMTP servers?

    The very design of the whole system specifies that anyone can just turn on a machine, hook it up to a network somewhere, and start spewing out messages to smtp ports all over the world.

    It doesn't have to be a sendmail, qmail, or exim server, remember. Some Windows viruses have taken advantage of that loophole to set up mini-SMTP servers in the network stack to continue propagating viruses without needing to connect to anything that provides authenticated external relay.

  • Re:Ximian Evolution? (Score:2, Informative)

    by rgmoore ( 133276 ) <glandauer@charter.net> on Sunday November 03, 2002 @03:40PM (#4589559) Homepage

    With some cleverness, you can use any outside filter with the most recent version (i.e. the development fork) of Evolution. They've added the ability to pipe incoming messages to an outside program and read back the exit code. So if the program is written using standard Unixisms (i.e. it reads on standard input and returns a different value depending on whether the incoming message is spam or not), it can be used with Evolution. I know that bogofilter [sourceforge.net] can do this because I'm using it with Evolution and it works pretty well.
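
A minimal sketch of the "standard Unixism" described above: read one message from standard input and report the verdict through the exit code. The exit-code convention, the keyword check, and the function names are assumptions for this example; check your own filter's documentation for the values it really returns.

```python
import io
import sys

SPAM_EXIT = 0   # assumed convention for this sketch: 0 means "spam"
HAM_EXIT = 1    # and 1 means "not spam"

def looks_like_spam(message_text):
    """Placeholder check; a real filter would compute a Bayesian score here."""
    return "viagra" in message_text.lower()

def main(stream=sys.stdin):
    """Read one message from the stream and return the exit code for it."""
    message_text = stream.read()
    return SPAM_EXIT if looks_like_spam(message_text) else HAM_EXIT

# Demonstration with an in-memory stream instead of real standard input.
print(main(io.StringIO("Subject: hi\n\nCheap viagra here")))
```

As a standalone script, the last line would instead be `sys.exit(main())`, so that the calling mail client can read the verdict from the process exit status.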

  • Missing the point? (Score:5, Informative)

    by crisco ( 4669 ) on Sunday November 03, 2002 @03:46PM (#4589600) Homepage
    I think lots of people here are missing the point of POPFile. Everyone is happy to point out that there are already several assorted solutions to Bayesian mail filtering in many different languages. Nearly all of these work on the mail server. Now lots of us are qualified and interested in setting up our own mail server, customizing the mail processing in our own One True Way and happily enjoying an inbox free of spam. But the average Windows user has no idea how to set up a mail server. Others could easily do it but feel their time is better spent on other things, not admining a mail server.

    This is what POPFile is for. It's a POP3 proxy server: it sits between your POP3 client and the server and simply adds a classification to the headers (or the subject line for braindead mail clients).

    Currently POPFile is a bit rough on computer newbies; it needs a Perl install and such. However, if you read the forums, it is intended to end up as an easily installed executable for Windows users and to remain a nifty little Perl script for the rest of the platforms where it might come in handy. So when those pesky friends and relatives come asking about all the viagra and farmyard spam they get (and you haven't already set them up on your tightly filtered mail server), set up POPFile for them.

    Also, it's not just for spam filtering. Think of what you could do if you could go beyond simple rules for your inbox. Want email you think is important forwarded to your phone? Create a category for important email and go through your archives and feed POPFile email you would have wanted forwarded instantly. Create a new folder to receive those mails and watch it for a few days, retraining POPFile until it is getting reasonably good at putting important mail in there. Now set up your mail system to forward those to your phone. Will it work? I don't know, but based on the results I'm getting, it probably would. How about using it to filter help desk emails?
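
The header-tagging step a POP3 proxy performs on each retrieved message can be sketched with Python's standard `email` package. The network side of the proxy is left out, and the header name `X-Text-Classification` and the function `tag_message` are assumptions made for this example rather than a statement of what POPFile emits.

```python
from email import message_from_string

def tag_message(raw_message, bucket, header="X-Text-Classification"):
    """Return the message text with a classification header added."""
    msg = message_from_string(raw_message)
    del msg[header]        # remove any stale copy; no error if it is absent
    msg[header] = bucket
    return msg.as_string()

raw = "From: a@example.com\nSubject: Hello\n\nBody text\n"
tagged = tag_message(raw, "spam")
print(tagged)
```

The mail client can then sort on that header with an ordinary filter rule, which is why no mail-server access is needed.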

  • by marmoset ( 3738 ) on Sunday November 03, 2002 @03:47PM (#4589604) Homepage Journal
    Over the last month or so, I've received a few really strangely worded porn spams that seem to be engineered so as not to trip ISP porn filters. They use lots of passive verbs, no exclamation points, no HTML, and dictionary definitions of whatever kink the spammer is selling.

    Since I use Jaguar's mail client, I just told it that these were spam too and now it catches them by itself. :)
  • Re:Bayes Explained (Score:4, Informative)

    by Jim Nugent ( 619564 ) on Sunday November 03, 2002 @03:55PM (#4589647)
    To put this in simpler terms, consider this scenario: 90% of all X-rays that have a certain feature are from women with breast cancer. That is an easy statistic to compute; you have the X-rays and you follow up with the patients.

    The trick is to derive a statement like: "If an X-ray has this feature, the patient has an NN% chance of having breast cancer." THAT's useful for screening, but it doesn't follow from the first statement (without some serious statistical calculations).

    Bayes' theorem has all sorts of applications in prediction. In the case of E-mail, we can greatly oversimplify and say "We found that X% of E-mails with this subject line are Spam." "We conclude that an E-mail with this subject line has Y% odds of being spam." Note that these are two very different statements. If we can find Y for the second statement and set a threshold we're comfortable with, say, 95%, then we can create a filter with 95% confidence of correctness; it may well be wrong 5% of the time.

    Other responses have done a good job with the math so I won't repeat it here.
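
To make the gap between the two kinds of statement concrete, here is a small worked example with invented numbers. It assumes the measured statistic is the share of cancer patients whose X-rays show the feature (90%), and it adds two figures the comment does not give: a 1% base rate and a 10% rate of the feature among healthy patients.

```python
# Hypothetical screening population; every number here is made up.
patients = 10_000
with_cancer = 100                      # assumed 1% base rate
healthy = patients - with_cancer

feature_and_cancer = with_cancer * 90 // 100   # 90% of cancer X-rays show the feature
feature_and_healthy = healthy * 10 // 100      # assumed 10% of healthy X-rays show it too

# Bayes' theorem in count form: of all X-rays with the feature, how many are cancer?
p_cancer_given_feature = feature_and_cancer / (feature_and_cancer + feature_and_healthy)
print(f"{p_cancer_given_feature:.1%}")   # 8.3%
```

Under these assumptions the feature appears in 90% of cancer cases, yet a patient who shows it has only about an 8% chance of having cancer, because healthy patients vastly outnumber sick ones.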
  • Re:Bayes Explained (Score:1, Informative)

    by Anonymous Coward on Sunday November 03, 2002 @04:12PM (#4589755)
    I think that the original poster dropped the /Pr(D) term because, in the *cough* referenced articles *cough* they dropped it, since they were only comparing the different Pr(h|D)'s among various email folders ("buckets"), and the /Pr(D) term was the same in all of them.

    Thanks for posting the (correct) general form of the equation, though.
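
A quick check of that point: when the only goal is to pick the most probable bucket for one email, dividing every bucket's score by the same Pr(D) cannot change the winner. The bucket names and probabilities below are hypothetical.

```python
def unnormalized_scores(likelihoods, priors):
    """Pr(D | h) * Pr(h) for each bucket h, without dividing by Pr(D)."""
    return {h: likelihoods[h] * priors[h] for h in priors}

def normalized_scores(likelihoods, priors):
    """Full posteriors Pr(h | D), dividing every bucket by the same Pr(D)."""
    scores = unnormalized_scores(likelihoods, priors)
    p_data = sum(scores.values())
    return {h: s / p_data for h, s in scores.items()}

# Hypothetical buckets and probabilities for a single email.
likelihoods = {"spam": 0.0004, "work": 0.0030, "personal": 0.0010}
priors = {"spam": 0.5, "work": 0.3, "personal": 0.2}

raw = unnormalized_scores(likelihoods, priors)
full = normalized_scores(likelihoods, priors)
print(max(raw, key=raw.get), max(full, key=full.get))
```

The denominator only matters when an actual probability is needed, for example to compare against a fixed threshold such as 0.9.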
  • by Anonymous Coward on Sunday November 03, 2002 @04:32PM (#4589873)
    They make false assumptions about the data (independence of features).


    NOT TRUE! The Bayesian approach can use the full correlation matrix without diagonalization, e.g., you can write the algorithm to correctly account for the fact that the probability of word A, given that word B is also in the email, is not the product of the probabilities of A and B separately. The only downside is that the number of weights the database contains goes as N^2, so storage space and speed can suffer.

  • by crucini ( 98210 ) on Sunday November 03, 2002 @05:11PM (#4590130)
    It would be great if someone ported this, to an .exe file or something that everyone could run.

    I don't think an .exe would help much - a Windows user doesn't need a standalone executable. He needs a filter (probably a .dll) coded to the specific filtering API of his mail client. Or does Microsoft have a generic mail filtering API? That way the filter seems to run "inside" the mail client.

    In general this illuminates one of the advantages of Unix. Lots of programs are written as filters that read from STDIN (standard input) and write to STDOUT (standard output). My own mail filtering script, for example, does that. I didn't have to learn any mailer-specific API, and my script can be used in different contexts. (Actually my script doesn't write to STDOUT - it saves the message to the appropriate folder.)

    Windows does not lend itself to the everything-is-a-filter idea because, among other things, process creation is slow and expensive. When a filter is invoked, a process is launched. Unix has more efficient process creation, and Linux has especially efficient and light process creation. Therefore on Windows a mail filter should be implemented as a reusable software component (probably a COM object) that can be called by the mail client.

    Also, most mail clients on Unix use the same mail folder format (mbox) which is basically just the literal messages from the network written to a file. Since it is the assumed common language of mail folders, it encourages software to interoperate on the file level, which my script does by writing messages to mail folders. (Unix is file-centric.) Windows mail clients, in contrast, seem to store mail folders in proprietary formats. That's because Windows philosophy is that an application serves as gatekeeper to "its" files - the file is not a unit of interoperability. In our case it means a standalone mail filter probably couldn't write messages to the mail folder.

    Unix is a more friendly, efficient development environment because you can write a mail filter as a standalone program and test it without building a test harness.
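
The file-level interoperability described above can be sketched with Python's standard `mailbox` module, which reads and writes the mbox format. The folder path and the function name `deliver` are hypothetical, and the sketch skips file locking, which a real delivery script would need when a mail client may have the folder open.

```python
import mailbox
import os
import tempfile
from email import message_from_string

def deliver(raw_message, folder_path):
    """Append one message to an mbox folder, creating the file if needed."""
    box = mailbox.mbox(folder_path)
    try:
        box.add(message_from_string(raw_message))
        box.flush()
    finally:
        box.close()

# Use a throwaway directory so the example does not touch real mail.
folder = os.path.join(tempfile.mkdtemp(), "spam")
deliver("From: a@example.com\nSubject: test\n\nhello\n", folder)
print(len(mailbox.mbox(folder)))
```

Because the folder is just a file in a shared format, the filter needs no knowledge of which mail client will later open it.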
  • by disarray ( 108 ) on Sunday November 03, 2002 @05:55PM (#4590366)
    Future systems (assuming faster processors and more HD space) could include semantic analysis (e.g., Latent Semantic Analysis) to do an even better job and go beyond the word level.

    Welcome to the future: the mail client [apple.com] in Mac OS X 10.2 uses latent semantic analysis. (This isn't just marketingspeak--my mail folder includes "LSMMap"--LS as in "latent semantic".)

  • Re:spambayes.sf.net (Score:2, Informative)

    by mpieters ( 149981 ) on Sunday November 03, 2002 @06:54PM (#4590676) Homepage
    Note that the spambayes core has been developed by Tim Peters of the PythonLabs team, someone who has tons of experience with statistical schemes and the fine-tuning of them. The results from this filter so far have been phenomenal.
  • Re:Bayes Explained (Score:2, Informative)

    by brw215 ( 601732 ) on Sunday November 03, 2002 @06:59PM (#4590711) Homepage
    Actually I didn't forget it. Typically in a Bayesian expression the denominator Pr(D) is dropped, meaning there is no more probability of any one email than any other.
  • by JPZ ( 42691 ) on Sunday November 03, 2002 @07:03PM (#4590733)
    A true Bayesian filter, wow. Let's face it, statistical classifiers based on Bayes' formula are not really state of the art. They make false assumptions about the data (independence of features).

    Bullshit. Bayes' formula is exact, and makes no assumption on independence whatsoever. Naive Bayesian approaches make independence assumptions, hence the use of the term naive.

    The only inherent drawback in using Bayes' rule in classifiers is that you have to assume the number of classes to be known a priori.

    JPZ
  • by Zuke8675309 ( 470025 ) <ty DOT zucker AT gmail DOT com> on Sunday November 03, 2002 @07:09PM (#4590768)
    Exactly true. POPfile isn't just about filtering spam. It's about sorting email. Slightly different. One could think of the nuance this way: out of all the email you get, you could teach POPfile to filter out the GOOD email and delete everything else. I've found POPfile extremely useful for bringing order to the clutter of my inbox. I have buckets for spam, fantasyfootball, personal, and several work-related subject matters. I just pull up the web interface, classify the messages properly and POPfile works its magic.
  • by barfy ( 256323 ) on Monday November 04, 2002 @04:29AM (#4592944)
    This whole methodology is already patented by Microsoft. ANY implementation not licensed by Microsoft is going to be a violation... And now that you know, it is treble damages...

    patent 6,161,130 [uspto.gov]
  • by NNland ( 110498 ) on Monday November 04, 2002 @01:43PM (#4594426) Homepage
    I hate to mention this, but I will anyways.

    Popfile was announced here in late August, shortly after the Paul Graham article came out. It was originally closed source, which prompted the creation of multiple other projects. Among them is Spambayes [sourceforge.net] and even my own Pasp [sourceforge.net] (both in python, both open source).

    As well, Popfile was announced open source at the end of September...on Slashdot. I know this because it was released under such a license as I was finishing up Pasp.

    So yeah. As for how well Popfile categorizes mail into multiple categories, I have not run many tests with multiple category bayesian filtering, though the Spambayes group has, and has discovered that filtering mail based on multiple categories is far less accurate (many false categorizations). In the minimal tests I have done, I find this to be the case as well (we are used to less than 2% FP and FN rates, and with >2 bin categorization, error rates spike easily into the 10% range).

    So yeah. Popfile has been announced here no less than 3 times now. I've not seen Spambayes announced at all (they deserve it), and Pasp has also not been announced, though I could care less about that.
