
More on Bayesian Spam Filtering 251

Posted by michael
from the snake-eyes dept.
michaeld writes "The "Bayesian" technique for spam filtering recently publicized in Paul Graham's essay A Plan for Spam doesn't actually seem to have anything Bayesian about it, according to Gary Robinson (an expert on collaborative filtering). It is based on a non-Bayesian probabilistic approach. It works well enough, because it is frequently the case that technology doesn't have to be 100% perfect in order to do something that really needs to be done. The problem interested Robinson, and he posted his thoughts about trying to fix the problems in the Graham approach, including adding an actual Bayesian element to the calculations."
This discussion has been archived. No new comments can be posted.

  • by Jeffrey Baker (6191) on Tuesday September 17, 2002 @04:40PM (#4276220)
    I'd like to hear the results of anyone who has implemented one of these probabilistic filtering systems. I implemented a modified version of Paul Graham's system and so far it kicks ass. It has trapped over 600 spams without any false positives. I receive almost 100 spams a day and over the last week I have generally only had to delete one or two by hand. The rest go directly to jail.

    I'd like to hear about modifications to this system. I removed Graham's doubling of "good" word frequencies, and I trained my filter using digrams. I also tried all the various methods supplied by the program "rainbow", with good results, but the implementation was too slow and clunky to place in the middle of my email delivery system. What are other possible modifications?
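
    For readers who want to try the two modifications mentioned above, here is a minimal Python sketch. It is not the poster's Perl code. The function names, the tokenizing regexp, and the choice to keep single words alongside digrams are all assumptions made for illustration. The per-token formula follows the one in Graham's essay, with the doubling of good counts exposed as a hypothetical `good_weight` parameter.

```python
import re

def tokens(text, use_digrams=True):
    """Split text into lowercase words and, optionally, adjacent word pairs.

    Assumption: single words are kept alongside the digrams; the original
    post does not say whether digrams replaced or supplemented words.
    """
    words = re.findall(r"[A-Za-z0-9$'-]+", text.lower())
    if not use_digrams:
        return words
    digrams = [a + " " + b for a, b in zip(words, words[1:])]
    return words + digrams

def token_spam_probability(good_count, bad_count, ngood, nbad, good_weight=1):
    """Per-token spam probability in the style of Graham's essay.

    good_weight=2 doubles the good counts as the essay does;
    good_weight=1 removes the doubling. ngood and nbad are the numbers
    of good and spam messages in the training corpora (both > 0).
    """
    g = good_weight * good_count
    b = bad_count
    if g + b < 5:
        return None  # token seen too rarely to be trusted
    good_ratio = min(1.0, g / ngood)
    bad_ratio = min(1.0, b / nbad)
    p = bad_ratio / (good_ratio + bad_ratio)
    # Clamp so that no single token is treated as absolute proof.
    return max(0.01, min(0.99, p))
```

    With `good_weight=1`, a token seen 2 times in 100 good messages and 8 times in 100 spams scores about 0.8. With `good_weight=2` the same token scores about 0.67, which shows how the doubling biases the filter against false positives.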

  • by ajm (9538) on Tuesday September 17, 2002 @04:41PM (#4276223)
    ...is in the eating. I think the same applies to spam. Paul showed, to his satisfaction, that the technique he used worked for his samples. Gary proposes some changes that would improve the filter's accuracy, but does not test these theories.

    We will now have many Slashdot posts saying "I've not tested this but I think A (or B, or C, or X)".

    Here's where the scientific method comes into its own. Anyone who cares enough can actually test and post their results. I'd be interested in seeing what they look like. I don't have a database of spam to test against (and please don't volunteer to sign me up for some :) but it would be interesting to see whether what looks convincing in theory pays off in practice.
  • by Anonymous Coward on Tuesday September 17, 2002 @04:48PM (#4276276)
    Finally it is worth mentioning that if you really want to go a 100% Bayesian route, try Charles Elkan's paper, "Naive Bayesian Learning". That approach should work very well but is a good deal more complicated than the approach described above.
    Here is the article [nec.com][citeseer.nj.nec.com]
  • by kwerle (39371) <kurt@CircleW.org> on Tuesday September 17, 2002 @04:48PM (#4276282) Homepage Journal
    I implemented Paul's system without the changes you mentioned, and am seeing >95% success (and climbing). 0 false positives. I will be submitting it to sourceforge this week.
  • by Anonymous Coward on Tuesday September 17, 2002 @04:50PM (#4276293)
    Is this what the new Mail.app in Mac OS X 10.2 uses?

    I, myself, am not sure, but the new Mail.app is smart and it does learn. After a week of "learning" it has correctly identified messages as spam more than 99 times out of 100.

  • by ShakaUVM (157947) on Tuesday September 17, 2002 @04:53PM (#4276323) Homepage Journal
    At UCSD, Bob Boyer and I wrote a neural net spam filter. Neural Nets, as everyone knows, are not really like biological brains, but really just statistical engines similar to the approach the guy above claimed to do.

    Our approach worked pretty well (95-97% accuracy), and we had to deal with the same issues that the above "Bayesian" approach did. I.e., weighting the neurons so that false positives occur much less frequently than false negatives, etc. We built it using data on spam collected from the UCI machine learning repository.

    It ties in with procmail. I'm not really a windows guy, so if anyone knows how to put a filter between an IMAP server and Microsoft Outlook/Netscape Communicator, I'd be interested in hearing how it's done.

    The README for it is at: http://www-cse.ucsd.edu/~wkerney/spamfilter.README
    And you can download it at:
    http://www-cse.ucsd.edu/~wkerney/spamfilter.tar.gz

    -Bill Kerney
    wkerney at ucsd.edu
  • SpamAssassin - duh (Score:3, Interesting)

    by Gothmolly (148874) on Tuesday September 17, 2002 @04:55PM (#4276346)
    SpamAssassin [taint.org] works great for me. It eats about 90% of my spam, you just hack up a little procmail file for it, and you're done.

    With so many people using SpamAssassin these days, I can't see how this is a timely or newsworthy item. More like from the been-there-done-that dept.
  • by Jeffrey Baker (6191) on Tuesday September 17, 2002 @04:56PM (#4276353)
    I hacked it together in Perl, to make use of the Berkeley DB interfaces and the MIME parsing modules. Took about 30 minutes. I'm working on a C library that could be linked into mutt or pine or whatever, but I'm finding the available MIME code in C cumbersome.

    You can grab the source here [saturn5.com], but it is specific to the exact way that my mail gets delivered (via offlineimap into maildirs).

  • by mack knife (96580) on Tuesday September 17, 2002 @05:14PM (#4276489)
    sites like yahoo, hotmail, etc are in a unique position to rid their users of spam.

    i don't see why they cant implement some system that scans incoming mail for its users' mailboxes, maybe does a checksum for each message or something, and if it finds that a number of its users are receiving exactly (or nearly exactly) the same message, assume it's spam. nuke the messages, and any new incoming ones.

    yeah, if such a system only scans a small number of mailboxes, it may filter out mailing list posts and so on. but it gets more and more reliable the higher number of mailboxes it tracks.

    this avoids searching for certain keywords and eliminates false positives. after all, how well would these keyword searching methods do if i were to quote a spam message in an email to a friend?
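
    A rough Python sketch of the checksum idea described above, under stated assumptions: the function names, the whitespace and case normalization, and the `min_recipients` threshold are all hypothetical, and no mail provider is known to work this way. A plain hash only catches identical bodies after normalization, so the "nearly exactly the same" case would need a fuzzier fingerprint than this.

```python
import hashlib
from collections import defaultdict

def fingerprint(body):
    """Checksum of a message body after collapsing case and whitespace."""
    normalized = " ".join(body.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

def find_bulk_messages(deliveries, min_recipients=3):
    """Return fingerprints delivered to at least min_recipients mailboxes.

    deliveries is an iterable of (mailbox, body) pairs. Distinct mailboxes
    are counted, so one user receiving the same text twice counts once.
    """
    recipients = defaultdict(set)
    for mailbox, body in deliveries:
        recipients[fingerprint(body)].add(mailbox)
    return {fp for fp, boxes in recipients.items()
            if len(boxes) >= min_recipients}
```

    As the comment notes, a low threshold would also flag mailing list posts, so `min_recipients` would need to be far higher on a large service than the value used here.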
  • by KieranElby (315360) <kieran@dunelm.org.uk> on Tuesday September 17, 2002 @05:15PM (#4276497) Homepage
    > Is there something "built in" to these filtering techniques that can be used by spammers to effectively circumvent them?

    Yes and no.

    To defeat a bayesian filter, the spammer needs to make his email contain similar words, and combinations of words, to your genuine email, while at the same time making sure that the words used are different to those in known spam.

    So saying 'click here to make $$$' won't work any more, since most of your regular emails don't contain the word combinations 'click here' and 'make $$$', whereas known spam emails will.

    However, we're already beginning to see spammers making their emails less obviously spam.

    For example, the spammer may use an email along the lines of:

    "How's things?

    Have you seen yet?

    Don't forget to mail me those documents.

    Regards,
    A Spammer"

    Even a bayesian filter will struggle to distinguish that from:

    "Have you seen the story on slashdot yet?

    Don't forget those reports.

    Regards,
    Your Boss"
  • by kwerle (39371) <kurt@CircleW.org> on Tuesday September 17, 2002 @05:25PM (#4276593) Homepage Journal
    > So have you been retraining the system as you get more spam

    I continue to train.

    > or did you train it initially and leave it that way. How large is your training set?

    I started off with a base.

    > Details! My training set was 300 spams and 3500 not-spams.

    I started with a little more than 300 spam, and around 1000 valid messages.
    My count is now:
    Good messages read: 1194
    Bad messages read: 644

    That's because I only train on deleted mail, and I don't tend to delete my mailing lists except for once a month or 2...

    > With digrams, my filter traps 618 out of 621 spams in my spam folder, which is 99.5%

    Against my start set, I nailed about 97%, including refiling 2 false positives from my old anti-spam system as being not spam. I've noticed that the system is really good at nailing stuff it already knows about, but the learning curve is a little steep for 'new spam types'. Still, I'm pretty happy with it.
  • by XDG (39932) on Tuesday September 17, 2002 @05:33PM (#4276675) Journal
    I've implemented it in part -- my code is in perl and will flag e-mails, but I haven't worked it into a filter yet.

    My experience is that I get a few percent false negatives and about 1% false positives. I'm not seeing zero false positives, like many people are, but that probably has to do with the training sets used. Statistically speaking, you always have to trade off false negatives against false positives, so it's reasonable in my 'real world' tests.

    As a side note, everyone should test out of sample. E.g. set aside half your good e-mails and half your spam e-mails, build the filter on one half, and then test on the other half. That's the only way to get a fair test of the filter.
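
    The out-of-sample test described above might look like the following Python sketch. The function names are hypothetical, and `classify` stands for whatever filter is being evaluated (assumed here to return True for spam). It is a sketch of the procedure only, not the poster's Perl code.

```python
import random

def split_half(messages, seed=0):
    """Shuffle a corpus and split it into a training and a test half."""
    shuffled = list(messages)
    random.Random(seed).shuffle(shuffled)
    mid = len(shuffled) // 2
    return shuffled[:mid], shuffled[mid:]

def error_rates(classify, good_test, spam_test):
    """Return (false positive rate, false negative rate) on held-out mail.

    classify(message) is assumed to return True when it flags spam.
    Both test lists must be non-empty.
    """
    false_pos = sum(1 for m in good_test if classify(m))
    false_neg = sum(1 for m in spam_test if not classify(m))
    return false_pos / len(good_test), false_neg / len(spam_test)
```

    The filter would be built only on the training halves of the good and spam corpora, then scored with `error_rates` on the two test halves it has never seen.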

    For my "good" email corpus, I dumped my entire e-mail archive since 1995. That included personal e-mail, receipts from online shopping, some mailing lists, etc. The few things that get flagged as spam (a) are almost always sent in HTML format, and (b) are very short with little real content. (E.g., "Hey, looking forward to seeing you this weekend. Call me if you go out. My number is... Bye.")

    The spam corpus I took from an online resource while I build up my own. The e-mails that slip by unflagged are usually (a) short and (b) phrased like a friend making a suggestion. (E.g., "Hi, I just thought you'd be interested in hearing about this new, cool website, http://...") It seems to be close enough to a real message to slip through. Thankfully, few of them are like that.

    I'm including subject lines, from addresses, and the body so far. I'm not parsing ip addresses or html tags specially, however, just basic words using a simple perl regexp.

    Interestingly, "COLOR" is one of the most often flagged words indicating spam. HTML formatting text seems to be the biggest culprit in my false positives. I might explicitly exclude the ones that show up in good mail (e.g. from friends who use crappy e-mail programs like aol) like COLOR, FONT, FACE, etc., but leave in the ones that spammers use like TD, TR, etc.

    -XDG

  • by Wile E. Heresiarch (12248) on Wednesday September 18, 2002 @12:44AM (#4279235)
    Here are some additional references, on-line & off, about Bayesian probability.

    On the web, see: Assoc. for Uncertainty in Artificial Intelligence [auai.org] -- this is the primary conference devoted to belief networks, which are a class of graphical (in the circles and arrows sense) Bayesian probability models. There are tutorials and other papers on the main AUAI web page, and links to the last several years of conference proceedings. By the way, Heckerman and Horvitz, now doing belief networkish work at MS Research, are in the AUAI crowd.

    In print, my favorite reference is E.T. Jaynes, "Probability Theory: The Logic of Science", which is due out soon. See this web site devoted to Jaynes' work [wustl.edu] for the status. I am also fond of Castillo, Gutierrez, & Hadi, "Expert Systems and Probabilistic Network Models".

    There are a vast (well, maybe just large) number of alternative models to classify things; a good introduction is Hastie, Tibshirani, & Friedman, "Elements of Statistical Learning". Incidentally, they use spam classification to illustrate several kinds of models.

    Finally, if you're wondering what the heck is the difference between Bayesian probability and any other kind -- just google the posts in sci.stat.math; there is a Bayesian vs frequentist flame war about once a year. :^)
