More on Bayesian Spam Filtering 251

michaeld writes "The "Bayesian" techniques for spam filtering recently publicized in Paul Graham's essay A Plan for Spam don't actually seem to have anything Bayesian about them, according to Gary Robinson (an expert on collaborative filtering). They are based on a non-Bayesian probabilistic approach. It works well enough, because technology frequently doesn't have to be 100% perfect in order to do something that really needs to be done. The problem interested Robinson, and he posted his thoughts on fixing the problems in the Graham approach, including adding an actual Bayesian element to the calculations."
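
For readers who have not seen the two essays, the calculations under discussion can be sketched roughly as follows. This is a minimal illustration, not either author's code: the function names are made up here, the 0.01/0.99 clamp follows the bounds Graham describes, and the parameters `s` (strength of the prior) and `x` (assumed probability for an unseen word) are the names used for Robinson's adjustment as I understand it.

```python
from math import prod

def graham_combine(probs, lo=0.01, hi=0.99):
    """Combine per-word spam probabilities the way Graham's essay does:
    P = prod(p) / (prod(p) + prod(1 - p)).
    Probabilities are clamped to [lo, hi] so no single word forces 0 or 1."""
    clamped = [min(hi, max(lo, p)) for p in probs]
    spam = prod(clamped)
    ham = prod(1.0 - p for p in clamped)
    return spam / (spam + ham)

def robinson_adjust(p_w, n, s=1.0, x=0.5):
    """Robinson-style smoothing (hypothetical helper name): pull the observed
    probability p_w of a word toward a prior guess x when the word has only
    been seen n times. With n = 0 the result is exactly x."""
    return (s * x + n * p_w) / (s + n)

if __name__ == "__main__":
    print(graham_combine([0.9, 0.9]))   # two fairly spammy words
    print(robinson_adjust(1.0, n=1))    # a word seen once, only in spam
```

The second function is where the Bayesian element comes in: a word seen once in spam is treated as weak evidence rather than as a certainty.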
This discussion has been archived. No new comments can be posted.

  • by rbrito ( 37104 ) <rbrito@ime.COWusp.br minus herbivore> on Tuesday September 17, 2002 @04:40PM (#4276216) Homepage Journal

    The timing of this article seems impeccable, since I am myself trying to learn about Bayesian Statistics.

    I am a Computer Science student [ime.usp.br] studying Computational Biology [ime.usp.br] (more specifically, Sequence Alignments) and while I have a bit of background on Classical Statistics, I was (and still am) completely ignorant about Bayesian Statistics.

    It is only now that I'm trying to learn about Hidden Markov Models and their applications to Sequence Alignment that I finally decided to learn the basic hypotheses of Bayesian Statistics and how they differ from the hypotheses made by Classical Statistics.

    While searching for introductory material on Bayesian Statistics, I found this course page [arizona.edu], which has some nice introductory notes, including some on Bayesian Statistics.

    I hope that other people find this resource as useful as I did.

  • by DonkeyJimmy ( 599788 ) on Tuesday September 17, 2002 @04:45PM (#4276255)
    It's good that work is being done to make a good weighted spam filter.

    It's funny how bad the standard Microsoft spam filter is (the one present in Outlook). It's simply a word lookup, where if the word is present the message is marked as spam. It looks for things like "for free?". You can see the full list here [iirusa.com], near the bottom. It's a little old, but not outdated (I think you can upgrade your spam filters, but I tested these, and the ones I tested work).

    The adult filter isn't any better.
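
To make the weakness of a plain word lookup concrete, here is a small sketch of that kind of filter. It is not Outlook's actual implementation; the function name is hypothetical and the only phrase taken from the comment above is "for free?".

```python
def is_spam_by_lookup(message, phrases):
    """Flag a message as spam if any listed phrase appears anywhere in it.
    There is no weighting: one hit is enough, however innocent the context."""
    text = message.lower()
    return any(phrase.lower() in text for phrase in phrases)

# Only "for free?" comes from the comment above; a real list would be longer.
PHRASES = ["for free?"]

if __name__ == "__main__":
    print(is_spam_by_lookup("Get a new phone FOR FREE? Click now", PHRASES))
    # A legitimate question from a colleague trips the same rule:
    print(is_spam_by_lookup("Is the upgrade for free? Asking for my team.", PHRASES))
```

Both messages are flagged, which is exactly the false-positive problem a weighted filter is meant to reduce.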
  • Well... (Score:2, Informative)

    by ccarter ( 15555 ) on Tuesday September 17, 2002 @04:58PM (#4276372)
    I hate to give any kind of credit to M$ but they patented the idea of using Bayesian analysis for spam filtering circa 1995. They even had it in one of their betas. However the filters were tagging some of those fricking Blue Mountain greeting cards as spam (imagine that!) so Blue Mountain sued them on anti-competitive grounds and M$ pulled it. Blue Mountain wanted to have the spam filters universally pass Blue Mountain content but MS refused that on the grounds that if a user considers it spam then it is in fact spam to them (Hurray for the "bad guys"!). The lawsuit has been settled/dropped/died for reasons I don't know.

    Anyway I hear that the next version of MSN will have a Bayesian filter and that it will be introduced in an upcoming version of Outlook Express (no idea about Exchange and Outlook).

    BTW I believe internally MS uses this technique for spam control and that they don't seem to have any spam problems.
  • by ivan256 ( 17499 ) on Tuesday September 17, 2002 @05:02PM (#4276405)
    For once a restrictive legislation would get 99% support... you don't see that every day. Like I mentioned before, I don't get our politicians, they say they work for us, they try to find clever ways to tax us, remove control that we used to have and all, but something on which they would get unprecedented support, they are simply sitting on the issue...

    Perhaps the problem is that the law would gain them fewer votes than a few hundred thousand dollars in campaign financing would. A large portion of the population isn't online, and a large portion of those who are don't care about spam, so your politician doesn't care either.

    Since this is such a trivial technical problem to solve, it's not really a big deal either way. I daily reduce 800 spam messages to five or six that make it through to my inbox just using procmail scoring, and I haven't had a false positive in years. I spend five minutes updating my procmailrc every six months to keep it effective. I suppose that I could use an automated system to generate my score file similar to what Paul Graham described, but when I only spend ten minutes a year updating my rules, it would be a lot of years before writing all that code paid for itself. No need for sweeping legislation.
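
The idea behind hand-tuned scoring can be sketched in a few lines. This is not procmail syntax and not the poster's rules: the patterns, weights, threshold and the `example.org` address are all invented for illustration. Each matching rule adds (or subtracts) its weight, and the message is treated as spam once the total reaches a threshold.

```python
import re

# Hypothetical rules: (pattern matched against the raw message, weight).
# Negative weights act as a whitelist for trusted senders.
RULES = [
    (re.compile(r"^Subject:.*\bfree\b", re.I | re.M), 60),
    (re.compile(r"^Subject:.*!!!", re.I | re.M), 50),
    (re.compile(r"^From:.*@example\.org", re.I | re.M), -100),
]

def score(raw_message, rules=RULES):
    """Sum the weights of every rule whose pattern matches the message."""
    return sum(weight for pattern, weight in rules if pattern.search(raw_message))

def is_spam(raw_message, threshold=100, rules=RULES):
    """A message is spam when its total score reaches the threshold."""
    return score(raw_message, rules) >= threshold

if __name__ == "__main__":
    stranger = "From: a@b.com\nSubject: FREE stuff!!!\n\nBuy now"
    friend = "From: pal@example.org\nSubject: FREE stuff!!!\n\nLunch?"
    print(score(stranger), is_spam(stranger))
    print(score(friend), is_spam(friend))
```

An automated approach like Graham's replaces the hand-picked weights with values learned from a corpus of sorted mail; the scoring step itself stays much the same.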
  • by KieranElby ( 315360 ) <kieran@dunelm.org.uk> on Tuesday September 17, 2002 @05:17PM (#4276516) Homepage
    Oh, and "Bayesian" is pronounced "BAY - ZEE - UHNN".
  • Re:Why just spam? (Score:3, Informative)

    by McFly777 ( 23881 ) on Tuesday September 17, 2002 @05:23PM (#4276575) Homepage
    Easy. Just re-run the spam filter on your 'cleaned' mail using a ruleset generated by splitting the mail into topical vs. everything else.
  • Re:Why just spam? (Score:2, Informative)

    by shrikel ( 535309 ) <hlagfarjNO@SPAMgmail.com> on Tuesday September 17, 2002 @05:31PM (#4276649)
    Have you tried MS Outlook? It's got extensive rule-based sorting capability. It doesn't work for IMAP, and you mentioned IMAP later in your message, but it's not clear that that's all you're dealing with.
  • microsofts trademark (Score:3, Informative)

    by portal9 ( 518319 ) on Tuesday September 17, 2002 @06:02PM (#4276940)
    why are we even considering this method when microsoft has a trademark on it? nothing can be done.. they have a lock on it. trademark here [uspto.gov]
  • by thogard ( 43403 ) on Wednesday September 18, 2002 @03:39AM (#4279771) Homepage
    Another stupid patent? This isn't new, it's been done with spam on Usenet for years. Maybe someone should dig out the Cancelmoose's friends as prior art?
