More on Bayesian Spam Filtering

Posted by michael
from the snake-eyes dept.
michaeld writes "The "Bayesian" technique for spam filtering recently publicized in Paul Graham's essay A Plan for Spam doesn't actually seem to have anything Bayesian about it, according to Gary Robinson (an expert on collaborative filtering). It is based on a non-Bayesian probabilistic approach. It works well enough, because technology often doesn't have to be 100% perfect in order to do something that really needs to be done. The problem interested Robinson, and he posted his thoughts on fixing the problems in the Graham approach, including adding an actual Bayesian element to the calculations."
This discussion has been archived. No new comments can be posted.

  • by ajm (9538) on Tuesday September 17, 2002 @03:47PM (#4276275)
    Just out of interest what's your code written in and would you consider posting it?
  • by frovingslosh (582462) on Tuesday September 17, 2002 @03:50PM (#4276299)
    Sadly, unless you are an ISP or other mail service provider, filtering does nothing. The spammers work in volume. They count on hitting everyone to reach that .1% that will respond. That response is what they are after and what they get paid for. You likely know better than to ever deal with anyone who spams you or to ever respond to their spam. Filtering your own e-mail has absolutely no effect on the spammer; you were not going to respond anyway. By the time you filter, they have already wasted your bandwidth, and perhaps mailbox capacity and even forwarding limits from a forwarding service. Your filtering is useless, puny human!

    Here is a suggestion for something that might make an impact on spammers: If I open my firewall, I see several attempts a day from people trying to get into my mail server. Of course, I don't have a mail server, but spammers are always looking for open relay points they can spam from. My suggestion: Give them a nice open relay server they can send mail to. Of course, you don't want to piss off your service provider by sending spam, and your upstream speed might limit you to sending less than you can receive, so rather than run a full mail server, let's modify some mail server code to just accept mail and send it to the bit bucket. Maybe we can even misconfigure existing code to do this with no programming changes.

    No valid user will be affected, assuming you don't otherwise run a mail server. All that bandwidth you pay for can be used to receive e-mail from spammers before it ever goes out. Eventually their customers will see the response go from .1% to 0% and their business will dry up. This will impact spammers; blocking your own spam after it's been delivered will not.

    This need not even impact your own bandwidth. You can run the server when you are done using your system (Might make a nice screen saver - a black screen that just shows how many spammed addresses were prevented from getting spammed). Or you can impose limits on bandwidth at a firewall or router, or even restrict hours of access.

    If we set up enough different false open relay servers I think we could have a real impact on the spammers.
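
A rough sketch of the accept-and-discard idea above, in Python. It models only the SMTP conversation as a small state machine, with no socket handling. The class name `SinkSession`, the host name `sink.example`, and the recipient counter are illustrative assumptions, not code from any existing mail server.

```python
class SinkSession:
    """One SMTP conversation that accepts everything and stores nothing."""

    def __init__(self):
        self.in_data = False            # True while the message body is arriving
        self.pending_recipients = 0     # RCPT TO count for the current message
        self.discarded_recipients = 0   # running total of addresses never mailed

    def greeting(self):
        # 220 is the standard "service ready" reply sent on connect.
        return "220 sink.example SMTP"

    def handle_line(self, line):
        """Return the reply for one received line, or None if no reply is due."""
        line = line.rstrip("\r\n")
        if self.in_data:
            if line == ".":
                # End of message: claim success, but the body was never kept.
                self.in_data = False
                self.discarded_recipients += self.pending_recipients
                self.pending_recipients = 0
                return "250 OK"
            return None  # body line, silently dropped (the "bit bucket")
        verb = line.strip().upper()
        if verb.startswith(("HELO", "EHLO")):
            return "250 sink.example"
        if verb.startswith("MAIL FROM:"):
            self.pending_recipients = 0
            return "250 OK"
        if verb.startswith("RCPT TO:"):
            self.pending_recipients += 1   # accept any recipient, like an open relay
            return "250 OK"
        if verb == "DATA":
            self.in_data = True
            return "354 End data with <CR><LF>.<CR><LF>"
        if verb == "RSET":
            self.pending_recipients = 0
            return "250 OK"
        if verb == "QUIT":
            return "221 Bye"
        return "500 Command not recognized"
```

Something like `discarded_recipients` could feed the screen-saver counter mentioned above. Wiring the session to a listening socket, and any bandwidth or time-of-day limits, are left out of this sketch.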

  • by shadow303 (446306) on Tuesday September 17, 2002 @03:54PM (#4276328)
    From what I can observe from the writeup, Gary appears to be one of the "experts" that I refer to as "theory whores". Hard problems need to be tested, but some people seem to think that they can arrive at good results from an unproven theory. Anybody who has actually tested difficult problems to any extent could tell you that things don't always go as planned. An improvement which might work in theory sometimes results in disaster due to minor points that the theory does not take into account.
    Also, it bothered me that he objected to Paul's work biasing one side. It was almost like he thought it was a bug, but there was a good reason for the bias (reducing false positives). So my advice for Gary is, until you actually implement your idea, don't go trying to say that it is better than somebody else's method.
  • I'm fairly sure a false relay won't work. Just like snail mail list sellers, the spammers salt their victim lists with their own valid addresses that they can check to see if the message is getting out.

    BUT, an early spam filter at an ISP worked just like that. The design parameters were 1) that spam filtering require no more resources than actual delivery of the message, and 2) that the filter give no indication to the spammer that the message was not going to be delivered. This gives the spammer no feedback and forces THEM to waste CPU cycles, which will slow them down.
  • by Eric Seppanen (79060) on Tuesday September 17, 2002 @04:15PM (#4276496)
    Reasons why I don't use SpamAssassin:
    1. It tends to rely on blocklists, many of which have demonstrated unfair practices in the past.
    2. The more SpamAssassin is used, the more spammers will specifically avoid doing things SpamAssassin checks for.
    3. It's a gigantic heap of perl, the Write-Only (tm) language. I hate the fact that every perl program demands I mess up the package manager on my system by blindly downloading a half-dozen new modules. And it's slow!
    4. Bogofilter [sourceforge.net] is better. duh.
  • by stienman (51024) <adavis AT ubasics DOT com> on Tuesday September 17, 2002 @04:18PM (#4276525) Homepage Journal
    Interesting idea, but easy to verify. Send one thousand emails, and include a verifiable email in it. Check the email a few hours later - if it's not there, then don't use the relay.

    -Adam
  • by XDG (39932) on Tuesday September 17, 2002 @04:18PM (#4276531) Journal
    Gary is both right in some respects and irrelevant in others. Here's the key line in his article that deflates it a bit:
    It is untested as of now. It is based purely on theoretical reasoning. If anyone wants to try and test it in comparison to other techniques, I'd be very interested in hearing the outcome.
    On the other hand Paul Graham has actually tested his model and it works. I've worked it up in perl and tested it on my own data set and it works there, too. Paul acknowledges that he's being a bit fast and dirty, but the proof is in the pudding. The rest is just academic quibbling over the fine points.

    I'm not sure why this particular article needed to be posted, as it's just one of several alternative approaches and an untested one at that. On Paul's page, he also lists several published academic papers with other alternatives -- all actually tested, of course.

    Gary is basically right in questioning the use of the word "Bayesian". Paul's approach is more about weighing "evidence" as given by the appearance of certain words, rather than in figuring out the probability of spam assuming a "prior". See Paul's explanation [paulgraham.com], but if you check the article he references at the end, you'll note that the method Paul uses is only one of several methods to solve an underspecified problem. It's a reasonable guess, not necessarily the only guess.

    Looking at another article [lanl.gov] Paul references, given the word independence assumption, the more formal Naive Bayesian approach calculates as follows:
    p(spam|word1,...,wordn) = [ p(spam)*p(word1|spam)*...*p(wordn|spam) ] / [ p(spam)*p(word1|spam)*...*p(wordn|spam) + p(!spam)*p(word1|!spam)*...*p(wordn|!spam) ]

    This is similar to Paul's approach except for including a "prior" assumption of p(spam) -- the expected probability of any email being spam, calculated from the historically observed frequency of spam. By leaving it out, Paul implicitly assumes that 50% of mail is spam -- that's his "prior" estimate of the spam rate. Given the other adjustments he makes to his sample, that appears to be acceptable in practice. (Paul overweights the spam prior, but also overweights the effects of "good" words.)
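
To make the role of the prior concrete, here is a minimal Python sketch of the Naive Bayes formula quoted above. The function name and the per-word likelihoods are made-up illustrations, not values from Paul's or Gary's data.

```python
from math import prod


def naive_bayes_spam_probability(prior_spam, p_word_given_spam, p_word_given_ham):
    """Posterior p(spam | words) under the word-independence assumption.

    prior_spam        -- p(spam), the assumed base rate of spam
    p_word_given_spam -- p(word_i | spam) for each word in the message
    p_word_given_ham  -- p(word_i | !spam) for the same words, same order
    """
    spam_score = prior_spam * prod(p_word_given_spam)
    ham_score = (1.0 - prior_spam) * prod(p_word_given_ham)
    return spam_score / (spam_score + ham_score)


# Hypothetical likelihoods for a three-word message.
p_spam_words = [0.8, 0.6, 0.3]
p_ham_words = [0.1, 0.4, 0.5]

# A 50% prior cancels out of the ratio, which is the implicit assumption
# described above; a lower prior pulls the same evidence toward "not spam".
even_prior = naive_bayes_spam_probability(0.5, p_spam_words, p_ham_words)
low_prior = naive_bayes_spam_probability(0.1, p_spam_words, p_ham_words)
```

With a prior of exactly 0.5 the two prior terms cancel, so only the word likelihoods matter. The sketch does not guard against the case where both products are zero.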

    I'd personally prefer to overweight the "good" e-mails entirely rather than just put a "good-multiplier" on them like Paul does, but that's just quibbling over small bits.

    As to the bit that Gary raises about Paul assuming a spam probability for an unknown word -- Paul originally said .2, then revised to .4, but really should have put it at .5 or just excluded it from all calculations. A new word has no robustness as a predictor (which is why Paul dropped words that didn't appear five times anyway). In practice, a new word at .4 isn't going to be among the 15 most interesting words to make the calculation from, anyway.
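
The point about unknown words can be checked with a small Python sketch of a Graham-style calculation: take the words whose spam probability is furthest from .5, keep the 15 most interesting, and combine them. The function and table names are hypothetical, the combining formula is the equal-prior form discussed above, and deduplicating tokens is an assumption of this sketch.

```python
from math import prod

UNKNOWN_WORD_PROBABILITY = 0.4  # the revised value mentioned above
MOST_INTERESTING = 15           # how many words enter the calculation


def combined_spam_probability(tokens, spam_probability_by_word):
    """Combine per-word spam probabilities for the most interesting tokens."""
    # Deduplicate while keeping first-seen order (assumption of this sketch).
    unique_tokens = list(dict.fromkeys(tokens))
    probs = [
        spam_probability_by_word.get(token, UNKNOWN_WORD_PROBABILITY)
        for token in unique_tokens
    ]
    # "Interesting" means far from the neutral value 0.5.
    probs.sort(key=lambda p: abs(p - 0.5), reverse=True)
    top = probs[:MOST_INTERESTING]
    spam_side = prod(top)
    ham_side = prod(1.0 - p for p in top)
    return spam_side / (spam_side + ham_side)


# Hypothetical table: fifteen strongly spammy words.
table = {f"word{i}": 0.99 for i in range(15)}
known_only = combined_spam_probability(list(table), table)
with_unseen = combined_spam_probability(list(table) + ["neverseen"], table)
```

Here `known_only` and `with_unseen` come out identical, because the unknown word at .4 is only 0.1 away from neutral and gets crowded out. That holds only when the message already has at least 15 words further from .5. In a short message the .4 default would take part in the calculation.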

    -XDG

  • by Eric Seppanen (79060) on Tuesday September 17, 2002 @04:58PM (#4276911)
    You might want to consider collaborating with the group [sourceforge.net] working on bogofilter, which is basically the same thing, done in C.
  • by Anonymous Coward on Tuesday September 17, 2002 @05:49PM (#4277353)
    Paul called his method Bayesian. It wasn't. In addition to just pointing out that Paul was wrong, Gary also outlined how one might take a Bayesian approach to the problem.

    He also showed how his extended solution included Paul's as a special case.

    It sounds like you frequently get terminology wrong, and when someone points out that you're using the term incorrectly, and further shows how you could actually apply what you were talking about to the problem at hand, you go off on them for being a "theory whore." You're the winner of today's "Slashdot personified" award. Congratulations!
  • by Anonymous Coward on Tuesday September 17, 2002 @06:37PM (#4277758)
    Check out the Spambayes project in SourceForge. They are working mostly in Python (python.org), and have a large collection of "spam fodder" and "ham fodder" to work with.

    I'm sure if you've done some code like Jeff Baker has, you would be welcome to participate in the group. They have a CVS library already in SourceForge.
