
More on Bayesian Spam Filtering

Posted by michael
from the snake-eyes dept.
michaeld writes "The "Bayesian" technique for spam filtering recently publicized in Paul Graham's essay A Plan for Spam doesn't actually seem to have anything Bayesian about it, according to Gary Robinson (an expert on collaborative filtering). It is based on a non-Bayesian probabilistic approach. It works well enough, because technology frequently doesn't have to be 100% perfect in order to do something that really needs to be done. The problem interested Robinson, and he posted his thoughts on fixing the problems in the Graham approach, including adding an actual Bayesian element to the calculations."
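For context, the scoring step at the heart of Graham's essay can be sketched in a few lines. This is a minimal illustration, assuming per-token spam probabilities have already been estimated from training corpora; the n=15 cutoff follows the essay's description, but the function name and code are illustrative, not Graham's or Robinson's.

```python
def spam_probability(token_probs, n=15):
    """Combine the n token probabilities furthest from 0.5 using the
    product formula from Graham's essay (the step Robinson argues is
    not actually Bayesian)."""
    extreme = sorted(token_probs, key=lambda p: abs(p - 0.5), reverse=True)[:n]
    prod = 1.0
    inv_prod = 1.0
    for p in extreme:
        prod *= p
        inv_prod *= 1.0 - p
    return prod / (prod + inv_prod)

# Two strongly spammy tokens outweigh one strongly innocent one:
print(spam_probability([0.99, 0.99, 0.01]))  # -> 0.99 (approximately)
```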
This discussion has been archived. No new comments can be posted.


  • by Anonymous Coward
    kill 'em. might = right
  • Well I guess spam comes in different size tins sometimes, and with different labels so you can tell the spam apart. I like Hot and Spicy Spam. Mmmm.

    Of course, the 1% of non-spam that accidentally gets filtered out is just collateral damage (except it's normally something really important like a tin of processed peas or something).

    I'm going to sit down now and take some more HGH.

  • by sstory (538486)
    Someone came up with this idea recently, and I like it, so I've been repeating it. Instead of outlawing spam (which I would love if it worked, but it won't), require spammers to indicate the nature of the email (anonymous, commercial, and so on) with a word or such in the subject line, which individual recipients can then filter according to their desires. It would not be as free-speech-limiting as banning spam, and spam would die out from ineffectiveness once most everyone filtered it.
    • yeah, they're already supposed to indicate if it's pr0n spam by specifying in the subject, "hot sluts inside" or whatever, but they usually don't, and there isn't a really good way to enforce this. 9 times out of 10 when I get an e-mail from "Joey" (I know 3) and the subject is "hey" or "how's it going" or something, the actual e-mail is pr0n
  • Why is such a simple problem, one that pisses off 99.9% of the population, so hard to manage on a global scale? I mean, EVERYONE is pissed off at getting spammed; everyone would LOVE legislation to sodomize the local spammer with a baseball bat. Overseas is a different problem, but country/continent-wide spam is half of my problem and could easily be taken care of with proper legislation. For once a restrictive law would get 99% support... you don't see that every day. Like I mentioned before, I don't get our politicians: they say they work for us, they find clever ways to tax us and remove control that we used to have, but on something where they would get unprecedented support, they are simply sitting on the issue...

    Until politicians are fed up and people actually get SUED for spamming (for once you could have a good reason to sue real bad guys), nothing will change.

    Yes, I know in SOME states it's beginning, so I think legislation for local spam will make its way through a few years from now, and we'll be able to look in our mailboxes without TD Waterhouse spamming us when we already have an account with them, etc.

    The other problem is overseas spamming, especially coming from China/Taiwan. I mean, I don't read Chinese, and I don't plan on buying that #.#" something overseas, so why do they spam us like that? I never get it, but I'd be all for passive euthanasia (i.e., ban their IPs at the router level), and if this is bad for business or relations or whatever, well, MAYBE they will do something about it.

    Here where I work, it's simple: one spam, and I ban a whole address class straight off the servers. If one day I get a call because someone couldn't reach us (if they really need to reach us, we have a phone anyway!), I'll be sure to tell him why. Too bad this is not happening at the backbone level, because then some people would get their act together fast and legislation could be applied globally.

    • For once a restrictive law would get 99% support... you don't see that every day. Like I mentioned before, I don't get our politicians: they say they work for us, they find clever ways to tax us and remove control that we used to have, but on something where they would get unprecedented support, they are simply sitting on the issue...

      Perhaps the problem is that the law would gain them fewer votes than a few hundred thousand dollars in campaign financing would. A large portion of the population isn't online, and a large portion of those who are don't care about spam, so your politician doesn't care either.

      Since this is such a trivial technical problem to solve, it's not really a big deal either way. Every day I reduce 800 spam messages to the five or six that make it through to my inbox just using procmail scoring, and I haven't had a false positive in years. I spend five minutes every six months updating my procmail score file to keep it effective. I suppose I could use an automated system to generate my score file, similar to what Paul Graham described, but when I only spend ten minutes a year updating my rules, it would take a lot of years before writing all that code paid off. No need for sweeping legislation.
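For readers who haven't used procmail scoring: a recipe of the kind the parent describes looks roughly like this. The weights and patterns below are illustrative guesses, not the poster's actual rules; see the procmailsc man page for the `weight^exponent` condition syntax.

```
# Hypothetical scoring recipe: start at -100, add weight per matching
# condition, and file as junk only if the total ends up positive.
:0 HB
* -100^0
*  120^0 ^Subject:.*(viagra|mortgage|enlarge)
*   60^0 ^Content-Type:.*text/html
*   50^0 (CLICK HERE|OPT.?OUT|UNSUBSCRIBE)
junkmail
```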
    • While in many respects I agree that "There oughta be a Law" against spam, there are some problems with that approach. Not the least is that a social solution is generally much better (or at least has fewer side effects) than any law a government will enact.

      Laws have the distinct problem of either going too far (false positive) or being too weak and thereby legitimizing the spam that would manage to work through the loopholes. Taken to the extreme that seems to commonly occur in the US legal system, I can envision spammers suing ISPs for blacklisting their "legit per US act ####" spam.

      I would much rather use statistical methods such as those being discussed. Combined with "whitelist" methods, these seem to work very well by all accounts.

  • by rbrito (37104) <> on Tuesday September 17, 2002 @04:40PM (#4276216) Homepage Journal

    The timing of this article seems impeccable, since I am myself trying to learn about Bayesian Statistics.

    I am a Computer Science student [] studying Computational Biology [] (more specifically, Sequence Alignments), and while I have a bit of background in Classical Statistics, I was (and still am) completely ignorant about Bayesian Statistics.

    It is only now, trying to learn about Hidden Markov Models and their applications to Sequence Alignment, that I finally decided to learn the basic hypotheses behind Bayesian Statistics and how they differ from the hypotheses made by Classical Statistics.

    During my searches for introductory material on Bayesian Statistics, I found this course page [], which has some nice introductory notes, including notes on Bayesian Statistics.

    I hope that other people find this resource as useful as I did.

    • Here are some additional references, on-line & off, about Bayesian probability.

      On the web, see: Assoc. for Uncertainty in Artificial Intelligence [] -- this is the primary conference devoted to belief networks, which are a class of graphical (in the circles and arrows sense) Bayesian probability models. There are tutorials and other papers on the main AUAI web page, and links to the last several years of conference proceedings. By the way, Heckerman and Horvitz, now doing belief networkish work at MS Research, are in the AUAI crowd.

      In print, my favorite reference is E.T. Jaynes, "Probability Theory: The Logic of Science", which is due out soon. See this web site devoted to Jaynes' work [] for the status. I am also fond of Castillo, Gutierrez, & Hadi, "Expert Systems and Probabilistic Network Models".

      There are a vast (well, maybe just large) number of alternative models to classify things; a good introduction is Hastie, Tibshirani, & Friedman, "Elements of Statistical Learning". Incidentally, they use spam classification to illustrate several kinds of models.

      Finally, if you're wondering what the heck is the difference between Bayesian probability and any other kind -- just google the posts in sci.stat.math; there is a Bayesian vs frequentist flame war about once a year. :^)

  • by Jeffrey Baker (6191) on Tuesday September 17, 2002 @04:40PM (#4276220)
    I'd like to hear the results of anyone who has implemented one of these probabilistic filtering systems. I implemented a modified version of Paul Graham's system, and so far it kicks ass. It has trapped over 600 spams without any false positives. I receive almost 100 spams a day, and over the last week I have generally only had to delete one or two by hand. The rest go directly to jail.

    I'd like to hear about modifications to this system. I removed Graham's doubling of "good" word frequencies, and I trained my filter using digrams. I also tried all the various methods supplied by the program "rainbow", with good results, but the implementation was too slow and clunky to place in the middle of my email delivery system. What are other possible modifications?
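Digram training, as mentioned above, just means using adjacent word pairs as features instead of (or alongside) single words. A minimal sketch with an assumed tokenizer; this is not the poster's actual Perl code:

```python
import re

def digrams(text):
    """Split text into lowercase words, then emit each adjacent pair."""
    words = re.findall(r"[a-z0-9$'!-]+", text.lower())
    return [f"{a} {b}" for a, b in zip(words, words[1:])]

print(digrams("Click here for free money"))
# -> ['click here', 'here for', 'for free', 'free money']
```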

    • by ajm (9538)
      Just out of interest what's your code written in and would you consider posting it?
      • by Jeffrey Baker (6191) on Tuesday September 17, 2002 @04:56PM (#4276353)
        I hacked it together in Perl, to make use of the Berkeley DB interfaces and the MIME parsing modules. Took about 30 minutes. I'm working on a C library that could be linked into mutt or pine or whatever, but I'm finding the available MIME code in C cumbersome.

        You can grab the source here [], but it is specific to the exact way that my mail gets delivered (via offlineimap into maildirs).

    • by kwerle (39371)
      I implemented Paul's system without the changes you mentioned, and am seeing >95% success (and climbing). 0 false positives. I will be submitting it to sourceforge this week.
      • So have you been retraining the system as you get more spam, or did you train it initially and leave it that way. How large is your training set?

        Details! My training set was 300 spams and 3500 not-spams. With digrams, my filter traps 618 out of 621 spams in my spam folder, which is 99.5%.

        • by kwerle (39371)
          So have you been retraining the system as you get more spam

          I continue to train.

          or did you train it initially and leave it that way. How large is your training set?

          I started off with a base.

          Details! My training set was 300 spams and 3500 not-spams.

          I started with a little more than 300 spam, and around 1000 valid messages.
          My count is now:
          Good messages read: 1194
          Bad messages read: 644

          That's because I only train on deleted mail, and I don't tend to delete my mailing lists except for once a month or 2...

          With digrams, my filter traps 618 out of 621 spams in my spam folder, which is 99.5%

          Against my start set, I nailed about 97%, including refiling 2 false positives from my old anti-spam system as being not spam. I've noticed that the system is really good at nailing stuff it already knows about, but the learning curve is a little steep for 'new spam types'. Still, I'm pretty happy with it.
    • I don't know of any filtering software that I can download to implement this system. Anyone have a link?
    • In addition, is anyone running probabilistic filtering on a large system with lots of users? Say, 1,000,000 messages a day? I'd be curious to know how you do it while keeping your load down on your mail machines.
      • One person's spam is another person's 'useful email'. For instance, I may want a particular type of email (eg: a pr0n mailing list, or a "George Foreman Grill" user group, or lots of Korean friends). It might be considered spam by the ISP's filters, but not by me.

        That's why it's best to train _my_ filter against _my_ received mail.

        And as more email gets received and I add the uncaught messages to the spam filter, my filter 'learns' what I consider spam.
    • by XDG (39932)
      I've implemented it in part -- my code is in perl and will flag e-mails, but I haven't worked it into a filter yet.

      My experience is that I get a few percent false negatives and about 1% false positives. I'm not seeing zero false positives like many people are, but that probably has to do with the training sets used. Statistically speaking, you always have to trade off false negatives against false positives, so that seems reasonable in my 'real world' tests.

      As a side note, everyone should test out of sample. E.g. set aside half your good e-mails and half your spam e-mails, build the filter on one half, and then test on the other half. That's the only way to get a fair test of the filter.

      For my "good" email corpus, I dumped my entire e-mail archive since 1995. That included personal e-mail, receipts from online shopping, some mailing lists, etc. The few things that get flagged as spam (a) are almost always sent in HTML format, and (b) very short with little real content. (E.g., "Hey, looking forward to seeing you this weekend. Call me if you go out. My number is... Bye.")

      The spam corpus I took from an online resource while I build up my own. The e-mails that slip by unflagged are usually (a) short and (b) phrased like a friend making a suggestion. (E.g., "Hi, I just thought you'd be interested in hearing about this new, cool website, http://...") It seems to be close enough to a real message to slip through. Thankfully, few of them are like that.

      I'm including subject lines, from addresses, and the body so far. I'm not parsing ip addresses or html tags specially, however, just basic words using a simple perl regexp.

      Interestingly, "COLOR" is one of the most often flagged words indicating spam. HTML formatting text seems to be the biggest culprit in my false positives. I might explicitly exclude the HTML words that show up in good mail (e.g. from friends who use crappy e-mail programs like AOL), like COLOR, FONT, FACE, etc., but leave in the ones that spammers use, like TD, TR, etc.
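The out-of-sample test described a few paragraphs up can be sketched generically. `train` and `classify` are stand-ins for whatever filter is being evaluated; only the split-and-measure harness is the point here:

```python
import random

def holdout_eval(good, spam, train, classify, seed=0):
    """Train on half of each corpus, measure error rates on the other half.
    classify(model, msg) should return True for 'spam'."""
    rng = random.Random(seed)
    good, spam = list(good), list(spam)
    rng.shuffle(good)
    rng.shuffle(spam)
    g_tr, g_te = good[: len(good) // 2], good[len(good) // 2 :]
    s_tr, s_te = spam[: len(spam) // 2], spam[len(spam) // 2 :]
    model = train(g_tr, s_tr)
    false_pos = sum(classify(model, m) for m in g_te) / max(len(g_te), 1)
    false_neg = sum(not classify(model, m) for m in s_te) / max(len(s_te), 1)
    return false_pos, false_neg
```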


    • I'm surprised to only see one link to Bogofilter [] in this discussion. I started using it just a couple days ago, training it from scratch (because I am patient and lazy). I train it on all emails it doesn't mark as spam, and then retrain it on spam it misses. So far I'm up to catching about 50%-75% of spams (climbing rapidly), with one false positive (though I had to read through that email a couple times to realize it wasn't really a spam -- so the false positive is understandable, since as a human I could have made the same mistake).

      Bogofilter potentially captures more relevant words than described by Graham -- IP addresses, email addresses, and other text that should be considered atomic are recorded atomically. I think even more could be done with this -- but I worry that bogofilter is going to create too large a database, as it even seems to be keeping track of words like "$20".

      As an optimization, I could imagine you could double-register some words, mostly those in headers. So the word "mother" in a subject line might register both "mother" and "subject:mother". Perhaps IP addresses could be recorded with all their classes (e.g., "" would be recorded as "", "200.69.228", "200.69" and "200" -- maybe prefixing some text to the last three, so that "200" the number is distinguishable from "200" the class-A address).

      Ultimately, a well trained spam database could be trimmed and distributed, but bogofilter does not yet include such a database. Graham's concern about distribution and trust are, IMHO, not entirely necessary -- a well-trained database can be created by only a handful of people (who receive lots of spam), and even if non-spam must be classified on an individual basis, spam is not tailored to any individual (nearly by definition). I don't think this has as great a risk of censorship as blocking lists.

      I would be interested to see an improvement in the client end of bogofilter (or similar software). Right now I'm using procmail, and forwarding miscategorizations back to myself with a changed subject line (which procmail catches and feeds to bogofilter). With just a little work, this could be used to create filters besides spam, where I train bogofilter to filter on other criteria. (Well, I can do this right now, but it would take only a little work to make this accessible even to computer novices.)
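The double-registration idea a couple of paragraphs up is easy to sketch: header words get a prefixed variant alongside the plain token, and IP addresses get a token per prefix length, with a marker so the octet "200" stays distinct from the word "200". The `ip:` prefix and the example address are my own illustration, not bogofilter's actual behavior:

```python
def expand_token(word, header=None):
    """Register a word, plus a header-qualified variant when applicable."""
    tokens = [word]
    if header:
        tokens.append(f"{header}:{word}")
    return tokens

def ip_tokens(ip):
    """Register an IP address at every prefix length."""
    parts = ip.split(".")
    return [f"ip:{'.'.join(parts[:i])}" for i in range(1, len(parts) + 1)]

print(expand_token("mother", header="subject"))
# -> ['mother', 'subject:mother']
print(ip_tokens("200.69.228.1"))
# -> ['ip:200', 'ip:200.69', 'ip:200.69.228', 'ip:200.69.228.1']
```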

    • Lots of implementations mentioned in this thread, but does anyone know of an implementation for the most widely used e-mail clients under Linux/BSD: KMail, Evolution and Mozilla?

      TIA for any links.

      Bye egghat.
  • by ajm (9538) on Tuesday September 17, 2002 @04:41PM (#4276223)
    ...in the eating. I think the same applies to spam. Paul showed, to his satisfaction, that the technique he used worked for his samples. Gary proposes some changes that would improve the filter's accuracy, but does not test these theories.

    We will now have many slashdot posts saying "I've not tested this but I think A (or B, or C, or X)"

    Here's where the scientific method comes into its own. Anyone who cares enough can actually test and post their results. I'd be interested in seeing what they look like. I don't have a database of spam to test against (and please don't volunteer to sign me up for some :) but it would be interesting to see whether what looks convincing in theory pays off in practice.
    • From what I can observe from the writeup, Gary appears to be one of the "experts" that I refer to as "theory whores". Hard problems need to be tested, but some people seem to think that they can arrive at good results from an unproven theory. Anybody who has actually tested difficult problems to any extent could tell you that things don't always go as planned. An improvement which might work in theory sometimes results in disaster due to minor points that the theory does not take into account.
      Also, it bothered me that he objected to Paul's work biasing one side. It was almost as if he thought it was a bug, but there was a good reason for the biasing (reducing false positives). So my advice for Gary is: until you actually implement your idea, don't go trying to say that it is better than somebody else's method.
      • That's the feeling I got as well. The changes Gary suggests might be good or they might be bad. What I liked about Paul's write-up was that he determined practically that it worked for him. Sure, the theory may not be perfect, but, to use a broad analogy, you don't need general relativity to work out how fast a ball rolls down a slope; Newtonian theory works fine. It may not be worth it to get the extra .1% of accuracy, especially if, as you point out, it increases false positives. Only testing will tell.
      • by Anonymous Coward
        Paul called his method Bayesian. It wasn't. In addition to just pointing out that Paul was wrong, Gary also outlined how one might take a Bayesian approach to the problem.

        He also showed how his extended solution included Paul's as a special case.

        It sounds like you frequently get terminology wrong, and when someone points out that you're using the term incorrectly, and further shows how you could actually apply what you were talking about to the problem at hand, you go off on them for being a "theory whore." You're the winner of today's "Slashdot personified" award. Congratulations!
    • You, sir, have made my day; if I have to hear one more chucklehead say "The proof is in the pudding" I will not be held accountable for my actions.
  • by saskboy (600063) on Tuesday September 17, 2002 @04:41PM (#4276225) Homepage Journal
    I have some tricks for Hotmail users who cannot benefit from the technique above:
    Filter any message without the @ in the address.
    Filter Britney, Boobs, Penis, Inches, WIN, ___ ..... and your own email address userid.
    Now you only have about 40 spams a day to deal with instead of 100.
    Uncheck your information from being in the MSN directory too.

    Enjoy :-)
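The hand-built rules above amount to a tiny keyword filter. A sketch in Python, using the comment's own word list; the function shape is illustrative, not anything Hotmail actually exposes:

```python
SPAM_WORDS = {"britney", "boobs", "penis", "inches", "win"}

def crude_filter(subject, to_address, my_userid):
    """Return 'spam' or 'inbox' using the rules from the comment above."""
    subj = subject.lower()
    if "@" not in to_address:              # no @ in the address
        return "spam"
    if my_userid.lower() in subj:          # your own userid in the subject
        return "spam"
    if any(w in subj.split() for w in SPAM_WORDS):
        return "spam"
    return "inbox"

print(crude_filter("WIN money now", "a@b.com", "joe"))   # -> spam
print(crude_filter("lunch plans", "a@b.com", "joe"))     # -> inbox
```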
  • by DonkeyJimmy (599788) on Tuesday September 17, 2002 @04:45PM (#4276255)
    It's good that work is being done to make a good weighted spam filter.

    It's funny how bad the standard Microsoft spam filter is (the one present in outlook). It's simply a word lookup, where if the word is present the message is marked as spam. It looks for things like "for free?". You can see the full list here [], near the bottom. It's a little old, but not outdated (I think you can upgrade your spam filters, but I tested these, and the ones I tested work).

    The adult filter isn't any better.
  • by Anonymous Coward
    Finally it is worth mentioning that if you really want to go a 100% Bayesian route, try Charles Elkan's paper, "Naive Bayesian Learning". That approach should work very well but is a good deal more complicated than the approach described above.
    Here is the article []
  • Let's see (Score:5, Funny)

    by sam_handelman (519767) <`ude.aibmuloc' `ta' `3002hks'> on Tuesday September 17, 2002 @04:49PM (#4276287) Homepage Journal
    P (This is spam) = P (This is Spam | It will enlarge my penis) * P (It will enlarge my penis)

    Now, given that I have prior knowledge that:
    P (It will enlarge my penis)

    is very low,

    and given that, having never encountered anything which enlarges my penis in any permanent way, I have no knowledge of
    P (This is Spam | It will enlarge my penis)

    and we have the product of one probability which I know is low, and another of which I have no posterior knowledge, so we conclude that P (It is Spam) is also low, and that I must have requested more information on their new penile enlargement technique.

    So, that message goes into the keepers.


    P (It is Spam) = P (It is Spam | Frank is getting married) * P (Frank is getting married)

    So, I know Frank is getting married. Since he sent me this e-mail I'm considering filtering as Spam, and whether or not it is spam is pretty much independent of whether or not Frank is getting married, so.... it's Spam. Away it goes.

    P.S. I've deliberately made a hash of this for a joke. The actual rule is:

    P (A & B) = P (A | B) * P (B)
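Written out in standard notation, the rule above is the product rule; combined with its mirror image it gives Bayes' theorem, which is what you would actually apply to a message containing some word w:

```latex
P(A \cap B) = P(A \mid B)\,P(B) = P(B \mid A)\,P(A)
\quad\Longrightarrow\quad
P(\mathrm{spam} \mid w) = \frac{P(w \mid \mathrm{spam})\,P(\mathrm{spam})}{P(w)}
```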
  • by Anonymous Coward
    Is this what the new Mail.app in Mac OS X 10.2 uses?

    I, myself, am not sure, but the new Mail.app is smart and it does learn. After a week of "learning" it has correctly classified messages as spam more than 99 times out of 100.

    • I've seen the same effect. I almost never see spam anymore, and get almost no false positives. I wish I could find the quote, but one of the Jaguar engineers confirmed that there's some "serious math" behind Mail.app's spam handling. I can't find any technical reference about the algorithms being used, though.
  • by frovingslosh (582462) on Tuesday September 17, 2002 @04:50PM (#4276299)
    Sadly, unless you are an ISP or other mail service provider, filtering does nothing. The spammers work in volume. They count on hitting everyone to reach the .1% that will respond. That response is what they are after and what they get paid for. You likely know better than to ever deal with anyone who spams you or to ever respond to their spam. Filtering your own e-mail has absolutely no effect on the spammer; you were not going to respond anyway. By the time you filter, they have already wasted your bandwidth, and perhaps mailbox capacity and even forwarding limits from a forwarding service. Your filtering is useless, puny human!

    Here is a suggestion for something that might make an impact on spammers: if I open my firewall, I see several attempts a day from people trying to get into my mail server. Of course, I don't have a mail server, but spammers are always looking for open relay points they can spam from. My suggestion: give them a nice open relay server they can send mail to. Of course, you don't want to piss off your service provider by sending spam, and your upstream speed might limit you to less than you can receive, so rather than run a full mail server, let's modify some mail server code to just accept mail and send it to the bit bucket. Maybe we can even misconfigure existing code to do this with no programming changes.

    No valid user will be affected, assuming you don't otherwise run a mail server. All that bandwidth you pay for can be used to receive e-mail from spammers before it ever goes out. Eventually their customers will see the response go from .1% to 0% and their business will dry up. This will impact spammers, blocking your own spam after it's been delivered will not.

    This need not even impact your own bandwidth. You can run the server when you are done using your system (might make a nice screen saver: a black screen that just shows how many spammed addresses were prevented from getting spammed). Or you can impose limits on bandwidth at a firewall or router, or even restrict hours of access.

    If we set up enough different false open relay servers I think we could have a real impact on the spammers.

    • This need not even impact your own bandwidth.

      Last week (I can't find the article yet), Slashdot had a link to a column by someone who was (in his opinion) unjustly blacklisted for hosting an easily-accessible mail server. The moment his name hit that blacklist, he became a target for what may as well be every spammer on the planet. Even though he didn't actually have an open relay (just an easily-guessed password), the incoming traffic from so many e-mail spammers effectively brought his server to its knees. Changing his domain name and IP address was the only cure.

      Building a "honeypot" mail server for spammers is appealing, but could be more trouble than it's worth, especially since it's more or less irreversible. I'd advise against it.
      • He was running an open relay. He was too ignorant to know it.
      • I've always wondered what it would take to modify a sendmail or postfix configuration to become a "mail sink". Sure, there are tarpits that slow spammers down, but why not make a server that acts and smells like an open relay, but simply dumps the mail to /dev/null and tells the sender it was delivered? Bandwidth might be an issue, but it may be more effective than a tar pit.

        A human watching over his spam software might notice if the target relay is delivering at a rate of 1 message per day and find another. If, however, he sees that the "server" is ripping through deliveries at a massive rate, he might stay with that server and all of his spam will vanish into the bit bucket.
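A mail sink of the kind described needs to speak only a handful of SMTP verbs convincingly. Here is a toy, single-connection protocol responder; the reply strings and hostname are made up, and a real deployment would need sockets, timeouts, and abuse safeguards:

```python
def smtp_sink_response(line, state):
    """Given one client line and the state ('cmd' or 'data'),
    return (reply_or_None, new_state). Everything is 'accepted';
    message bodies are simply discarded."""
    if state == "data":
        if line == ".":                      # end of message body
            return "250 Ok: queued (really /dev/null)", "cmd"
        return None, "data"                  # swallow body lines silently
    verb = line.split(" ", 1)[0].upper()
    if verb in ("HELO", "EHLO"):
        return "250 sink.example.com", "cmd"
    if verb in ("MAIL", "RCPT"):
        return "250 Ok", "cmd"               # accept any sender/recipient
    if verb == "DATA":
        return "354 End data with <CRLF>.<CRLF>", "data"
    if verb == "QUIT":
        return "221 Bye", "cmd"
    return "250 Ok", "cmd"
```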

        • ...and into a Bayesian filter mangler, providing it with a diet of 100% unadulterated spam. The filter can then be distributed a la virus updates...
        • Google and Googlegroups for spam honeypot.

          And search Slashdot too. I think there was an article about a Russian honeypot a few months ago. Because of bandwidth costs, they "throttled down" their honeypot to reduce the truly huge number of hits from clueless spammers. (But I repeat myself..)

          There are arguments both ways about relay honeypots. The downside is that you have to let some relay tests go through, so that when the spammer tests the relay, his tests succeed. But then, when the actual spam run happens, you have to choke it off completely.

      • Actually, he did have an open relay, he just wants to hide behind a lame claim that it wasn't open because the spammer had to lie to use it! Imagine that, a spammer lying. He was a lawyer, and we know they never lie ;-). IMHO he got a lot less punishment than he deserved.

        And what was the reported problem he cried about? Not an overload on his network, that was not his complaint. But his domain name being blacklisted. With good reason, IMHO. He was running a server that spammers used, and could even see this when the people he invited to test his system got right in. He then claimed they misused his system because they gave a false name and suggested he should sue them!

        Maybe this guy was just too stupid to block a port on an incoming firewall to keep the outside mail server users out. It seems unlikely though, particularly if he had the ability to set up a mail server (supposedly for the use of his own local network). It sounded more to me like there was a good chance he knew exactly what he was doing and wanted to set up a server for spamming, and was blowing smoke when he got black holed.

        Getting black holed will not be a problem for a dummy server that never actually sends mail (the black hole people are not out there port scanning like the spammers are). Even if your dummy mail server were to be blacklisted, so what? That in no way would affect your normal e-mail that you send through your service provider.

    • Interesting idea, but easy to verify. Send one thousand emails, and include a verifiable email in it. Check the email a few hours later - if it's not there, then don't use the relay.

      • Interesting idea, but easy to verify. Send one thousand emails, and include a verifiable email in it. Check the email a few hours later - if it's not there, then don't use the relay.

        Sure, they might test it. Still seems better than doing nothing. If a spammer passes me 1000 pieces of mail and waits a few hours, that's 999 pieces that didn't go out and a few hours of his time. If only I do this it will have little impact, but if the slashdot effect kicked in and there were so many false servers that it kept happening to him over and over again, that would be sweet!

        And of course, some spammers will be lazy and not test. Jackpot!

        Of course, the servers should look different. Some Linux, some Windows, some something else. Claim to be different applications. We might even start building smarts into the servers (if you get only one email, and it's going to an address that is likely a test address -- his own domain, a mailbox service like Hotmail, or a local ISP that serves the same area his packets came from -- wait one minute and then send it on). The worst that can happen is your false relay gets blacklisted (not a problem).

        The bottom line is, which will have any impact on spammers, a lot of false relays out there that discard their e-mail destined for victims that keep the system going, or filtering e-mail that you were never going to read anyway?

    • Filtering your own e-mail has absolutely no effect on the spammer, you were not going to respond anyway.

      You're missing the point. I could care less what the spammer does. The benefit is that with a good filter, I don't have to look at spam. Currently I spend maybe 15 minutes a day recognizing and deleting spam emails, and occasionally screw it up and delete something important by mistake. If a filter program can reduce that load, it's useful to me even if it doesn't stop the spammer from spamming.

      And in any case, in a year or two, when such intelligent filters are a standard feature on AOL and Outlook and etc, the spammer's "hit rates" will likely drop dramatically, at which point they will have less incentive to spam.

    •'s approach [] seems like a good implementation of this.

      --Phil (Sadly, I'm on a cable modem, so I don't have the bandwidth for this.) Gregory
      • No. It's another good thing that can be done, but it's not what I'm advocating. Basically he has set up a mail server expecting to get mail for his own addresses. He then wastes as much time as he can of the open relay the spammer is using. This at least slows things down for the spammer, and he might just get the attention of someone in charge of the server. Another good thing to do; I think we need to do lots of things like this to stay ahead of the spammers. The dummy open relay would be another, different tool. Rather than slow down the connection, it should take as much spam as it can, so that it doesn't go elsewhere and so that the paying client of the spammer will eventually see that he is getting no results.
    • I've written a relay honeypot in Java. It's a real relay, that relays messages only if:
      • there are fewer than a configurable number of recipients;
      • the relay request arrived no less than a configurable number of seconds after the previous request.

      It can bounce messages addressed to the local machine, in case the spammer checks for bounces (buggy, at the moment).

      It whitelists relay test-addresses, as specified by its operator, and relays to those addresses even if it thinks it's in a spam-run. It adds any address to which it relays to its whitelist (i.e. it collects relay-test addresses).

      It also posts all the data it collects to a website, which it can serve itself (i.e., it's a webserver too).

      It has quite a number of other frills (not all of which are documented yet - it's still in test, but it's getting more stable every day).

      A valid objection to a honeypot that relays test messages is that it is sending spam, and there is a risk of the program being subverted by a spammer. Honeypotting this way isn't for children; you could get complaints for running this program.

      Having said that, you can download the current Beta build at My site []. (Damn, how do you get rid of that crap in square brackets???) It's highly configurable, but it runs out-of-the-box on Win NT/2K/ME systems (it needs a JVM, of course).


  • by ShakaUVM (157947) on Tuesday September 17, 2002 @04:53PM (#4276323) Homepage Journal
    At UCSD, Bob Boyer and I wrote a neural net spam filter. Neural nets, as everyone knows, are not really like biological brains, but are really just statistical engines, similar to the approach the guy above described.

    Our approach worked pretty well (95-97% accuracy), and we had to deal with the same issues that the above "Bayesian" approach did, i.e., weighting the neurons so that false positives occur much less frequently than false negatives, etc. We built it using data on spam collected from the UCI machine learning repository.

    It ties in with procmail. I'm not really a windows guy, so if anyone knows how to put a filter between an IMAP server and Microsoft Outlook/Netscape Communicator, I'd be interested in hearing how it's done.

    The README for it is at:
    And you can download it at: ar.gz

    -Bill Kerney
    wkerney at
    • The thing that struck me here is that you chose the 57 attributes up front and determine the value of these attributes for each spam. These values are then the input to the neural net.

      How did you arrive at these attributes? Are there any others you considered but didn't include?

      Is there any way a neural network, or something else perhaps, could be used to determine other, less-obvious attributes? For example, Paul's filter found that the color #ff0000 (bright red) was a strong indicator of spam. That is the value of an attribute (value = red, attribute = color), and it's the sort of unanticipated tell-tale sign of spam I'm referring to, except I wonder if there are unanticipated attributes to be found as well.
  • SpamAssassin - duh (Score:3, Interesting)

    by Gothmolly (148874) on Tuesday September 17, 2002 @04:55PM (#4276346)
    SpamAssassin [] works great for me. It eats about 90% of my spam, you just hack up a little procmail file for it, and you're done.

    With so many people using SpamAssassin these days, I can't see how this is a timely or newsworthy item. More like from the been-there-done-that dept.
    • Reasons why I don't use SpamAssassin:
      1. It tends to rely on blocklists, many of which have demonstrated unfair practices in the past.
      2. The more SpamAssassin is used, the more spammers will specifically avoid doing things SpamAssassin checks for.
      3. It's a gigantic heap of perl, the Write-Only (tm) language. I hate the fact that every perl program demands I mess up the package manager on my system by blindly downloading a half-dozen new modules. And it's slow!
      4. Bogofilter [] is better. duh.
      • It tends to rely on blocklists, many of which have demonstrated unfair practices in the past.

        True. SpamAssassin does use blocklists as part of the score, but you can lower the scores for those, or not use them at all. The scores aren't high enough to kill a message by themselves; I believe the highest score for a blocklist is 3.0, with the default threshold being 5.0.

        The more SpamAssassin is used, the more spammers will specifically avoid doing things SpamAssassin checks for.

        And if spammers decide not to send me pr0n or other crap, that's a bad thing?

        The only real problem I've had with SpamAssassin lately is that I'm stuck on version 2.20. My ISP needs to upgrade Perl before I can run more recent versions. :-(

        I'm not a big fan of Perl either.

      • It tends to rely on blocklists, many of which have demonstrated unfair practices in the past.

        I've turned them off; it's still 95% effective.

        The more SpamAssassin is used, the more spammers will specifically avoid doing things SpamAssassin checks for.

        As spam changes, so does SpamAssassin. It includes phrase frequency checks etc, too.

        It's a gigantic heap of perl, the Write-Only (tm) language. I hate the fact that every perl program demands I mess up the package manager on my system by blindly downloading a half-dozen new modules. And it's slow!

        Oh, hell, yes. It's really quite nasty code, badly spaghettified and relying on things like looped evals.

        Look at SpamAssassin/ -- it performs (body regexp rules * body message lines) regexp matches per message. It doesn't take long to see how nasty that'll get on a large message with over 200 rules :)

        Bogofilter is better. duh.

        Mmm, might have a look at that, thanks.

  • While I love everything there is to love about open source (code and ideas), I kind of worry when I read how well all these new Bayesian/Grahamian filtering techniques work.

    Not being a coder or statistician myself, I'm left wondering if the spammers can exploit it for a workaround. Is there something "built in" to these filtering techniques that can be used by spammers to effectively circumvent them?
    • > Is there something "built in" to these filtering techniques that can be used by spammers to effectively circumvent them?

      Yes and no.

      To defeat a Bayesian filter, the spammer needs to make his email contain similar words, and combinations of words, to your genuine email, while at the same time making sure that the words used are different from those in known spam.

      So saying 'click here to make $$$' won't work any more, since most of your regular emails don't contain the word combinations 'click here' and 'make $$$', whereas known spam emails will.

      However, we're already beginning to see spammers making their emails less obviously spam.

      For example, the spammer may use an email along the lines of:

      "How's things?

      Have you seen yet?

      Don't forget to mail me those documents.

      A Spammer"

      Even a Bayesian filter will struggle to distinguish that from:

      "Have you seen the story on slashdot yet?

      Don't forget those reports.

      Your Boss"
    • I bet they could get around it by picking a few random words from a dictionary and adding them to the end of the spam. If one of them were an obscure word that you've received in one or two legitimate e-mails, the filter would decide "Hey, I've never gotten a spam with the word 'yarborough' in it before, so it must be real".
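      For what it's worth, Graham's own scheme partly defends against this: a word that hasn't appeared at least five times is ignored as a predictor and given a neutral-ish default, so one obscure dictionary word can't flip the verdict by itself. A rough Python sketch (the counts and corpora are made up; only the 0.4 unknown-word default and the five-occurrence cutoff come from Graham's essay):

```python
# Sketch of Graham-style per-word spam probabilities.
# All counts are invented for illustration.

def word_probability(word, spam_counts, good_counts, n_spam, n_good,
                     unknown=0.4):
    """Estimate P(spam | word) from per-corpus occurrence counts."""
    s = spam_counts.get(word, 0)
    g = 2 * good_counts.get(word, 0)   # Graham double-counts good occurrences
    if s + g < 5:                      # too rare to be a reliable predictor
        return unknown
    p = (s / n_spam) / (s / n_spam + g / n_good)
    return min(0.99, max(0.01, p))     # clamp away from 0 and 1

spam_counts = {"click": 400, "here": 350, "yarborough": 0}
good_counts = {"click": 20, "here": 200, "yarborough": 1}

print(word_probability("click", spam_counts, good_counts, 1000, 1000))
print(word_probability("yarborough", spam_counts, good_counts, 1000, 1000))
```

      "click" scores around 0.91, while the rare "yarborough" falls below the cutoff and just gets the neutral default.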
  • Well... (Score:2, Informative)

    by ccarter (15555)
    I hate to give any kind of credit to M$, but they patented the idea of using Bayesian analysis for spam filtering circa 1995. They even had it in one of their betas. However, the filters were tagging some of those fricking Blue Mountain greeting cards as spam (imagine that!), so Blue Mountain sued them on anti-competitive grounds and M$ pulled it. Blue Mountain wanted to have the spam filters universally pass Blue Mountain content, but MS refused that on the grounds that if a user considers it spam then it is in fact spam to them (hurray for the "bad guys"!). The lawsuit has been settled/dropped/died for reasons I don't know.

    Anyway, I hear that the next version of MSN will have a Bayesian filter and that it will be introduced in an upcoming version of Outlook Express (no idea about Exchange and Outlook).

    BTW I believe internally MS uses this technique for spam control and that they don't seem to have any spam problems.
  • by operagost (62405)
    Note to statisticians: the product of the probabilities is monotonic with the Fisher inverse chi-square combined probability technique from meta-analysis. The null hypothesis is that the probabilities are independent and uniformly distributed.
    Ouch! My brain is hurting, Doc!
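    It's less painful than it sounds. Fisher's combining technique referenced above fits in a few lines of stdlib Python; the closed-form survival function below works because the degrees of freedom, 2k, are always even (the function name is mine):

```python
import math

def fisher_combined(pvals):
    """Fisher's method for combining independent p-values.
    Under the null (independent, uniformly distributed p-values),
    -2 * sum(ln p) follows a chi-square distribution with 2k degrees
    of freedom, where k = len(pvals)."""
    k = len(pvals)
    x = -2.0 * sum(math.log(p) for p in pvals)
    m = x / 2.0
    # For even df = 2k:  P(chi2 >= x) = exp(-m) * sum_{i<k} m^i / i!
    return math.exp(-m) * sum(m**i / math.factorial(i) for i in range(k))

# The combined p-value is monotonic in the product of the inputs,
# which is the sense in which the product-of-probabilities rule agrees
# with Fisher's method.
print(fisher_combined([0.01, 0.02]) < fisher_combined([0.05, 0.5]))
```

    With a single p-value it reduces to that p-value unchanged, a handy sanity check.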
  • sites like yahoo, hotmail, etc are in a unique position to rid their users of spam.

    i don't see why they can't implement some system that scans incoming mail for its users' mailboxes, maybe does a checksum for each message or something, and if it finds that a number of its users are receiving exactly (or nearly exactly) the same message, assume it's spam. nuke the messages, and any new incoming ones.

    yeah, if such a system only scans a small number of mailboxes, it may filter out mailing list posts and so on. but it gets more and more reliable the higher number of mailboxes it tracks.

    this avoids searching for certain keywords and eliminates false positives. after all, how well would these keyword searching methods do if i were to quote a spam message in an email to a friend?
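    a rough sketch of what such checksum counting might look like (the threshold and names are invented; real bulk mail often varies per recipient, so a production system would need fuzzy matching rather than exact hashes):

```python
import hashlib
from collections import Counter

# Flag a message as bulk once the same normalized body has been seen
# BULK_THRESHOLD times across the site's mailboxes.  This only catches
# byte-identical (after normalization) bulk mail.
BULK_THRESHOLD = 3
seen = Counter()

def looks_like_bulk(body: str) -> bool:
    # Light normalization so trivial whitespace/case tweaks hash the same.
    normalized = " ".join(body.split()).lower()
    digest = hashlib.sha256(normalized.encode()).hexdigest()
    seen[digest] += 1
    return seen[digest] >= BULK_THRESHOLD

msgs = ["Buy now!", "Buy   now!", "BUY NOW!", "Hi mom"]
print([looks_like_bulk(m) for m in msgs])  # [False, False, True, False]
```

    the third copy of the same pitch trips the threshold; the personal note doesn't.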
  • by XDG (39932) on Tuesday September 17, 2002 @05:18PM (#4276531) Journal
    Gary is both right in some respects and irrelevant in others. Here's the key line in his article that deflates it a bit:
    It is untested as of now. It is based purely on theoretical reasoning. If anyone wants to try and test it in comparison to other techniques, I'd be very interested in hearing the outcome.
    On the other hand, Paul Graham has actually tested his model, and it works. I've worked it up in perl and tested it on my own data set, and it works there, too. Paul acknowledges that he's being a bit fast and dirty, but the proof is in the pudding. The rest is just academic quibbling over the fine points.

    I'm not sure why this particular article needed to be posted, as it's just one of several alternative approaches and an untested one at that. On Paul's page, he also lists several published academic papers with other alternatives -- all actually tested, of course.

    Gary is basically right in questioning the use of the word "Bayesian". Paul's approach is more about weighing "evidence" as given by the appearance of certain words, rather than figuring out the probability of spam assuming a "prior". See Paul's explanation [], but if you check the article he references at the end, you'll note that the method Paul uses is only one of several methods to solve an underspecified problem. It's a reasonable guess, not necessarily the only guess.

    Looking at another article [] Paul references, given the word independence assumption, the more formal Naive Bayesian approach calculates as follows:
    p(spam|words) = [ p(spam)*p(word1|spam)*...*p(wordn|spam) ] / [ p(spam)*p(word1|spam)*...*p(wordn|spam) + p(!spam)*p(word1|!spam)*...*p(wordn|!spam) ]

    This is similar to Paul's approach except for including a "prior" assumption of p(spam) -- the expected probability of any email being spam, calculated from the historically observed frequency of spam. By leaving it out, Paul implicitly assumes that 50% of mail is spam -- that's his "prior" estimate of the spam rate. Given the other adjustments he makes to his sample, that appears to be acceptable in practice. (Paul overweights the spam prior, but also overweights the effects of "good" words.)

    I'd personally prefer to overweight the "good" e-mails entirely rather than just put a "good-multiplier" on them like Paul does, but that's just quibbling over small bits.

    As to the point Gary raises about Paul assuming a spam probability for an unknown word -- Paul originally said .2, then revised to .4, but really should have put it at .5 or just excluded it from all calculations. A new word has no robustness as a predictor (which is why Paul dropped words that didn't appear five times). In practice, a new word at .4 isn't going to be among the 15 most interesting words the calculation is made from anyway.
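    The formula above is easy to play with directly. A minimal Python sketch (the per-word likelihoods are invented for illustration) shows that with a 50% prior the prior terms cancel, which is exactly the sense in which Paul leaves it out:

```python
from math import prod

# Naive-Bayes combination per the formula above, with an explicit prior.
# lik_spam[i] is p(word_i|spam), lik_ham[i] is p(word_i|!spam).

def naive_bayes(lik_spam, lik_ham, prior=0.5):
    """Return p(spam|words); prior is p(spam)."""
    s = prior * prod(lik_spam)
    h = (1 - prior) * prod(lik_ham)
    return s / (s + h)

lik_spam = [0.20, 0.10]   # e.g. p("click"|spam), p("$$$"|spam) -- made up
lik_ham  = [0.01, 0.05]   # e.g. p("click"|!spam), p("$$$"|!spam)

# With prior=0.5 the prior terms cancel -- Paul's implicit assumption.
print(round(naive_bayes(lik_spam, lik_ham), 4))             # 0.9756
print(round(naive_bayes(lik_spam, lik_ham, prior=0.8), 4))  # 0.9938
```

    Raising the prior nudges the result toward spam, but with likelihood ratios this lopsided the prior barely matters, which is why Paul's shortcut is acceptable in practice.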


  • How long until we throw out the current e-mail system?

    I own my own domain, which makes it easier, but we really need a system designed for filtering, and one that makes it easier. This is my uninformed proposal. Perhaps it won't work, but it seems something is needed.

    People should have a private and a public e-mail address. Both should go to the same "account" and be part of the basic plan for any e-mail user.

    Mail to the private address means "I know this is important and relevant."

    Mail to the public address means "I gave this person my e-mail address"; it will go into the crap bin and be deleted eventually. Perhaps some program could be used to alert users to possibly important pieces of mail there.

    Then we could also have some form-based system to CHANGE the private or public authentication, i.e., "This address has been disconnected. Please apply for the new password."

  • by Have Blue (616)
    Apple's new spam detector works amazingly well for me. After some initial jitters it pretty much never gets false positives these days.
  • microsofts trademark (Score:3, Informative)

    by portal9 (518319) on Tuesday September 17, 2002 @06:02PM (#4276940)
    why are we even considering this method when microsoft has a trademark on it? nothing can be done.. they have a lock on it. trademark here []
    • minor nit-picking detail: they have a patent. patent and trademark are completely different.

    • by thogard (43403)
      another stupid patent? This isn't new; it's been done with spam on Usenet for years. Maybe someone should dig out the cancelmoose's friends as prior art?
    • 1. it is a patent, not a trademark
      2. just because someone has a patent doesn't mean the patent can't be challenged.
      3. just because someone has a patent doesn't mean a patent will be enforced.
      4. Some things are worth fighting for
  • All it says in the help is that it is adaptive and trains itself on your previous spam. It would be nice to see some source... and be able to patch it if we don't like it.... oh well, whining won't get me anywhere.
  • Let me start by saying I know very little about coding, otherwise I'd probably already be rushing off to a night of coding by the glow from my monitor.

    When the first Bayesian spam filtering article was posted, I thought it was a great idea, and this article just reinforces that. However, it would be interesting to build some sort of Sendmail module (or one for whatever MTA you like) that adds some additional functionality:

    1. Option to return a 550 error if the message is determined to be spam: "550 Delivery blocked; Bayesian filter reports spam probability of nn%"
    - Right before reporting this error, wait n seconds or alternately, slow connection to n bps for n minutes.
    - After reporting the error, "deliver" the Subject and Body of the email to the spam words database.
    2. Inclusion of a whitelist, by IP, reverse DNS, MAIL FROM address, or RCPT TO address, header To: address, header From: address, etc.
    3. Configuration of account where spams can be forwarded to, for automatic addition to the database.
    - Perhaps this could be combined with the blacklist/whitelist. For example, any emails to are always added to the DB. The entry could be as follows (similar to the Sendmail access map): <tab> BAYESIAN:SILENT
    - This would allow for either silent addition to the filter (sender thinks mail was delivered -- good for spam harvesting emails, or for users to send their spam to), or a more "vocal" addition much like item #1 above, where a 550 error is reported... eg, BAYESIAN:550 or perhaps BAYESIAN:REJECT

    I realize this would block a lot of mail, but I have my Sendmail currently configured to actually block spam (or what it considers spam) and have had very few issues with valid messages bouncing. Obviously, results may vary, but I'm a firm believer in rejecting spam during the SMTP conversation, not accepting it and then deleting it silently.

    Does anyone else have any suggestions?
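    As a strawman for item #1, the decision at end-of-DATA might look something like this (the classifier is stubbed out as a probability argument; the reply text and threshold are placeholders):

```python
# Sketch of SMTP-time rejection: decide the reply at end-of-DATA instead
# of accepting the message and deleting it silently.  The Bayesian
# classifier itself is stubbed out as the spam_probability argument.

def smtp_response(spam_probability, threshold=0.9, whitelisted=False):
    """Return the SMTP reply line for a message with the given spam score."""
    if whitelisted or spam_probability < threshold:
        return "250 OK: message accepted"
    pct = round(spam_probability * 100)
    return (f"550 Delivery blocked; Bayesian filter reports "
            f"spam probability of {pct}%")

print(smtp_response(0.97))                    # rejected with the score
print(smtp_response(0.97, whitelisted=True))  # whitelist overrides the filter
print(smtp_response(0.30))                    # ham passes normally
```

    The whitelist check runs first, matching item #2's intent that a whitelisted sender never sees a 550 regardless of score.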
  • This whole methodology is already patented by Microsoft. ANY implementation not licensed by Microsoft is going to be a violation... And now that you know, it is treble damages...

    patent 6,161,130 []
  • Seems like everyone jumped on the bandwagon and implemented a Bayesian spam filter shortly after Graham's article hit the net. Best part is, theory or not, the damn thing actually works.

    Paul's article lists a few of the Bayesian spam filters, but here's a short list of the ones I've tried:
    Gary Arnold's bayespam [] is implemented in perl and geared towards qmail using maildir storage.

    Brian Burton's spamprobe [], written in C++, remembers already-seen messages, so you can dump your spam and good mail in separate folders, have spamprobe learn from them, and delete them afterwards; it won't reprocess a message it has already seen.

    Eric Raymond's bogofilter is a typical ESR tool: concise, with a baroquely written man page, and quite simplistic, but it does its job and does it well. ESR even uses some funny terms, like "spamicity" and "ham" (the opposite of spam). I don't like its dependency on the Judy libraries for dynamic arrays, but what the heck.

    Matthew Walker's BayesSpam [] plugin for Squirrelmail provides squirrelmail users with bayesian spam filtering capabilities, no longer restricting use of the technique to those with access to procmail/mailfilter systems.

  • You can download the source here [] if you like.

    It's not from the same guy, but it's definitely derivative work.
  • IIUC, the proposed method normalizes (with an Ln norm) over the number of words for the "spammishness" and "unspammishness" of words, combining the results.

    What's stopping the spammers from attaching, say, a random scientific article longer than the spam at the end of the spam message? This would give the spam a high grade with these Bayesian methods in general, but more so with his normalizing metric.
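    Worth noting: Graham's trick of scoring only the 15 "most interesting" words (those furthest from neutral) is aimed at exactly this attack. A toy Python sketch with invented per-word probabilities shows the difference between naive averaging and Graham-style combining over the most interesting words:

```python
from math import prod

# Invented per-word spam probabilities: a few strong spam indicators plus
# a pile of neutral words from a pasted scientific article.
spam_core = [0.99, 0.98, 0.97, 0.95]
padding = [0.5] * 100

def graham_combine(probs):
    """Graham's combining rule: prod(p) / (prod(p) + prod(1-p)).
    Neutral 0.5 words contribute equally to both products and cancel."""
    s = prod(probs)
    h = prod(1 - p for p in probs)
    return s / (s + h)

def most_interesting(probs, n=15):
    """Keep only the n words whose probability is furthest from 0.5."""
    return sorted(probs, key=lambda p: abs(p - 0.5), reverse=True)[:n]

all_words = spam_core + padding
# Averaging over everything is diluted toward 0.5 by the padding...
print(sum(all_words) / len(all_words) < 0.55)                # True
# ...but scoring only the most interesting words still flags it as spam.
print(graham_combine(most_interesting(all_words)) > 0.99)    # True
```

    The padding drags a plain average almost all the way back to neutral, but the interesting-word selection (and the cancellation of 0.5 words in the combining rule) leaves the verdict essentially untouched.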

    • It depends on whether the Bayesian method is naive or not.

      If for the top 1000 highest and lowest words you build a pair-wise table of each highest paired with each lowest, and keep probabilities for these pairs as well, then that would solve the problem.

      For regular email that contains no spam words, no problem. For spam that contains only spam words, no problem. For spam that contains both kinds of words, the pair table would catch it.

      I mean, how many valid emails can you possibly have that contain both scientific terms and words like "hot teen sex"? Unless, of course, it's a scientific study about either spam or hot teen sex.

      Unless, of course, I'm completely wrong about this whole thing and just don't realize it, which is sometimes far more likely than I approve of.

      Justin Dubs
      • I mean, how many valid emails can you possibly have that both have scientific terms and words like "hot teen sex." Unless, of course, it's a scientific study about either spam, or hot teen sex.

        well, for "hot teen sex" or "novel penis enlargement techniques available today!!" spam, I guess you're right. but for "get your mortgage now!" or "cheap toner at amazing prices!" kinds of spam this seems more tricky.

        Unless, of course, I'm completely wrong about this whole thing and just don't realize it, which is sometimes far more likely than I approve of.

        I'm no expert either, just skeptically paranoid...
  • I like the sound of this one - seems like it should eliminate all false positives sent by real people and all false negatives. I worry about auto-responders and auto-reminders, though. TMDA (Tagged Message Delivery Agent) []
