Spam

Sorting the Spam from the Ham 249

MrClever writes "The Sydney Morning Herald (Aust) is running an article about the merits of Bayesian filtering and a good plain-English description of how it works. Might be handy if you need to explain it to non-technophiles. The main thing that may be useful is a Bayesian spam filter written to drop straight into Outlook 2k/XP available here and written in Python by Mark Hammond." Math buffs might enjoy reading these pages or browsing this writeup and its many links.
This discussion has been archived. No new comments can be posted.

  • by pnix ( 682520 ) on Thursday June 26, 2003 @01:29PM (#6304241) Homepage
    But without spam, I wouldn't get any email!
    • You're getting modded as funny, but I just figured out exactly how true this is. My main email account is used primarily for work, so it was very easy to set up white lists for 30 or so email addresses with a few family and friends thrown in, and route them to a special folder. I still check the default folder, of course, but I turned off notification for everything except the white folder.

      I went from checking my email every 5-10 minutes to a handful of times a day.
    • Yeah 95% of my incoming emails have this subject:

      Subject: You Blocked My MSN.

      I now have a new purpose in life. To reply to each and every one of them, even if the address exists or not.

      RE: You Blocked My MSN

      BECAUSE YOU KEEP SPAMMING ME ASSHOLE!!!
  • I get to do something to stop my boss from enlarging his penis anymore... It's really starting to hurt.
  • Why not here? (Score:5, Interesting)

    by Anonymous Coward on Thursday June 26, 2003 @01:31PM (#6304282)
    What happens if Slashdot runs a Bayesian filter which runs a day after the stories are posted and programs itself with all the -1 comments as "Spam" and all the +5 comments as "Ham". Then let the Bayesian filter adjust all incoming messages by up to 2 points.

    I bet it'd work - and imagine if we did it to stories too! Maybe it'd reject all Taco's dupe submissions.
    • Re:Why not here? (Score:5, Interesting)

      by bmongar ( 230600 ) on Thursday June 26, 2003 @01:59PM (#6304595)
      Very interesting, but I think it wouldn't work well: since most of the trolls and flamethrowers are talking about the same topics, the same words will show up in both ham and spam posts. But if someone could come up with a word pattern algorithm that could differentiate, that would rock.
    • Re:Why not here? (Score:2, Insightful)

      by kels ( 9845 )
      I bet it'd work - and imagine if we did it to stories too! Maybe it'd reject all Taco's dupe submissions.

      Umm, a naive Bayesian filter would score duplicate posts highly, because after all they contain all the same words that were good last time.
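To make the point about duplicates concrete, here is a minimal sketch of a word-count naive Bayes scorer, assuming comments are split on whitespace and that +5 comments are fed in as "ham" and -1 comments as "spam". The function names and the tiny corpus are made up for illustration; this is not Slashcode or any real filter.

```python
import math
from collections import Counter

def train(comments):
    """comments: iterable of (text, label) pairs, label is "ham" or "spam"."""
    counts = {"ham": Counter(), "spam": Counter()}
    totals = Counter()
    for text, label in comments:
        counts[label].update(text.lower().split())
        totals[label] += 1
    return counts, totals

def spam_probability(text, counts, totals):
    """Naive Bayes with add-one smoothing; returns P(spam | words)."""
    vocab_size = len(set(counts["ham"]) | set(counts["spam"]))
    spam_denom = sum(counts["spam"].values()) + vocab_size
    ham_denom = sum(counts["ham"].values()) + vocab_size
    # Start from the prior odds, then add one log-likelihood ratio per word.
    log_odds = math.log(totals["spam"] / totals["ham"])
    for word in text.lower().split():
        p_spam = (counts["spam"][word] + 1) / spam_denom
        p_ham = (counts["ham"][word] + 1) / ham_denom
        log_odds += math.log(p_spam / p_ham)
    return 1 / (1 + math.exp(-log_odds))

# Hypothetical training data: one +5 comment and one -1 comment.
corpus = [
    ("insightful analysis of bayesian filtering", "ham"),
    ("first post you all suck", "spam"),
]
counts, totals = train(corpus)
dupe_score = spam_probability("insightful analysis of bayesian filtering", counts, totals)
troll_score = spam_probability("first post", counts, totals)
```

A word-for-word repost of the +5 comment lands firmly on the ham side, because the scorer only sees words and has no notion of "this was already posted".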
  • What I want (Score:5, Interesting)

    by Nate Fox ( 1271 ) on Thursday June 26, 2003 @01:32PM (#6304285)
    is a scalable popfile [sf.net] for larger organizations. If I could get popfile (with its super-easy-to-train/use-web-interface) that would run on my linux server, scan my IMAP mail server (well, incoming mail would actually work fine, too. I've heard they have an SMTP plugin for it in cvs), and then have a popfile config page for each person, or maybe tie it into the imap/smtp server's login. THAT would rock. I've heard SpamAssassin does Bayesian, but I couldn't see how it was trainable (and I don't want other people on my server to read each other's mail, obviously).
    • What I want is physical pain upon the sender whenever spam is sent. *THAT* would be much better I think :]

      Hell, even a fee or mental anguish would suffice...
      • There 'ya go! What we need is some good old fashioned ass-whoopin' corporeal punishment.

        I'm personally a proponent of tar-and-feathering, but that's just me. After a few times walking around like a deranged Big Bird, I think spammers might find a real job.

    • Re:What I want (Score:4, Informative)

      by franimal ( 157291 ) on Thursday June 26, 2003 @01:41PM (#6304393) Homepage
      Personally, I really like Spambayes and Procmail for use with my IMAP server. It's easy to setup for each user and they can train their own SPAM database. You can even run the training script as a cron job and the users only need to shuffle unknowns to the spam folder. Works well, because users never even have to see the spam, if they don't want to.
    • is a scalable popfile for larger organizations. If I could get popfile (with its super-easy-to-train/use-web-interface) that would run on my linux server, scan my IMAP mail server (well, incoming mail would actually work fine, too. I've heard they have an SMTP plugin for it in cvs), and then have a popfile config page for each person, or maybe tie it into the imap/smtp server's login. THAT would rock.

      Actually, I would love to have the same thing. Popfile is all Perl and open-source, so it could probably
    • Re:What I want (Score:5, Informative)

      by leshert ( 40509 ) on Thursday June 26, 2003 @02:53PM (#6305107) Homepage
      Spamassassin learns in two ways:
      1. Manual training: there is a tool called 'sa-learn'. You can pipe a message to it, or point it to a mailbox, and specify whether the mail is spam or ham.
      2. Automatic training: if the score of the mail is significantly high (definitely spam) or significantly low (definitely ham), it will automatically train on the message. This may seem useless, but it's useful in that SA will then start to figure out patterns in spam or ham that don't trigger its rules.

      I read mail with Mutt, and I've remapped the 'd'elete key to instead throw the message into a 'ham' mbox, and added a 'S'pam mapping to throw the message into a 'spam' mbox. Then I added a nightly cron job to run sa-learn over the two mboxes and truncate them. This has worked very, very well for me... I haven't had a single false positive since Bayes kicked in about two months ago, and I got my first false negative in about two weeks today. I typically trap 10-15 spams a day.

      One thing to notice: even if you enable it, Bayesian filtering won't kick in until you've recognized at least 200 spam and 200 ham messages. Took me a long time to figure that out (I had plenty of spam, but I wasn't training it on ham at all, which is why I started remapping the mutt commands).

      As far as installing it on a server, your users don't have to be able to read each others' mail. I have it installed so that my wife and I each have our own bayes dbs, so neither of us has to read each others' mail. Plus, different users will regard different mail as spam: anything about the Pittsburgh Steelers going to my mailbox is probably spam, but not hers; similarly, anything regarding Linux going to her mailbox is probably spam, but not mine.
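The automatic-training rule and the 200/200 warm-up described above can be sketched roughly like this. It assumes SpamAssassin's usual convention that a higher rule score means more spam-like. The cutoff values, the class and the function names are illustrative assumptions, not SpamAssassin's actual code or configuration keys.

```python
from dataclasses import dataclass

MIN_TRAINED = 200  # from the comment above: 200 spam and 200 ham before Bayes kicks in

@dataclass
class BayesState:
    spam_seen: int = 0
    ham_seen: int = 0

    def ready(self):
        """Bayes results are only used once both classes have enough examples."""
        return self.spam_seen >= MIN_TRAINED and self.ham_seen >= MIN_TRAINED

def auto_train(state, rule_score, spam_cutoff=12.0, ham_cutoff=0.1):
    """Train only on mail the static rules are very sure about.

    spam_cutoff and ham_cutoff are assumed values for this sketch.
    Returns the label trained on, or None if the score was in the unsure middle.
    """
    if rule_score >= spam_cutoff:
        state.spam_seen += 1
        return "spam"
    if rule_score <= ham_cutoff:
        state.ham_seen += 1
        return "ham"
    return None
```

The sketch also shows why training only on spam stalls things: `ready()` stays false until the ham counter catches up, no matter how much spam has been learned.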
      • Re:What I want (Score:3, Interesting)

        by slagdogg ( 549983 )
        I read mail with Mutt, and I've remapped the 'd'elete key to instead throw the message into a 'ham' mbox, and added a 'S'pam mapping to throw the message into a 'spam' mbox.

        Would you mind sharing your .muttrc for this?
        • You can look at my macros for spam classification [aperiodic.net]. (Linked instead of posted directly because slashcode kept inserting unwanted spaces in them.)

          With these, "S" will classify an email as spam. "H" will reclassify a false positive--it's designed to operate on an email that SpamAssassin has munged as spam and won't work on regular emails. I haven't bothered to write a macro to train regular emails as ham, though I probably should.

          --Phil (Mutt Mafia member since 1998)

          • Nice, thanks! I'm guessing that the reason for the extra spaces in slashcode is to prevent someone from mucking up the page rendering by including a long word with no spaces ... annoying, but definitely necessary.
        • Re:What I want (Score:4, Interesting)

          by leshert ( 40509 ) on Thursday June 26, 2003 @07:48PM (#6307462) Homepage
          Not at all. The macros are short and sweet:


          macro index d "<save-message>~/Mail/bham^my"
          macro pager d "<save-message>~/Mail/bham^my"
          macro index S "<save-message>~/Mail/bspam^my"
          macro pager S "<save-message>~/Mail/bspam^my"


          Then the relevant sections of my crontab look like this:


          0 2 * * * /usr/bin/sa-learn --spam --mbox /home/tim/Mail/bspam
          15 2 * * * /usr/bin/sa-learn --ham --mbox /home/tim/Mail/bham


          In another post (as well as on several sites on the web), it's recommended to bind a key to pipe the message directly to sa-learn. I read my mail on the server, which is an embarrassingly old machine, and sa-learn takes on the order of 30 seconds per email--not fun when you're just doing 'that last check of email before heading home'. Copying the mail to a file is just about instantaneous, and then sa-learn can do its dirty work while I'm sleeping (or watching The Office, as the case may be).
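The crontab lines above run sa-learn but don't show the "truncate them" step. One way to do both in a single nightly script might look like the sketch below. The sa-learn flags are the ones from the crontab. The function names, the injectable `run` parameter and the truncate-after-success policy are assumptions for illustration.

```python
import subprocess
from pathlib import Path

def sa_learn_command(kind, mbox_path):
    """Build the same sa-learn invocation the crontab uses."""
    if kind not in ("spam", "ham"):
        raise ValueError("kind must be 'spam' or 'ham'")
    return ["/usr/bin/sa-learn", f"--{kind}", "--mbox", str(mbox_path)]

def learn_and_truncate(kind, mbox_path, run=subprocess.run):
    """Feed an mbox to sa-learn, then empty it. Returns False if there was nothing to do."""
    mbox = Path(mbox_path)
    if not mbox.exists() or mbox.stat().st_size == 0:
        return False
    # check=True raises if sa-learn fails, so the mbox is kept for the next run.
    run(sa_learn_command(kind, mbox), check=True)
    mbox.write_bytes(b"")
    return True
```

Note that a message saved between the sa-learn run and the truncation would be lost. At 2 AM that is unlikely, but moving the mbox aside first and learning from the copy would close that gap.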
    • I agree with you and we are planning to get to that ASAP. There's some underlying work we need to do on performance first (that's planned for v0.20.0) and then we'll have the foundation for multiusers, pretty much as you describe. If anyone out there wants to write an IMAP module (subclass of Proxy::Proxy) then I'd be very happy to accept it. John.
      • Why would you need an IMAP module? Just intercept the email before it comes to the INBOX, and do the bucket assignment there.

        I was actually thinking of forking the popfile system to work with IMAP.
  • And if you could, would you really want to?

    bloodninja: Baby, I been havin a tough night so treat me nice aight?
    BritneySpears14: Aight.
    bloodninja: Slip out of those pants baby, yeah.
    BritneySpears14: I slip out of my pants, just for you, bloodninja.
    bloodninja: Oh yeah, aight. Aight, I put on my robe and wizard hat.
    BritneySpears14: Oh, I like to play dress up.
    bloodninja: Me too baby.
    BritneySpears14: I kiss you softly on your chest.
    bloodninja: I cast Lvl 3 Eroticism. You turn into a real beautiful woman.
    Britne
  • by aborchers ( 471342 ) on Thursday June 26, 2003 @01:34PM (#6304314) Homepage Journal

    The main thing that may be useful is a Bayesian spam filter written to drop straight into Outlook 2k/XP


    I've now lost one of my primary arguments for switching my colleagues to Mozilla!

    • I've now lost one of my primary arguments for switching my colleagues to Mozilla!

      Then switch them to kmail. Kmail has a pass-through script filter option that would allow you to use any console-mode spam filter for Linux, such as bogofilter.

    • From what I understand, beta testers tell me the next revision of the Outlook client contains a spam filtering function that works pretty well too. I do like the Mozilla 1.4 junk mail features though - they work about as well as I could have hoped.
    • by Mikey-San ( 582838 ) on Thursday June 26, 2003 @02:24PM (#6304827) Homepage Journal
      I know your post was meant to be funny, but it brings up a point:

      So what? If more computer products benefit, don't we all? Anything that makes Outlook better is good in my book. Perhaps this will eliminate some virus-and-worm-carrying spam--and that's good for /all/ of us on teh intarweb. ;-)
  • My own personal account is on a shared server at pair.com, and I run SpamAssassin (the perl script, can't put the spamc/d on there since I'm not root).
    I have written on here before how I have saved myself a lot of hassle over the last few months by installing SA. I now stop 100+ messages a day (usually more like 140 now).
    My stats tell me that since Feb, I've stopped over 15K Spam messages. Hot damn.

    Where I currently work now we have Exchange and I wanted SpamAssassin on there, but we weren't getting the money approved to put it on.
    So I hacked in SpamAssassin via an Exchange 2000/2003 EventSink.
    If you want the code for it, feel free to grab it from http://www.cardboardutopia.com/ExchangeSpamFilter.zip

    But do note that if you have many users on your machine, you aren't going to want to use this - an EventSink on Exchange runs in serial, so SpamAssassin's Perl script (the spamc/d doesn't work under Win32) will get executed on every incoming mail, and it will have to wait until it is done before it gets the next one.

    We process about 2000-5000 incoming messages a day and it does okay, but we have a very light load.
    • SpamAssassin is nice, but it's nowhere near the 99% elimination claim in the article (a vaporous claim in the article? The hell you say!)

      SpamAssassin, set at 5 (after I got a false positive at 4), stops about 75-80% of spam, but with some more rules from me (how did SpamAssassin let 'huge c-cks' get through?!) it stops closer to 90%.

      The only solution I've tried that worked well has been white lists, but that only works so well because I don't make a lot of new friends :)
    • We ran SpamAssassin on Python.org and Zope.org for a considerable length of time. We had, however, many false positives to deal with (we manually checked everything that scored between 5 and 10 points on the SpamAssassin scale). Usually, we had to review between 10 and 15 messages a day like this.

      We recently switched to SpamBayes, and our false-positive rate so far is 6 out of 2200+ spams (almost 12 days of traffic, with certain foreign charactersets, malformed email headers and blacklisted ema
    • by vanyel ( 28049 ) * on Thursday June 26, 2003 @03:12PM (#6305258) Journal
      I run a small ISP with spamassassin installed, and I had to increase the default quota when I upgraded to the version with Bayesian filtering and its multi-megabyte databases per user. Combined with spamd bugs forcing me to switch back to running spamassassin individually and the fact that spamd still doesn't serialize processing, so the system still gets hammered by a flood of spam, I'm looking forward to greylisting [puremagic.com] to help take the load off spamassassin.
  • Spambayes (Score:5, Informative)

    by Chromodromic ( 668389 ) on Thursday June 26, 2003 @01:35PM (#6304330)
    I use Spambayes with Outlook 2000, and it takes a little tweaking, but it works as advertised. Ahhh, the magic of mathematics. Just now, brought up Outlook, checked my mail and three little messages offering a free Sony headset, 70% off cell accessories, and a chance to take an IQ test just got tossed into my spam folder. Thanks anyway, but I think that means I just passed my IQ test.

    Every so often I go in and take out some old, old spam, just to make sure my current preferences are being represented and that's all the maintenance that's required.

    This is, however, the second time I've trained the filter. The first time, it incorrectly identified my FreeBSD status mails as spam, and from then on was throwing those into the Spam folder. My own fault, though, since I hadn't included any of these messages in my representative ham.

    If you run Outlook, download this filter and use it. You'll be doing yourself, and a world that doesn't need fat-injected, herbally enhanced penises, a favor.
    • Popfile (Score:3, Informative)

      I use PopFile [sourceforge.net]. What I like about it is that it easily lets me use multiple personalities in Eudora, Outlook or any other mail client. Nice web based interface and a very active development community.

      You can run it locally on Windows or Linux. But, you can also set it up on a server and then use it to filter e-mail from multiple client machines. That's what I like about it. I have a home machine in my basement office but also upstairs in the TV room. Unlike plug-ins that only work locally, I can have
    • Re:Spambayes (Score:4, Insightful)

      by AssFace ( 118098 ) <`moc.liamg' `ta' `77znets'> on Thursday June 26, 2003 @02:20PM (#6304774) Homepage Journal
      I have seen all of the local client software and I personally have never bothered with it.

      I always felt that the whole point of spam being annoying was that it wasted bandwidth. It gets sent to my server, and then I have to download it all from my server, and then it gets sorted away from my eyes in my client.

      It is fairly trivial if you get enough regular mail for it to matter, and you are on a fast connection.

      But I can't tell you how annoying it is to be on a slow dial-up connection and download 50 messages and then see that they all got filtered into the spam folder and that there were no "real" messages.
      While there is a nice feeling of seeing them all get caught, it is annoying to have to wait for a download (and pay for it) and then get no return on the investment.

      That is why I always try to have the spam blocking on the server side. Although I now spend most of my time using ssh into my server and that way it isn't downloading all of the mail until I want to see something.

      Perhaps if I combine the fact that I have SA on the server, and then if I also had a client side option, I would get everything properly blocked that way (the only reason stuff gets through my server setup right now is if the server is under a high load, then my SA script will time out and the mail gets through).
    • SpamNet (Score:3, Informative)

      by SunPin ( 596554 )
      I use spamnet by cloudmark. It catches everything. I can't remember the last time I had to click the "block" button. I'm very conscious of where my email ends up and I'm a hardcore advocate of email aliases. As a result, since September (last major crash), spamnet has blocked 4000 pieces while I've actively blocked only 11.

      That's pretty f'n good in my book. So good, in fact, that I send all blocked messages to the "Delete" folder instead of the default "spam" folder and set outlook to permanently delet
  • by Meat Blaster ( 578650 ) on Thursday June 26, 2003 @01:37PM (#6304345)
    I've tried a number of different ways to filter spam, from whitelisting to Bayesian filtering, and Bayesian seems to offer a good balance between not eating too much of the ham and not letting the spam through. Not too shabby, especially given that it comes with Mozilla now, and I think it's an excellent way of allowing clients to determine what they want to see without infringing free speech.

    I don't know if I'd want it in Python, though... it does seem to be a good deal slower already than other spam filtering methods without putting it in a scripting language. Getting it in Outlook can only be good for the net (can Bayesian be applied to things like spam from Internet virii as well?)

    • What can I say? I rarely ever see spam these days thanks to this approach. Popfile is one of the more mature solutions to spam, although it's a classifier not just a spam filter.

      Since Feb I've had 2,215 messages and it has made only 37 mistakes. 98.32% accuracy. I've tried a few commercial products and they were lucky to approach 50% accuracy.
    • I've tried a number of different ways to filter spam, from whitelisting to Bayesian filtering, and Bayesian seems to offer a good balance between not eating too much of the ham and not letting the spam through.

      Agreed. Although, I'm a bit disappointed that many of the bayesian filter projects don't offer whitelisting in conjunction with the filters. If I'm running a business, it's really important that I allow all email from *@myclient.com, regardless of what the spam filters think about it.

      I think it's

      • In a sense, popfile does provide whitelisting at the level you are looking for. Popfile has what they call "Magnets", where you configure a string that you want popfile to look for, and before it applies any of the Bayesian rules to the message, if it sees that string, it classifies the message as you request.

        Functionally I believe this means that it effectively ignores the content at that point, meaning that other messages that you receive are not classified relative to these messages. I could be wrong howe
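The magnet idea described above (check configured strings first, only fall back to Bayes if none match) could be sketched as follows. This is not POPFile's code, which is Perl. The tuple layout, the dict-based message and the `bayes_classify` callback are all assumptions made for the example.

```python
def classify(message, magnets, bayes_classify):
    """Return (bucket, reason) for a message.

    message: dict of lower-case header names to values (hypothetical shape).
    magnets: list of (header, substring, bucket) tuples, checked in order.
    bayes_classify: fallback callable taking the message and returning a bucket.
    """
    for header, needle, bucket in magnets:
        if needle.lower() in message.get(header, "").lower():
            # A magnet hit short-circuits: the Bayesian scorer never sees this mail.
            return bucket, "magnet"
    return bayes_classify(message), "bayes"
```

With a magnet like `("from", "@myclient.com", "inbox")`, mail from the client goes to the inbox even if the fallback classifier would have called it spam, which is the whitelisting behaviour asked for above.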
    • can Bayesian be applied to things like spam from Internet virii as well?

      What if the filtering programs had a feature that would allow somebody to send out the "signature" of an email virus that the filter could use to block the virus before it had ever actually seen one, by adding its characteristics to the list of things that weigh heavily toward spam so it would be filtered out before ever reaching Exchange/Outlook?
  • by adamhupp ( 29341 ) on Thursday June 26, 2003 @01:37PM (#6304351) Homepage
    The Outlook plugin may have been written by Mark Hammond but spambayes is very much a group effort. The project can be found at spambayes.sf.net [sf.net].

    I've been using spambayes for months now and it really is quite amazing. Now, when I get the occasional spam in my mailbox it's actually interesting because I want to figure out why it made it in. The number of false positives is almost nil, and the ones that do get hit are spammy looking autogenerated receipts from purchases I've made. It's made reading email a much more enjoyable activity.

    -Adam

    • by Wakko Warner ( 324 ) * on Thursday June 26, 2003 @01:58PM (#6304584) Homepage Journal
      The number of false positives is almost nil, and the ones that do get hit are spammy looking autogenerated receipts from purchases I've made.

      This is quite possibly the only complaint I have about spambayes, too, and it's not even that big a deal to me. After about a month of collecting spam in its own folder (named SHIT, oddly enough), it had learned enough that I was able to dial down my SpamAssassin settings (I use an old version of SA still, too, without the bayesian stuff built in -- too lazy to switch; spambayes works well enough that it's not worth it.) I check my incoming spam folder once or twice a week now, as opposed to once or twice a day when I only ran SpamAssassin at a relatively forgiving (4.5-5.5) setting.

      There are a few thousand spams in SB's crap folder now; it's gotten so good that I can't really remember the last time I've had something miscategorized as spam, and of the 50-60 spams I get per day, usually only one or two make it through to my inbox, if that. Half of the time, I don't get any at all.

      If you didn't have a reason for installing a Python interpreter before, now you do.

      - A.P.
  • by Anonymous Coward
    The first discovery I'd like to present here is an algorithm for lazy evaluation of research papers. Just write whatever you want and don't cite any previous work, and indignant readers will send you references to all the papers you should have cited. I discovered this algorithm after ``A Plan for Spam'' [1] was on Slashdot.

    Spam filtering is a subset of text classification, which is a well established field, but the first papers about Bayesian spam filtering per se seem to have been two given at the same c
  • by notque ( 636838 ) on Thursday June 26, 2003 @01:40PM (#6304378) Homepage Journal
    Would you use the phone if you had to listen to a 10-second brothel advertisement every time you made a call?

    Yes.

    Definitely Yes.

    Is that a feature I can have added?
  • Eudora users... (Score:3, Informative)

    by Control-Z ( 321144 ) on Thursday June 26, 2003 @01:40PM (#6304380)

    Eudora 6.0 beta has spam filtering which seems to be Bayesian. It's a little slower to learn than PopFile, but it's pretty good so far, and of course integrated with the Eudora UI.

    http://eudora.com/betas

  • by ToadMan8 ( 521480 ) on Thursday June 26, 2003 @01:40PM (#6304387)
    I sat on the E-Mail policy team (a branch of the Strategic Planning team) for Miami University (Oxford, OH, not Florida) this last year (as a technical advisor, student and support desk employee). We looked at all sorts of spam solutions, as the president decided this should be a main focus (apparently the Viagra ads hit a bit too close to home for comfort ;)).

    The problem in the educational market, though, is that, not being a business that can make rules and force people to live by them, educational establishments sometimes have annoyed customers (students and faculty) if any spam is blocked (research, etc.). False positives absolutely can't be tolerated. So a ranked system (SpamAssassin) that suggests the possibility of spam is not only the best but the only solution we have available. Mail will be ranked and users can make rules that trash everything but a guaranteed perfect mail, if they so desire. Or they can leave them all alone. So intelligent filtering is a necessity, not just a benefit.

    On another page, I had an odd place during this discussion of the team. I do not receive spam. (Please, don't start now). My MUOhio.edu address simply doesn't get a single piece of spam e-mail. I have had the account for two years. I have over 3000 messages in various folders. And none are spam at all. I just haven't signed up for anything with it. I put the e-mail addy on webpages too (that I author) and haven't gotten a single thing. But oh my the trash "spam" account gets 60 a day. On AOL. That blocks 80% of incoming mail. Ironically, they had MUOhio.edu blocked weeks back.
    • I don't know spambayes, but bogofilter most definitely can operate in a "ranking" mode:

      • X-Bogosity: No, tests=bogofilter, spamicity=0.000000, version=0.12.2
      • X-Bogosity: Unsure, tests=bogofilter, spamicity=0.499150, version=0.12.2
      • X-Bogosity: Yes, tests=bogofilter, spamicity=0.969917, version=0.12.2

      Then you can header-match in your MUA all you want--or not. (I run it all through procmail, but that's because I want all the filtering done before it hits my IMAP server.)
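As a rough illustration of header-matching on those X-Bogosity lines, here is a sketch using Python's standard email module. The header format is taken from the three examples above. The folder names and the 0.9 cutoff are arbitrary choices for the example, not bogofilter defaults.

```python
from email import message_from_string

def bogosity(raw_message):
    """Return (verdict, spamicity) from an X-Bogosity header, or (None, None)."""
    header = message_from_string(raw_message).get("X-Bogosity")
    if header is None:
        return None, None
    verdict, _, rest = header.partition(",")
    # The rest looks like "tests=bogofilter, spamicity=0.499150, version=0.12.2".
    fields = dict(part.strip().split("=", 1) for part in rest.split(",") if "=" in part)
    score = fields.get("spamicity")
    return verdict.strip(), float(score) if score is not None else None

def folder_for(raw_message, junk_above=0.9):
    """Pick a folder from the ranking; users could each choose their own cutoff."""
    verdict, score = bogosity(raw_message)
    if score is None:
        return "INBOX"
    if score >= junk_above:
        return "Junk"
    if verdict == "Unsure":
        return "Review"
    return "INBOX"
```

Because the filter only annotates the message, the decision about what to do with an "Unsure" stays with whoever reads the header, which fits the no-false-positives requirement described in the parent comment.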

  • I've got this installed for Outlook XP. Either I don't have it configured correctly (likely) or it just doesn't work well. Even using the emails in the spam folder to 'train' it, it still misses messages.
  • Mozilla Mail (Score:3, Informative)

    by respite ( 320388 ) on Thursday June 26, 2003 @01:42PM (#6304410)
    In case anyone hasn't tried it yet, the Bayesian filters in the mail client of the Mozilla suite are really impressive. They have worked close to flawlessly for me.
    • Re:Mozilla Mail (Score:3, Interesting)

      by drinkypoo ( 153816 )
      They work pretty well for me, but nowhere near flawless. Some days I get 25 messages that go into the spam folder and only 3 in my inbox, some days I get about 10 in the spam folder and 5 in the inbox... It's a lot better than nothing. The real reason I run Mozilla for mail is the HTML rendering, which is better than any other mail client I'm aware of; The secondary reason is the bayesian filtering, and the tertiary is Enigmail, though no one I know bothers to use encryption anyway.
  • by steveha ( 103154 ) on Thursday June 26, 2003 @01:47PM (#6304465) Homepage
    I wrote an article [linuxjournal.com] on how to set up SpamProbe on a server, and make it easy to train. You could also use Bogofilter or any other trainable spam filter, set up the same way.

    I get at least 100 spam messages a day now, and I only see about a half-dozen or so. SpamProbe deals with the rest, and I don't have any problems with false positives. (SpamAssassin thinks that ads for LinuxWorld Expo are spam, but as I have it trained, SpamProbe doesn't.)

    steveha
  • So-so article (Score:4, Insightful)

    by scottme ( 584888 ) on Thursday June 26, 2003 @01:48PM (#6304480)
    For an article in an "IT tech" section of a paper, this is really very weak.

    It really doesn't do much more than precis Paul Graham's arguments, then ends in a blatant plug for just one Outlook addon.

    I suppose if there are still people in the column's audience who haven't heard this all before, and it gets the message out that spam can be effectively filtered, it's a minor goodness.
  • by dioxn ( 640015 ) on Thursday June 26, 2003 @01:49PM (#6304488)
    I've noticed that the spam that has been getting through my Mozilla filter are the ones with innocuous sounding subjects and an embedded image.
    Could this be the future of spam?
    Does anyone know if any spam filters pick up on this pattern or lack of pattern (after all there are no words in the body usually.)
    • Bayesian is more or less word based, so graphical only messages fly right by my Mozilla mail filters. I believe it does the check after the html has rendered. If they ran the filter before the html was rendered, they might have slightly better results. Eventually all spammers will learn the undetectable patterns that only a handful seem to know now, and it will once again render mail filters useless. I hate HTML e-mail.
    • by zerocool^ ( 112121 ) on Thursday June 26, 2003 @02:20PM (#6304780) Homepage Journal
      Does anyone know if any spam filters pick up on this pattern or lack of pattern (after all there are no words in the body usually.)

      Um, only read emails in plain text? Use mh.
      inc; scan; show last
      By the way, those images are baaaad. Usually they're something like img src="blahblah.jpg?userid=32898392" and then, when you open it, there's a log of the image with the userid 32898392 being fetched. Therefore, they know that your email address is valid. So, it's a good idea to filter out images anyway.

      But, come on. Email is a medium for transmitting text. It's not supposed to have flowery backgrounds, blinking text, and embedded images. Maybe I'm a purist? But it's another thing that used to be beautifully simple that the explosion of advertising on the internet has rendered unusable.
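For anyone who wants to flag these tracking images rather than give up HTML mail entirely, a small sketch with Python's standard html.parser is below. It lists remote img sources and notes whether each carries a query string like the userid example above. The class and function names are made up for illustration, and a query string is only a hint, since the identifier can just as easily sit in the path.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse, parse_qs

class RemoteImageFinder(HTMLParser):
    """Collects (src, has_query_params) for every img fetched over http(s)."""

    def __init__(self):
        super().__init__()
        self.remote_images = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        src = dict(attrs).get("src") or ""
        parsed = urlparse(src)
        # cid: and data: images travel with the message and cannot phone home.
        if parsed.scheme in ("http", "https"):
            self.remote_images.append((src, bool(parse_qs(parsed.query))))

def find_remote_images(html_body):
    finder = RemoteImageFinder()
    finder.feed(html_body)
    return finder.remote_images
```

A mail client or filter could refuse to fetch anything this returns, or treat "no text plus one remote image" as a spam feature in its own right.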
    • This is one of the reasons I have configured Evolution to not display remote images, unless I request them. The other is that pulling remote images has the functionality of verifying your e-mail address. (server operator generates a couple million unique random numbers, creates a table of associations between e-mail names and the random numbers, sends each e-mail address their random number as an img src=protocol://server/uniqueRandomNumber/image.php, which does a lookup on the uniqueRandomNumber, and confirms
      • Turn off html mail for Outlook and help keep them from validating your address through this method.

        Place these two keys in .reg files of their own and be able to quickly switch between viewing html and plain text mail. taah dahhh!

        [HKEY_CURRENT_USER\Software\Microsoft\Office\10.0\Outlook\Options\Mail]
        "ReadAsPlain"=dword:00000001

        OR to turn it back on and view those pretty pictures

        [HKEY_CURRENT_USER\Software\Microsoft\Office\10.0\Outlook\Options\Mail]
        "ReadAsPlain"=dword:00000000
    • This isn't the *future* of spam, it's the *present* of spam. It's basically a way for spammers to track valid addresses - rather than wait for a bounce to kill one, wait for an open to validate one.
  • Mozilla (Score:3, Informative)

    by Little Dave ( 196090 ) on Thursday June 26, 2003 @01:51PM (#6304509) Homepage
    Having used the spam filtering built in to Mozilla for the last six months, I can testify to its effectiveness. In very little time at all, I'd trained it to send 95% of the filth to the spam directory and avoid doing the same for 95% of good mails. For me, not having to run a "middle man" piece of software was a real boon.

    However, my life isn't totally spam free, as I find that I become neurotic about those 5% false positives that get unhelpfully moved to the spam directory, so still end up having to sift through the grot every once in a while. On the plus side, I now have a solution to my tiny cock problem, I've arranged cheaper home insurance and I have the email address of several horny co-eds who I'm assured are hungry for man juice.
    • I don't want to look at the spam, ever. I want it to go to /dev/null before I even download my messages.

      This is at best a band-aid, and with the usual mistakes and slip-ups it hardly seems like a very good one. I mean, if I have to sort through my junk box to check for mislabeled emails, it's not doing me much of a favor.

      All this talk about smart filtering and I'm starting to feel like you've missed the point: you're still getting spam. Who cares if it's semi-sorted?
  • by Anonymous Coward on Thursday June 26, 2003 @01:54PM (#6304543)

    As I wrote only late last night [against.org], using Bayesian classification with only two categories (spam and "non-spam") is somewhat short-sighted, since if properly trained, a Bayes classifier can do a much better job than ordinary mail filtering (procmail, Mozilla or Mail.app filters, you name it).

    In fact, if I had to bet on the next "killer apps", mail sorting and RSS filtering based on Bayesian classification would be right at the top of my list, based solely on the actual time-saving benefits for users. And I can't see any reason for Bayesian filtering not being included in Mozilla Mail [mozilla.org] and Apple's own (revamped) Mail.app [apple.com].

    I have to use Outlook at work, and after setting up Outclass [vargonsoft.com] (which requires POPfile [sourceforge.net]) with several "buckets" to classify my corporate e-mail by project and field, I'm definitely not going back. Outlook, even with extensive use of Rules Wizard and categories, simply cannot cope with the diverse kinds of project-related e-mail I swap with colleagues, and Outclass [vargonsoft.com] is the only thing I could find that could deal with Exchange, PST folders and multiple Bayesian "bucket" categories.

    Come on, do the right thing and tell Apple [apple.com] and The Mozilla Project [mozilla.org] that you want configurable Bayesian filtering on their mail clients.

    • I've tried multiple buckets with popfile/outclass, it gets a significant percentage of the classification correct, but it also gets enough wrong to be a serious pain in the arse when important mails get automatically misclassified into a low priority mailing list folder.

  • by Daimaou ( 97573 ) on Thursday June 26, 2003 @01:58PM (#6304583)
    I hate spam just as much as the next person, but I must admit, without it I wouldn't be the horse-sized love stud that I am. Thanks spam.
  • Dirty Spammer Tricks (Score:3, Informative)

    by dprice ( 74762 ) <daprice.pobox@com> on Thursday June 26, 2003 @02:01PM (#6304603) Homepage

    I have been using the Mozilla junk mail filter for a couple of months now. One pop mail account is one that I started using in 1996. It is a spam magnet. In the time I have been using Mozilla, it has accumulated over 12,000 spam messages. That should be plenty of training for the Bayesian filter.

    Mozilla's filter does a reasonably good job at catching spam, but I still get a handful of messages every day that slip through the filter. The ones that slip through seem to be messages that have intentionally munged the spammy words with spaces, numbers, and misspellings. The spammers know that people are filtering, and they are successfully getting through the filter with their dirty tricks. Another trick spammers use is to send a message with nothing but a graphic ad. The filter doesn't have enough words to judge the spam, so the message slips through.

    I also had some 'ham' messages get filtered, so I still have the annoyance of having to check the 'junk' folder periodically for wanted messages. The filtering makes life easier, but it is still not an ideal solution to the spam problem.

    • The ones that slip through seem to be messages that have intentionally munged the spammy words with spaces, numbers, and misspellings. The spammers know that people are filtering, and they are successfully getting through the filter with their dirty tricks.

      Well this is really self-defeating on their part. Sure, now they are getting their spam past your filters, but are you going to remortgage your house with a company that promises you "The best m0rtgag3 rates in the universe! Apply now for these incredib
  • Our problem is that 99% of people read email via POP, and POP only serves one mailbox per person. It is extraordinarily difficult to train everyone to use a spam filter individually, and yet installing one on the server can't work with POP's limitations. Frustrating.

    Secondly, Microsoft is in the fray now. Bet any amount they will offer an authenticating email service that requires using Windows XP to work. It will work really well, you won't be able to communicate well with people who don't use it - standar
  • The spam I do see (Score:5, Interesting)

    by steveha ( 103154 ) on Thursday June 26, 2003 @02:02PM (#6304624) Homepage
    I'm using SpamProbe, and it blocks almost all spam I get.

    Much of the spam that gets past it is so minimalist it cannot be blocked by a Bayesian filter. I get messages like this:


    Subject: a nice lady wants to talk to you

    see the pictures [127.0.0.1]

    no more mail [127.0.0.1]


    It's like someone is trying to put so little in the message that there is nothing to filter. If only they would use the stock "We are sending you this because you opted-in on it. Click on this link to remove your address." If they used that, I'd never see the message; SpamProbe would grab it. But how could I train SpamProbe to detect the minimalist ones, without blocking everything forever?

    So far I don't get too many of the minimalist ones, and I just hit delete. If it becomes widespread, I'll have to start using Vipul's Razor [sourceforge.net] or something.

    The other kinds of spam that get past SpamProbe are the ones that have rampant misspellings. Since none of the words are in the database, they don't match as spam terms:


    Subject: make moneey on EBAYxbbid

    Want to make moneyzseqw? Click here...


    I really think that I should write a filter that spell-checks an email, and rejects it if over 50% of the words with 5 or more letters are misspelled.

    steveha
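
    The spell-check idea above could be sketched roughly as follows. This is only an illustration, not SpamProbe functionality: `KNOWN_WORDS` is a tiny hypothetical stand-in for a real dictionary, and the function names are made up for the example. It counts words of 5 or more letters and flags the message when more than half of them are unknown.

```python
import re

# Hypothetical mini-dictionary; a real filter would load a full word list.
KNOWN_WORDS = {
    "want", "make", "money", "click", "here", "please", "review",
    "attached", "report", "before", "friday", "meeting",
}

def misspelled_ratio(text, dictionary=KNOWN_WORDS, min_len=5):
    """Fraction of words with at least min_len letters missing from dictionary."""
    words = [w.lower() for w in re.findall(r"[A-Za-z]+", text) if len(w) >= min_len]
    if not words:
        return 0.0  # nothing long enough to judge
    unknown = sum(1 for w in words if w not in dictionary)
    return unknown / len(words)

def looks_misspelled(text, threshold=0.5):
    """True when more than `threshold` of the long words are unknown."""
    return misspelled_ratio(text) > threshold
```

    With the sample spam quoted above, three of the four long words ("moneey", "EBAYxbbid", "moneyzseqw") are unknown, so the message is flagged. Mail full of proper names or jargon would need a larger dictionary or a higher threshold.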
    • by GGardner ( 97375 ) on Thursday June 26, 2003 @02:25PM (#6304839)
      For the spammers who are trying to use misspellings to get around filters, I wonder if soundex could fix that problem quickly. That is, instead of doing the Bayesian calculations on the raw tokens, calculate probabilities based on the soundex [frontiernet.net] values of the tokens. You might need to teach soundex that the number one sounds like I, and other leet-speak-like things, but this might solve the problem quickly and easily.
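
      A minimal sketch of that idea, assuming standard American Soundex plus a small, hypothetical leet-speak table (`LEET`) applied before encoding. The substitutions chosen here are guesses for illustration, not taken from any existing filter.

```python
# Soundex digit for each consonant; vowels, h, w and y have no code.
CODES = {}
for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")):
    for ch in letters:
        CODES[ch] = digit

# Hypothetical leet-speak substitutions (an assumption for this sketch).
LEET = str.maketrans({"0": "o", "1": "i", "3": "e", "4": "a",
                      "5": "s", "7": "t", "@": "a", "$": "s"})

def soundex(token):
    """Return the 4-character Soundex code of token after leet normalization."""
    word = "".join(ch for ch in token.lower().translate(LEET) if ch.isalpha())
    if not word:
        return ""
    result = word[0].upper()
    prev = CODES.get(word[0], "")
    for ch in word[1:]:
        if ch in "hw":
            continue  # h and w do not separate letters with the same code
        code = CODES.get(ch, "")
        if code and code != prev:
            result += code
        prev = code  # a vowel resets prev, so a repeated code counts again
    return (result + "000")[:4]
```

      Here `soundex("m0rtgag3")` and `soundex("mortgage")` both give `M632`, so the two tokens would share one probability entry. Random letters appended to a word (as in "moneyzseqw") still change the code, so this would not catch every trick.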

    • I really think that I should write a filter that spell-checks an email, and rejects it if over 50% of the words with 5 or more letters are misspelled.
      >>>>>>
      That would rock. If such a filter was on Slashdot, Slashdot's post volume would drop by 90% :)
    • The original article (http://www.paulgraham.com/spam.html) actually talks about this ... he suggests actually visiting the URL mentioned and running the returned page against the bayes rules, also factoring in HTTP redirects, etc. ... of course, analyzing the URL itself can help too.
  • by dubStylee ( 140860 ) on Thursday June 26, 2003 @02:06PM (#6304658)
    Suppose

    1. I have a friend who uses the same kinds of words as I do and who uses Outlook (ok, an acquaintance, because friends don't let friends ...)

    2. An email virus attacks this person, snarfs up his Ham, runs a Bayesian filter on it and comes up with Spam specifically tailored for this person's acquaintances.

    There's a science fiction book waiting to happen in here somewhere. If so, I own the SCOpyright on it.
  • What I don't like (Score:5, Interesting)

    by Boyceterous ( 596732 ) on Thursday June 26, 2003 @02:06PM (#6304660)
    about this kind of filtering is that it has to download the email content - not always a good idea, especially in a Windows environment. Besides, I can identify spam just by looking at message header information. Sender, recipient, and subject line are nearly always enough. Plus I don't need to waste time or bandwidth, get subjected to offensive graphics, or risk 1-pixel confirmations or getting hacked by the latest security issue. My homespun message header analysis program drops nearly all spam, and results in few legit email rejections. I score the headers based on missing recipient, sender info, keywords in subject, string match in sender email or name, punctuation count in subject line, number of contiguous spaces in subject line, plus a few other things that seem to run common in the spam I get. I can also permit certain email addresses to pass no matter the score. It's not fancy, but it works, and I never have to waste time drawing the whole content down to my local machine. What I do may not work for everyone, but it seems that in most cases it should, unless you get a lot of email from unknown (non-spam) sources - not typical for the average email user.
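
    A header-only scorer along those lines might look like the sketch below. It is not the poster's program: the whitelist, keyword lists, weights and the `header_score` name are all hypothetical, chosen only to show the shape of the approach using Python's standard `email` parser.

```python
import re
from email import message_from_string

# All of these lists and weights are made-up illustrations.
WHITELIST = {"boss@example.com"}
SUBJECT_KEYWORDS = ("mortgage", "viagra", "free")
SENDER_PATTERNS = ("offers", "promo")

def header_score(raw_headers):
    """Score a message from its headers alone; higher means more spam-like."""
    msg = message_from_string(raw_headers)
    sender = (msg.get("From") or "").lower()
    subject = msg.get("Subject") or ""
    # Whitelisted senders pass no matter what else the headers contain.
    if any(addr in sender for addr in WHITELIST):
        return 0
    score = 0
    if not msg.get("To"):
        score += 2  # missing recipient
    if not sender:
        score += 2  # missing sender info
    score += sum(2 for kw in SUBJECT_KEYWORDS if kw in subject.lower())
    score += sum(1 for pat in SENDER_PATTERNS if pat in sender)
    if len(re.findall(r"[!?$*]", subject)) >= 3:
        score += 1  # heavy punctuation in the subject
    if re.search(r" {3,}", subject):
        score += 1  # run of contiguous spaces in the subject
    return score
```

    A caller would compare the score against a cutoff (say 4, again an arbitrary choice) and drop the message before fetching its body.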
  • Spam is a poor use. (Score:3, Interesting)

    by Lord Bitman ( 95493 ) on Thursday June 26, 2003 @02:07PM (#6304671)
    This is like inventing something as useful as the Knife, and using it only to attack salesmen. Why bother stopping with spam? Why not apply this filter to, say, absolutely everything? Since I just said "absolutely everything", I won't bother giving examples.
    Training something to know how likely something is to be true, that sounds too useful to waste any time with on spam at all.
  • by mpieters ( 149981 ) on Thursday June 26, 2003 @02:13PM (#6304726) Homepage
    SpamBayes was originally conceived by Tim Peters and co at Python Labs, who improved on the original algorithm considerably. From there on out, many people helped tune and perfect the implementation, making it the most effective Bayesian-based spam filtering tool currently available (IMNSHO).

    Mark Hammond then wrote the Outlook plugin, which, admittedly, is considerably more code than SpamBayes itself, but is not SpamBayes.

    For the complete background on why SpamBayes is so good at what it does, and its history, see:

    Mark's is not the only application frontend for SpamBayes, here is a list of others: No apologies offered for my pedantry.
  • by slagdogg ( 549983 ) on Thursday June 26, 2003 @02:48PM (#6305064)
    Bayes rocks, been using it with spamassassin and it kills 99% of my spam. The problem is when some asshole spammer uses my email address in the 'From' header of his spam ... then I get scores of 'user not found' or 'virus detected' emails from legitimate mail servers ... it's not spam, but it's just as annoying. How do you guys deal with this problem?
  • The math (Score:2, Informative)

    by bpfinn ( 557273 )
    I think Tom Mitchell did a good job in explaining the math in his book Machine Learning [amazon.com]. It's a very pricy book, so maybe you can look for a used copy.
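
    For a quick feel of the arithmetic, here is a minimal sketch of the naive Bayes combining rule described in Paul Graham's "A Plan for Spam", assuming each token's spam probability is already known and the tokens are treated as independent. The `combine` name and the clamping bounds are choices made for this sketch; SpamBayes and other filters combine evidence differently.

```python
from math import prod

def combine(token_probs, lo=0.01, hi=0.99):
    """Combine per-token spam probabilities into one message probability.

    Computes prod(p) / (prod(p) + prod(1 - p)), after clamping each p to
    [lo, hi] so that no single token can force the result to exactly 0 or 1.
    """
    clamped = [min(hi, max(lo, p)) for p in token_probs]
    spam = prod(clamped)
    ham = prod(1.0 - p for p in clamped)
    return spam / (spam + ham)
```

    Two spammy tokens at 0.99 and 0.9 combine to roughly 0.999, while one spammy and one hammy token (0.9 and 0.1) cancel out to about 0.5.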
  • This is the big question. Bayesian filtering has been in use for a couple of years now, and is well-proven, IMO. What's wrong with Microsoft? Why are they dragging their feet on this? They should have been shipping this with OE a couple of years ago, if not before. Not only would this have given the average user some relief, it would have slowed the recent explosion in spam itself. And it would have been so easy to do. Fuck Microsoft, one more time.

    • Microsoft hasn't done this for their own reasons, but consider that they have consistently dragged their feet from day one regarding system security on Win32, Lookout, etc., and are only recently talking about end user system security integrations.
  • When the hell did the term "Ham" start getting used? I missed it completely 'til this article.

    Not that it matters all that much, but Hormel, who has taken use of "Spam" in pretty good graces, can't be happy about this at all. It's one thing when your product is linked to a negatively-perceived other concept, but then the further implication that Ham=good, Spam=bad... hrrm.
  • I was at first leery of trying a Bayesian filter. I thought to myself, it's so simple, it won't work well enough. So I installed PopFile (integrates as a POP3 proxy on Windows, extremely easy to use/setup) and was amazed. After a bit of training I now have about a 97% accuracy, sometimes even higher.

    I also used SpamAssassin for a while, but it never seemed to do quite so good a job. It let a lot of junk get through until I set the threshold lower, then I got some false positives.

  • Now why would I care about an Outlook drop-in? Besides, the only sentence I could come up with that uses both "Outlook" and "useful" is, "Reformatting the hard drive is a useful way to remove Outlook and other viri, that also eliminates the MS detritus that common viri feed on."
  • ...is that they don't tie up port 8080, which you may need while doing web development locally. This isn't a huge problem (the defaults can be changed), but I wind up having to shut down Popfile when playing with Zope, for example.
  • by crazyphilman ( 609923 ) on Thursday June 26, 2003 @03:19PM (#6305308) Journal
    A Bayesian filter that reads personal ads, compares them to ads posted by women who are KNOWN to have been "easy" (on a sliding scale, configurable, ranging from "mildly slutty" to "dangerously psychotic nymphomaniac"), and returns a list of likely phone numbers.

    Hell, I'd pay MONEY for a piece of software THAT good (Hmm, clickety-click, select "nymphomaniac", enter search site... Ah! This one has an oral fixation! Thank you, Mr. Bayes!).

  • You guys are a bunch of hypocrites. You don't really want spam to stop. You love spam.

    Every spam thread is the same: I use X, and it blocks 98% of my spam, with no false positives! I use Y, and it blocks 99.9% -- take that! Here, I use Z + Y with these custom Perl scripts I wrote that interface with procmail and stop 101% of spam! It doesn't matter, because I never get ANY spam! Spam is only because people buy things in spam! What morons! Bow before me, for I am 1337!

    Spam gives you something to fi
  • It's semi-distributed, in that users install a small plugin for Outlook, adding "block" and "unblock" buttons to the tool bar. The entire community of users works against spammers.

    It works well. When I check my mail, I can watch the 50 or so spams I get daily pop into my inbox, and then promptly fly right back out again.

    (Blatantly stolen from Spamnet's Learn More page [cloudmark.com])
    When the message comes in, SpamNet generates a unique fingerprint of that message. The fingerprint is a one-way hash, or unique string of
  • I have Evo picking up my mail from server. Right now I have some simple filters, that catch the little bit of spam I get.

    Is there an 'integrated' solution that works within Evolution? I heard some time back they were going to do one.

    They have a command line filter. Maybe that can be utilized?

    Any experiences?

    thanks
    LinuxLover
  • That exists in various forms in various packages (that I want to see in Outlook, because that's what I use...bash as you will...but it works best for me for various reasons) is:
    - Only load images from HTML mail from addresses in your personal address book
    and
    - Whitelist/classify based on users in your address book.

    If I had those two additional features on top of my SpamBayes setup, I'd be very happy.
  • by esanbock ( 513790 ) on Thursday June 26, 2003 @03:56PM (#6305664)
    1. Use Debian
    2. apt-get install spamassassin
    3. apt-get install hotway
    4. Add this to your /etc/inetd.conf: pop3 stream tcp nowait nobody /usr/sbin/tcpd /usr/bin/hotwayd
    5. Switch to Kmail
    6. Menu: Settings|Configure Filters
    7. Add first filter.
    a. Select Match Any of the following
    b. Select size 250000
    c. Filter action: PIPE THROUGH spamassassin
    8. Add second filter
    a. Select 'Match any of the following'
    b. Type 'X-Spam-Flag' (no quotes)
    c. Select equals. Type 'YES'
    d. Filter action: Move to folder [your spam folder]
    9. It's crucial that the second filter happens after the first (use the arrows to the left).

    There you have it - a spam-free Hotmail account. Not quite setup.exe, but this is Linux after all.

  • I tried a few months ago to write a Spam filter in Python, but no matter what I tried, this was the only output I could receive:

    I DON'T LIKE SPAM! I DON'T LIKE SPAM! I DON'T LIKE SPAM!

  • by Moderation abuser ( 184013 ) on Thursday June 26, 2003 @05:43PM (#6306638)
    I've just been migrated to Notes from Outlook. Not a happy bunny till I discovered how powerful it is with stuff like agents.

    The only thing I'm missing now is a spam classification tool like popfile for notes.
  • For the Bazillionth Time, client side spam filtering does not address the problem. It's a waste of time.

    The more client-side filters that are in place, the more spam will increase. It's already a cat-and-mouse game between spammers and filters. In the meantime, almost 70% of existing mail traffic is UCE. Filters don't stop that at all.

    You have to stop spam at the source. You have to force spammers to act responsibly and not exploit network resources without appropriate compensation. Only when this
    • No it's not... (Score:5, Insightful)

      by Goonie ( 8651 ) * <robert.merkel@b[ ... g ['ena' in gap]> on Thursday June 26, 2003 @10:41PM (#6308231) Homepage
      Client side filtering is not an ideal spam solution, but it's a good thing on both a micro and macro scale.
      • For the 99% of people who don't respond to spam, it makes no difference to the spammer whether they filter it or delete it manually. At an individual level, it reduces the amount of spam I have to deal with to manageable levels.
      • For the 1% that *do* respond to spam, having a filter might reduce the amount of spam they respond to and thus reduce the financial rewards for spammers. Anything that reduces the financial rewards for spammers is going to help reduce the spam levels.
      • If spammers are spending all their time and money figuring out how to beat filters, that's time and money that they're not using to send spam.

      As for your indictment of spam filtering providers, could you please explain where the spamassassin devteam is making money?

      My choices with regards to spam at the moment are simple. Use spamassassin or something like it, or wade through spam myself. I know which I'd prefer.

