Working Bayesian Mail Filter
zonker writes "A real, working, honest-to-god Bayesian spam filter. I've been waiting for something like this for a while (since I first read Paul Graham's research paper on this very topic a few weeks ago). Well, here's POPFile, a small but extremely effective Perl script that runs on just about any system Perl does. After just a little training I was able to get very effective filtering out of it. From what I understand, the new email client that comes with OS X Jaguar has a feature similar to this, but I don't know if it is true Bayesian. Hopefully this kind of feature will become more prevalent in client software, as I see the Google results for it are growing."
What's that? (Score:2, Interesting)
Server-side solutions? (Score:3, Interesting)
Mozilla in Process of adding Bayesian filter (Score:5, Interesting)
Bayesian? Wow!!! I'm sooo excited. (Irony!) (Score:5, Interesting)
More intelligent classification algorithms can solve non-linear problems far better. Check out Kernel Machines [kernel-machines.org] and, somewhat older, Maximum Entropy models.
Enough nerd talk for today :-)
product of marketrons (Score:2, Interesting)
You know, on this issue, you really depress me. You are clearly not of the academic nature, so your stance toward something that's probably way above your head really frustrates part of me.
As long as you're not developing the idea, it shouldn't matter how it works as long as it works.
I read the original article here as you did too. After all the mumbo jumbo about learning, I picked out one effective tip from the article on filtering my email: filter out HTML.
With 1 line of regex I eliminate 95% of my spam:
match and throw it out.
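For anyone who wants to try the "filter out HTML" tip, here is a minimal Python sketch. It is not the one-line regex mentioned above; it assumes you have the raw message text and uses the standard `email` module to look for a `text/html` part. The function name `is_html_mail` is made up for illustration.

```python
import email


def is_html_mail(raw_message: str) -> bool:
    """Return True if the message, or any MIME part of it, is text/html."""
    msg = email.message_from_string(raw_message)
    # walk() yields the top-level message and then every nested MIME part.
    return any(part.get_content_type() == "text/html" for part in msg.walk())


if __name__ == "__main__":
    sample = "Subject: offer\nContent-Type: text/html\n\n<b>Buy now</b>\n"
    print("discard" if is_html_mail(sample) else "keep")
```

Note that this throws out legitimate HTML mail too, so it only suits people whose real correspondents write plain text.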
Re:Sure it's promising (Score:2, Interesting)
Re:Server-side solutions? (Score:4, Interesting)
You know what I'd kill for? (Score:3, Interesting)
I work on the helpdesk of a small ISP; I also take care of the spam filtering, and answer abuse@. We recently added SpamAssassin, and God does it rock [dowco.com]. (The big spike you see is me getting MRTG to graph what SA catches now; it's 6-10 times better than what we used to catch.)
But I still get complaints from our customers about spam that gets through. Just the other day a crapload got through because it was relatively subdued spam (no webbugs, NO LINE OF YELLING, etc); unfortunately, it also advertised pictures of young boys having sex. It's hard to explain why it's very, very hard to filter for this sort of thing, especially when I'm going through the talk for the nth time this week. (I need a good analogy that non-geeks can understand; I'm still looking.)
The good folks at DeerSoft [deersoft.com] have a version of SpamAssassin for Outlook, and are promising one for OE Real Soon Now. But I would loooooooooooooooooooooooove a good spam program -- this or SA or something else -- that I could point our customers to. Download, double-click, say yes, and bam it's installed. I can figure out how to install this on a Unix box; I could probably, eventually figure out how to do it on a Windows box; there's no way the customers could do it.
Or am I missing good, free spam filtering for Windows? Can anyone point me in the right direction?
Slightly OT: There has got to be a huge market for setting up spam filtering for small businesses. My idea: Tell 'em that if they provide the box -- an old Pentium or 486 will do -- I'll set up spam filtering and a firewall on it, set up some maintenance tools (whitelist this, firewall that). They get great mail service, I get $x00.
Re:Server-side solutions? (Score:4, Interesting)
Staged Categories (Score:2, Interesting)
Also, from what I saw in the article, there will already be a next level that spam can take: image-based messages, misspellings of key words (klik, Clic, Clik, etc), using 0xfe0000 for almost-bright-red.
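The almost-bright-red trick is at least easy to counter if the filter compares colours with a tolerance instead of matching exact values. A rough Python sketch, assuming colours arrive as hex strings like `0xfe0000` or `#fe0000`; the name `is_near_red` and the default tolerance of 0x20 are arbitrary choices for illustration, not taken from any existing filter.

```python
def is_near_red(color: str, tolerance: int = 0x20) -> bool:
    """Return True if a hex colour is within `tolerance` of pure red."""
    text = color.strip().lower()
    # Accept both "0xRRGGBB" and "#RRGGBB" spellings.
    for prefix in ("0x", "#"):
        if text.startswith(prefix):
            text = text[len(prefix):]
    value = int(text, 16)
    red = (value >> 16) & 0xFF
    green = (value >> 8) & 0xFF
    blue = value & 0xFF
    # Red channel close to full, green and blue channels close to zero.
    return red >= 0xFF - tolerance and green <= tolerance and blue <= tolerance


if __name__ == "__main__":
    for sample in ("0xff0000", "0xfe0000", "#00ff00"):
        print(sample, is_near_red(sample))
```

Misspelled keywords and image-only messages need different handling; this only covers the colour case.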
Re:Bayesian? Wow!!! I'm sooo excited. (Irony!) (Score:1, Interesting)
Apparently, it does work, though I can't whip out the references just now.
Anyway, naive Bayes is interesting mostly because it's so damn fast and only requires one pass through the data; and it works well, it just makes you feel stoopid because it's called "naive".
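To make the "one pass through the data" point concrete, here is a small Python sketch of a naive Bayes text classifier. Training just increments word counts per label, so each message is seen once. This is a textbook version with add-one smoothing, not POPFile's or Paul Graham's exact scoring; the class name, the whitespace tokenizer, and the "spam"/"ham" labels are assumptions for illustration.

```python
import math
from collections import Counter

LABELS = ("spam", "ham")


class NaiveBayes:
    def __init__(self):
        self.word_counts = {label: Counter() for label in LABELS}
        self.doc_counts = Counter()

    def train(self, text: str, label: str) -> None:
        # One pass: each message only bumps counters, nothing is revisited.
        self.doc_counts[label] += 1
        self.word_counts[label].update(text.lower().split())

    def log_score(self, text: str, label: str) -> float:
        vocab = set()
        for counts in self.word_counts.values():
            vocab.update(counts)
        total_words = sum(self.word_counts[label].values())
        total_docs = sum(self.doc_counts.values())
        # Add-one smoothing on the prior and on every word likelihood;
        # the extra +1 in the denominator reserves a slot for unseen words.
        score = math.log((self.doc_counts[label] + 1) / (total_docs + len(LABELS)))
        for word in text.lower().split():
            count = self.word_counts[label][word]
            score += math.log((count + 1) / (total_words + len(vocab) + 1))
        return score

    def classify(self, text: str) -> str:
        return max(LABELS, key=lambda label: self.log_score(text, label))


if __name__ == "__main__":
    nb = NaiveBayes()
    nb.train("buy cheap pills now", "spam")
    nb.train("cheap pills online", "spam")
    nb.train("meeting at noon tomorrow", "ham")
    nb.train("lunch tomorrow at noon", "ham")
    print(nb.classify("cheap pills"))
    print(nb.classify("meeting tomorrow"))
```

The "naive" part is the loop in `log_score`: it multiplies per-word probabilities (adds their logs) as if the words were independent of each other.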
Re:Bayesian? Wow!!! I'm sooo excited. (Irony!) (Score:3, Interesting)
In the long view, all compression is machine learning anyway
What about random misspellings? (Score:2, Interesting)
Multi-purpose tool (Score:3, Interesting)
Re:What about random misspellings? (Score:3, Interesting)
Ifile does this, bogofilter does this with some wangling in procmail,
That way, if someone sends something that's still mostly spam (one or two words in common with spam, enough to tip the balance) then all the neutral words will be tarnished as well.
Re:What about random misspellings? (Score:2, Interesting)
Re:Sure it's promising (Score:4, Interesting)
So if nothing else, it will force spammers to stop using red text - that has to be some kind of victory
Tim
Growing a spam filter -- a firsthand experience (Score:4, Interesting)
So, the graduate CS course I'm taking this quarter is Evolutionary Computing, which is all about the convoluted nonlinear multidimensional-search-space problems, and guess what our current homework is? That's right, taking statistics on spam data, and using genetic algorithms to evolve a working spam filter.
Due to one typo and two thinkos in my fitness evaluation function, my algorithm evolves -- within only a few dozen generations -- a solution which looks like this:
And it's right.
Re:Professional Looking Spam May Be Impossible (Score:1, Interesting)
There is nothing accidental or slap-dash about the layout, or use of colour, or any of the factors involved in laying out an email that will generate sales. I know this because it's my job to know about it: I'm in the porn business.
You might hate spam - I know I do - but it works. It works very well. And the way the email looks makes it work best of all.
SpamAssassin (Score:3, Interesting)
(Why hasn't it been done? Bayesian networks are only taught in AI and statistics classes).
What really interests me is that SpamAssassin claims to use a genetic algorithm [spamassassin.org] to rate how likely an e-mail is to be spam.