Spam Trap Claims 10x-100x Accuracy Gain
SpiritGod21 writes in with a NYTimes article on a new approach to spam detection that claims out-of-the-box improvement of 1 or 2 orders of magnitude over existing approaches. The article wanders off into human-interest territory as the inventor, Steven T. Kirsch, has an incurable disease and an engineer's approach to fighting it. But a description of the anti-spam tech, based on the reputation of the receiver and not the sender, is worth a read.
Makes sense (Score:5, Informative)
Dan East
Re:x100 improvement in accuracy? (Score:4, Informative)
Over 99 percent spam blocking means fewer than one mistake in every 100 messages processed. That's 10 to 100 times fewer mistakes than any other available system.
Dan East
Re:Kinda flawed (Score:5, Informative)
So, if I understood the article correctly, this technology will classify more email as spam the more spam you have received.
No, that's not how it works at all. Let me try putting it as a concrete example. You have a friend, Jane, who likes to swap stupid chain emails, subscribes to all kinds of "voluntary spam," and generally receives 1000 spam mails a day. Jane's a great lady, don't get me wrong, but you know the type of person I mean. You talk to her in real life, but over email she is incredibly annoying, as most of her messages are essentially meaningless.
Now, let's say that BOTH YOU AND JANE receive the same message M. Now, you know Jane, and you know the kind of messages she typically receives (mindless, at least in YOUR eyes). What are the chances that this message M is something that YOU will be interested in? Probably very low. The vast majority of email Jane receives is "crap," at least according to your definition, and so the very fact that Jane received message M greatly increases the likelihood that it is "crap."
Does that make better sense?
No (Score:0, Informative)
Re:Yet another wrong answer... (Score:4, Informative)
Re:x100 improvement in accuracy? (Score:3, Informative)
Re:x100 improvement in accuracy? (Score:3, Informative)
If the old way caught 95% and a new way catches 99%, then you could say it's 4.2% better (4/95) or 4 percentage points better, or you could say it's gone from missing 5% to missing 1% for 80% better (4/5), or say it's 5 times better (1% missed compared with 5%). Guess which most people choose to use?
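The four framings above can be checked with a few lines of arithmetic. This is a minimal sketch using the 95% and 99% catch rates from the comment; the function name `framings` and the dictionary keys are made up for illustration.

```python
def framings(old_catch, new_catch):
    """Describe one change in catch rate four different ways (inputs in percent)."""
    old_miss = 100 - old_catch
    new_miss = 100 - new_catch
    return {
        # Gain relative to the old catch rate: 4/95
        "relative_gain_pct": (new_catch - old_catch) * 100 / old_catch,
        # Plain difference between the two percentages
        "percentage_points": new_catch - old_catch,
        # Share of the old misses that were eliminated: 4/5
        "miss_reduction_pct": (old_miss - new_miss) * 100 / old_miss,
        # Ratio of old misses to new misses: 5% vs 1%
        "times_fewer_misses": old_miss / new_miss,
    }

result = framings(95, 99)
for name, value in result.items():
    print(f"{name}: {value:.1f}")
```

The same two measurements give anything from "4.2% better" to "5 times better", depending on which denominator is picked.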
Re:Ummmm.... (Score:3, Informative)
They're using LOTS of accounts to grade e-mail. It doesn't work at all unless you're an ISP with lots of different accounts to monitor. The idea is that if a bunch of people get the same e-mail (already a good indicator of its spamminess), and people who get lots of spam are more likely to have received it than people who don't get much spam at all, then the message is more likely spam.
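A toy version of that idea might look like the sketch below. The real scoring method has not been published, so everything here is an assumption for illustration only: the account names, the per-recipient spam fractions, and the choice of a plain average are all hypothetical.

```python
from statistics import mean

# Hypothetical history: fraction of each account's past mail that was spam.
spam_fraction = {
    "jane": 0.98,      # swaps chain mail, gets ~1000 spams a day
    "honeypot": 1.00,  # address that only ever receives spam
    "you": 0.20,
    "bob": 0.10,
}

def message_score(recipients, history):
    """Average the historical spam fraction of everyone who received the message."""
    return mean(history[r] for r in recipients)

# The same message seen by heavy spam recipients vs. light ones.
bulk = message_score(["jane", "honeypot"], spam_fraction)
personal = message_score(["you", "bob"], spam_fraction)
print(f"seen by heavy spam recipients: {bulk:.2f}")
print(f"seen by light spam recipients: {personal:.2f}")
```

A plain average is the crudest possible way to combine the recipients; it is only meant to show why the server needs many accounts before the signal becomes useful.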
You are also totally wrong (Score:1, Informative)
Sorry for AC but I've already moderated in this discussion.
Re:Kinda flawed (Score:3, Informative)
Only the mail relay IP address can be determined unambiguously - that's the host which is connecting to the host which is checking the mail for spamminess.
The inventor responds... (Score:5, Informative)
It would be difficult for me to answer each and every comment, so I'll try to just hit the high points here.
It's quite easy to poke fun at an algorithm which is unknown to you as demonstrated by all the comments.
But what's more relevant is whether really smart people who know the algorithm can find fault with it. There are only two people outside of Abaca who know the algorithm: Stephen Wolfram (author of Mathematica) and University of Waterloo Professor Gordon V. Cormack (a well-known figure in the anti-spam community). I picked Wolfram because he's the smartest pure math guy I know. I picked Cormack because I think he is one of the smartest and most respected scientists in the spam field. You could contact either of them and ask them what they think of the approach. I can tell you what they'd say if you did that. They'd tell you it is a simple, elegant algorithm that has no obvious (to them) holes. I know that because the reason I disclosed it to them was to see if I overlooked anything. Neither found any holes. That doesn't prove that there aren't holes. All systems have holes. What this does mean is that a couple of pretty respected experts think it appears to be pretty solid logic.
In fact, Gordon was kind enough to go even further and gave me permission to use the following quote: "This is, by far, the most clever technique I'm aware of for spam filtering." Since Gordon is conference chair for a lot of spam conferences, this is a pretty significant endorsement from someone who KNOWS the full algorithm and who knows the spam space better than just about anyone.
I spent about 4 years studying what others had done in the space. As one commenter pointed out, the recipient reputation system can be thought of as a generalization of the honeypot technique that was first patented by Brightmail.
That's exactly right. My realization is that every email address has statistical value, not just honeypots. So instead of just "black" feedback, the system incorporates "grey" and "white" feedback; every recipient has a priori odds associated with receiving mail. For many years, Brightmail was the de facto standard for spam filtering. Brightmail is just a special case of the algorithm I invented. So instead of learning from honeypots, we learn from ALL recipients and incorporate that statistical input in a mathematically rigorous way in order to compute a statistical likelihood that our prediction was correct. That gives us much more input than a honeypot system: it gives us white, black, and grey values. That is critical to avoiding false positives because good sites (like Yahoo and Hotmail) send email to honeypots all the time. And we incorporate that feedback into a statistical framework that is much more accurate than what Brightmail used.
Exactly how we incorporate that input into spam scoring has not been publicly disclosed. It is not obvious.
People who say that this must be snake oil or cannot work ignore the fact that the system has been in use by real customers for more than a year. We have over 100 customers and are just announcing our existence to the world, so that number should increase quite rapidly now that we are starting to market our product. There are customer testimonials on our website. You can contact them directly to verify that these quotes are legitimate.
Here are statistics from one of our rating servers. There were 1,380,140 messages since the last counter reset. 96% were rated spam. There were 176 false positives and 66 false negatives reported. I just grabbed those stats from one of our live servers right now as I was composing this message. Sometimes we're better, sometimes we're worse, but those numbers are pretty typical.
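For scale, the quoted counters can be turned into an overall error rate. This is a minimal sketch that uses only the three numbers given above; the variable names are made up, and because the counts are *reported* errors, the true rate could be higher.

```python
# Figures quoted from the rating server.
total = 1_380_140
false_positives = 176
false_negatives = 66

# Combine both kinds of reported mistakes into one overall rate.
errors = false_positives + false_negatives
error_rate = errors / total

print(f"total reported errors: {errors}")
print(f"overall reported error rate: {error_rate:.4%}")
print(f"roughly one reported error per {total // errors:,} messages")
```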
It's not perfect, but I think those are pretty good error rates for where we are now. And the stats always get better as we add more customers since we get more statistical input and this is just a statistical estimation problem. The more data, the more accurate.
You don't know how SMTP works (Score:2, Informative)
> of a particular message, but spammers can easily ensure
> that information isn't easily obtained.
Nonsense. You're confusing the body from/to with the envelope from/to.
Spammers can't hide the envelope from/to.
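The distinction can be sketched in a few lines. The envelope is what the sending client states in the SMTP `MAIL FROM` and `RCPT TO` commands, and the headers are just text inside the message. All addresses below are made-up examples and `smtp_commands` is a hypothetical helper, not part of any library. It only builds the command strings and does not talk to a server.

```python
from email.message import EmailMessage

def smtp_commands(envelope_from, envelope_rcpts):
    """Build the envelope commands an SMTP client sends before the message data."""
    commands = [f"MAIL FROM:<{envelope_from}>"]
    commands += [f"RCPT TO:<{rcpt}>" for rcpt in envelope_rcpts]
    commands.append("DATA")
    return commands

# Header (body) addresses: free-form text the sender can set to anything.
msg = EmailMessage()
msg["From"] = "Your Bank <security@bank.example>"
msg["To"] = "Valued Customer <customer@example.com>"
msg["Subject"] = "Account notice"
msg.set_content("Hello")

# Envelope addresses: what the receiving server is actually told.
envelope = smtp_commands("bounce@sender.example",
                         ["jane@isp.example", "you@isp.example"])

print("\n".join(envelope))
print("Header To:", msg["To"])
```

The header `To:` never mentions `jane@isp.example` or `you@isp.example`, yet the receiving server sees both in `RCPT TO`, because it needs them to deliver the message.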
Re:The inventor responds... (Score:3, Informative)
I have to say that that is the dumbest remark about software design I've ever heard. I've worked with lots of really smart people and I've seen them all miss bugs that were obvious to other people. Wolfram recently missed an error in a proof, for example.
It's more useful to have a lot of reasonably smart people look at something than have TWO (2) supposedly "really smart" people.
But, anyway, spam is a solved issue for me - I use greylisting and get maybe 1 spam per week. I can imagine a system that reduces that to 1 per month but I don't care enough to go out of my way to install such a system. Greylisting maintained that level of protection at the start of last year where I had over a million attempted deliveries over a six month period, so I just don't see the need for anything more complex.
Plus, on most days I don't have to spend ANY time managing my email; the peak of activity, on a day when spam gets through, is having to press "delete".
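For readers who haven't met greylisting: the server temporarily rejects the first delivery attempt from an unknown (client IP, sender, recipient) triplet and accepts a retry after a delay, on the theory that real mail servers retry and much spam software does not. Below is a minimal sketch of that logic, not any particular implementation. The class name, the 300-second delay, and the reply text are assumptions.

```python
class Greylist:
    """Temporarily reject the first delivery attempt from each unseen triplet."""

    def __init__(self, min_delay=300):
        self.min_delay = min_delay  # seconds a sender must wait before a retry is accepted
        self.first_seen = {}        # (client_ip, mail_from, rcpt_to) -> first attempt time

    def check(self, client_ip, mail_from, rcpt_to, now):
        key = (client_ip, mail_from, rcpt_to)
        # Record the first attempt; later attempts reuse the stored timestamp.
        first = self.first_seen.setdefault(key, now)
        if now - first < self.min_delay:
            return "451 4.7.1 Greylisted, please try again later"
        return "250 OK"

greylist = Greylist()
triplet = ("192.0.2.1", "sender@a.example", "me@b.example")
first_try = greylist.check(*triplet, now=0)
early_retry = greylist.check(*triplet, now=60)
late_retry = greylist.check(*triplet, now=600)
print(first_try, early_retry, late_retry, sep="\n")
```

A real deployment would also expire old triplets and whitelist known-good senders; those parts are left out here.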
Exactly how we incorporate that input into spam scoring has not been publicly disclosed.
Then it's worthless. You're asking us to trust that YOU will find the holes and fix them before the spammers find them and exploit them. No deal; I don't care how smart your friends are, a botnet getting updated with an exploit for your private project would be a nightmare and I can't fix it if it happens while you're in bed or on holiday.
TWW
Re:Yet another wrong answer... (Score:3, Informative)
Do you know how much it costs your ISP to run the mail infrastructure for your legitimate mail?
Triple it. That's the cost of spam.
Re:You are also totally wrong (Score:4, Informative)
But it's really designed to be a corporate product. So even if each spam email contains only one recipient, they all go through the corporate email server, allowing it to see all the various recipients a given sender is emailing.
And there were even hints that the software stored on your corporate mail server might be sharing some information with a central data store, allowing it to get the score of all the recipients that the sender is sending to on any network that is a customer of this product. (So it doesn't matter so much if your company only has 10 people to average across because it is somehow cross checking against the global dataset which is tens of thousands of recipients.)
Re:Is linux for homos? (Score:4, Informative)
Not all homosexuals are happy, cheerful people either.