Spam Trap Claims 10x-100x Accuracy Gain

Posted by kdawson on Monday December 03, 2007 @11:31PM from the see-it-when-i-believe-it dept.

SpiritGod21 writes in with a NYTimes article on a new approach to spam detection that claims out-of-the-box improvement of 1 or 2 orders of magnitude over existing approaches. The article wanders off into human-interest territory as the inventor, Steven T. Kirsch, has an incurable disease and an engineer's approach to fighting it. But a description of the anti-spam tech, based on the reputation of the receiver and not the sender, is worth a read.

This discussion has been archived. No new comments can be posted.

Makes sense (Score:5, Informative)

by Dan East ( 318230 ) writes: on Monday December 03, 2007 @11:45PM (#21567785) Journal

I own a number of domains, and receive all email to each domain in a catch-all account. I receive a great deal of email addressed to totally fictitious accounts at my domains. Those recipients receive 0% legitimate email, so anything sending to those accounts is 100% certainly a spammer. Basically what Abaca is doing is working with all the shades of gray in between. Also, this is a system that can only be employed at the server level. It's not like you could add this technology to your standalone email client.

Dan East

Re:x100 improvement in accuracy? (Score:4, Informative)

by Dan East ( 318230 ) writes: on Monday December 03, 2007 @11:47PM (#21567803) Journal

Misquoted by the Slashdot story as usual. FTA:
> Over 99 percent spam blocking means fewer than one mistake in every 100 messages processed. That's 10 to 100 times fewer mistakes than any other available systems.

Dan East

Re:Kinda flawed (Score:5, Informative)

by pclminion ( 145572 ) writes: on Monday December 03, 2007 @11:51PM (#21567825)

> So, if I understood the article correctly, this technology will classify more email as spam the more spam you have received.

No, that's not how it works at all. Let me try putting it as a concrete example. You have a friend, Jane, who likes to swap stupid chain emails, subscribes to all kinds of "voluntary spam," and generally receives 1000 spam mails a day. Jane's a great lady, don't get me wrong, but you know the type of person I mean. You talk to her in real life, but over email she is incredibly annoying, as most of her messages are essentially meaningless.

Now, let's say that BOTH YOU AND JANE receive the same message M. Now, you know Jane, and you know the kind of messages she typically receives (mindless, at least in YOUR eyes). What are the chances that this message M is something that YOU will be interested in? Probably very low. The vast majority of email Jane receives is "crap," at least according to your definition, and so the very fact that Jane received message M greatly increases the likelihood that it is "crap."

Does that make better sense?

No (Score:0, Informative)

by Anonymous Coward writes: on Monday December 03, 2007 @11:53PM (#21567839)

It totally takes how much legitimate email each individual gets into account. What they are saying is that if 30% of the emails I receive are usually spam, then my personal spam filter should mark about 30% of my email as spam. It should sort my mail based on how spammy it looks and then kill the top 25%, pass through the bottom 65%, and maybe give some extra scrutiny to the middle 10%. It's a pretty interesting idea.

Re:Yet another wrong answer... (Score:4, Informative)

by wizardforce ( 1005805 ) writes: on Tuesday December 04, 2007 @12:04AM (#21567907) Journal

How do you propose we remove the economic incentive for spam? OK, let's see how this has been attempted or hypothesized in the past. Charge a fee per email rather than a blanket fee from the ISP for access? Most of the real spam is sent through compromised PCs, so charging a fee per email is useless: the people in control of this spam-net are not the ones paying the bandwidth/email fees. Pass laws against it? That doesn't work either; the remaining spam-nets will still work because the law cannot be enforced in the host country, let alone in all the places that are not subject to it. Build better spam traps? Tried that, and it isn't doing so well; spam is still getting through in large numbers. Educate people? That will certainly make things better in a lot of ways, but there will still be that twat who actually wants to get spam... Have ISPs cut off high-bandwidth connections from those suspected of spamming? Can anyone say privacy nightmare? As much as I hate spam, I hate the idea of ISPs snooping through your email no matter what their reasons are. Now what?

Re:x100 improvement in accuracy? (Score:3, Informative)

by teh moges ( 875080 ) writes: on Tuesday December 04, 2007 @12:10AM (#21567953) Homepage

No. If previous methods let through one in 100 (1%) then a 10x improvement would result in one in 1000 getting through (0.1%).

Re:x100 improvement in accuracy? (Score:3, Informative)

by sholden ( 12227 ) writes: on Tuesday December 04, 2007 @01:34AM (#21568499) Homepage

They always measure it backwards, since it makes the numbers sound much better...

If the old way caught 95% and a new way catches 99%, then you could say it's 4.2% better (4/95) or 4 percentage points better, or you could say it's gone from missing 5% to missing 1% for 80% better (4/5), or say it's 5 times better (1% missed compared with 5%). Guess which most people choose to use?
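
To make the four framings concrete, here is a small sketch that computes each of them from the same pair of catch rates. The function name and dictionary keys are made up for illustration; only the 95% and 99% figures come from the comment above.

```python
def compare_filters(old_catch: float, new_catch: float) -> dict:
    """Express one improvement in catch rate four different ways."""
    old_miss = 1.0 - old_catch
    new_miss = 1.0 - new_catch
    return {
        # (new - old) / old: the "4/95" figure
        "relative_gain_in_catch_rate": (new_catch - old_catch) / old_catch,
        # plain difference, in percentage points
        "percentage_points": (new_catch - old_catch) * 100,
        # share of the old misses that are now caught: the "4/5" figure
        "relative_reduction_in_misses": (old_miss - new_miss) / old_miss,
        # old misses divided by new misses: the "N times better" figure
        "miss_ratio": old_miss / new_miss,
    }


if __name__ == "__main__":
    for name, value in compare_filters(0.95, 0.99).items():
        print(f"{name}: {value:.3f}")
```

The last entry, the miss ratio, is the only one of the four that produces a headline like "10x-100x".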

Re:Ummmm.... (Score:3, Informative)

by ceoyoyo ( 59147 ) writes: on Tuesday December 04, 2007 @01:49AM (#21568597)

Not quite (the AC who replied and got modded up is also incorrect).

They're using LOTS of accounts to grade e-mail. It doesn't work at all unless you're an ISP with lots of different accounts to monitor. The idea is that when a bunch of people get the same e-mail (already a good indicator of its spamminess), and people who get lots of spam are more likely to have received it than people who don't get much spam at all, the message is more likely spam.
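
A rough sketch of that intuition follows. This is NOT Abaca's algorithm, which has not been disclosed; it is a naive stand-in that treats each recipient's historical spam rate as independent evidence and sums the log-odds. The function names, the 0.5 prior, and the example rates are all assumptions.

```python
import math


def spam_log_odds(recipient_spam_rates, prior=0.5):
    """Sum the log-odds implied by each recipient's historical spam rate.

    Naive independence assumption; an illustrative stand-in only.
    """
    eps = 1e-6  # keep rates away from 0 and 1 so log() stays finite
    total = math.log(prior / (1 - prior))
    for rate in recipient_spam_rates:
        r = min(max(rate, eps), 1 - eps)
        total += math.log(r / (1 - r))
    return total


def spam_probability(recipient_spam_rates, prior=0.5):
    """Convert the summed log-odds back to a probability (stable sigmoid)."""
    z = spam_log_odds(recipient_spam_rates, prior)
    if z >= 0:
        return 1 / (1 + math.exp(-z))
    e = math.exp(z)
    return e / (1 + e)


if __name__ == "__main__":
    # Message seen only by recipients whose mail is mostly spam
    print(spam_probability([0.95, 0.90, 0.99]))
    # Message seen only by recipients who rarely get spam
    print(spam_probability([0.05, 0.10]))
```

A catch-all address that never gets legitimate mail is the extreme case, with a rate of 1.0, and a single such recipient dominates the sum. That matches the honeypot behaviour described elsewhere in the thread.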

You are also totally wrong (Score:1, Informative)

by Anonymous Coward writes: on Tuesday December 04, 2007 @03:32AM (#21569133)

You have got the system completely BACKWARDS.

Sorry for AC but I've already moderated in this discussion.

Re:Kinda flawed (Score:3, Informative)

by elronxenu ( 117773 ) writes: on Tuesday December 04, 2007 @04:04AM (#21569273) Homepage

It's not possible to reliably determine the originating sender's IP address, because this would have to come from the message headers, and the sender of a message can forge those headers to say anything it likes. The original IP address could be behind RFC1918 address space (like mine) or simply be fake.
Only the mail relay IP address can be determined unambiguously - that's the host which is connecting to the host which is checking the mail for spamminess.

The inventor responds... (Score:5, Informative)

by propelCEO ( 661892 ) writes: on Tuesday December 04, 2007 @04:11AM (#21569301) Homepage

Thank you for all the comments on the NY Times article.

It would be difficult for me to answer each and every comment, so I'll try to just hit the high points here.

It's quite easy to poke fun at an algorithm which is unknown to you as demonstrated by all the comments.

But what's more relevant is whether really smart people who know the algorithm can find fault with it. There are only two people outside of Abaca who know the algorithm: Stephen Wolfram (author of Mathematica) and University of Waterloo Professor Gordon V. Cormack (a well known figure in the anti-spam community). I picked Wolfram because he's the smartest pure math guy I know. I picked Cormack because I think he is one of the smartest and most respected scientists in the spam field. You could contact either of them and ask them what they think of the approach. I can tell you what they'd say if you did that. They'd tell you it is a simple, elegant algorithm that has no obvious (to them) holes. I know that because the reason I disclosed it to them was to see if I overlooked anything. Neither found any holes. That doesn't prove that there aren't holes. All systems have holes. What this does mean is that a couple of pretty respected experts think it appears to be pretty solid logic.

In fact, Gordon was kind enough to go even further and gave me permission to use the following quote: "This is, by far, the most clever technique I'm aware of for spam filtering." Since Gordon is conference chair for a lot of spam conferences, this is a pretty significant endorsement from someone who KNOWS the full algorithm and who knows the spam space better than just about anyone.

I spent about 4 years studying what others had done in the space. As one commenter pointed out, the recipient reputation system can be thought of as a generalization of the honeypot technique that was first patented by Brightmail.

That's exactly right. My realization is that every email address has statistical value, not just honeypots. So instead of just "black" feedback, the system incorporates "grey" and "white" feedback; every recipient has a priori odds associated with receiving mail. For many years, Brightmail was the de facto standard for spam filtering. Brightmail is just a special case of the algorithm I invented. So instead of learning from honeypots, we learn from ALL recipients and incorporate that statistical input in a mathematically rigorous way in order to compute a statistical likelihood that our prediction was correct. That gives us much more input than a honeypot system: it gives us white, black, and grey values. That is critical to avoiding false positives because good sites (like Yahoo and Hotmail) send email to honeypots all the time. And we incorporate that feedback into a statistical framework that is much more accurate than what Brightmail used.

Exactly how we incorporate that input into spam scoring has not been publicly disclosed. It is not obvious.

People who say that this must be snake oil or cannot work ignore the fact that the system has been in use by real customers for more than a year. We have over 100 customers and are just announcing our existence to the world, so that number should increase quite rapidly now that we are starting to market our product. There are customer testimonials on our website. You can contact them directly to verify that these quotes are legitimate.

Here are statistics from one of our rating servers. There were 1,380,140 messages since the last counter reset. 96% were rated spam. There were 176 false positives and 66 false negatives reported. I just grabbed those stats from one of our live servers right now as I was composing this message. Sometimes we're better, sometimes we're worse, but those numbers are pretty typical.
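
For readers who want those counts as rates, here is a back-of-the-envelope sketch. It assumes the reported false positives and false negatives are the only errors and that "96% were rated spam" is measured against the full message count; the variable names are illustrative, not taken from any real server output.

```python
total_messages = 1_380_140
rated_spam_fraction = 0.96
false_positives = 176  # good mail reported as wrongly rated spam
false_negatives = 66   # spam reported as wrongly rated good

rated_spam = round(total_messages * rated_spam_fraction)
rated_ham = total_messages - rated_spam

# Correct the rated counts by the reported mistakes to estimate true counts.
actual_spam = rated_spam - false_positives + false_negatives
actual_ham = rated_ham + false_positives - false_negatives

overall_error_rate = (false_positives + false_negatives) / total_messages
false_positive_rate = false_positives / actual_ham    # share of good mail blocked
false_negative_rate = false_negatives / actual_spam   # share of spam let through

if __name__ == "__main__":
    print(f"overall error rate:  {overall_error_rate:.5%}")
    print(f"false positive rate: {false_positive_rate:.5%}")
    print(f"false negative rate: {false_negative_rate:.5%}")
```

Because only about 4% of the traffic is legitimate, the false positive rate measured against good mail comes out far higher than the overall error rate. Note also that unreported mistakes are not counted at all.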

It's not perfect, but I think those are pretty good error rates for where we are now. And the stats always get better as we add more customers since we get more statistical input and this is just a statistical estimation problem. The more data, the more accurate
Read the rest of this comment...

You don't know how SMTP works (Score:2, Informative)

by Anonymous Coward writes: on Tuesday December 04, 2007 @05:10AM (#21569543)

> The big assumption is that you can identify the recipients
> of a particular message, but spammers can easily ensure
> that information isn't easily obtained.

Nonsense. You're confusing the body from/to with the envelope from/to.

Spammers can't hide the envelope from/to.
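
A small sketch of the distinction, using only Python's standard `email` package and nothing that touches the network. The addresses and the `smtp_commands` helper are hypothetical. The point is that the `To:` header is just text inside the message, while the envelope recipients are given to the receiving server in separate `RCPT TO` commands, and those must be real for the mail to be delivered.

```python
from email.message import EmailMessage

# Headers live inside the message body data; the sender can write anything.
msg = EmailMessage()
msg["From"] = "friend@example.org"
msg["To"] = "someone-else@example.net"  # need not match any real recipient
msg["Subject"] = "hello"
msg.set_content("body text")

# The envelope is separate from the message and drives actual delivery.
envelope_from = "bounce@sender.example"
envelope_to = ["alice@example.com", "bob@example.com"]


def smtp_commands(mail_from, rcpt_to):
    """Envelope commands an SMTP client issues before DATA (hypothetical helper)."""
    return [f"MAIL FROM:<{mail_from}>"] + [f"RCPT TO:<{r}>" for r in rcpt_to]


if __name__ == "__main__":
    for line in smtp_commands(envelope_from, envelope_to):
        print(line)
    print("Header To:", msg["To"])
```

This is also why `smtplib.SMTP.sendmail(from_addr, to_addrs, msg)` takes the envelope addresses as arguments separate from the message itself. The receiving server always sees every `RCPT TO` it accepts, whatever the headers claim.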

Re:The inventor responds... (Score:3, Informative)

by nagora ( 177841 ) writes: on Tuesday December 04, 2007 @07:03AM (#21570001)

> But what's more relevant is whether really smart people who know the algorithm can find fault with it.

I have to say that that is the dumbest remark about software design I've ever heard. I've worked with lots of really smart people and I've seen them all miss bugs that were obvious to other people. Wolfram recently missed an error in a proof, for example.

It's more useful to have a lot of reasonably smart people look at something than have TWO (2) supposedly "really smart" people.

But, anyway, spam is a solved issue for me - I use greylisting and get maybe 1 spam per week. I can imagine a system that reduces that to 1 per month but I don't care enough to go out of my way to install such a system. Greylisting maintained that level of protection at the start of last year, when I had over a million attempted deliveries over a six-month period, so I just don't see the need for anything more complex.

Plus, I don't have to spend ANY time managing my email on most days; the peak of activity, on a day when spam gets through, is having to press "delete".

> Exactly how we incorporate that input into spam scoring has not been publicly disclosed.

Then it's worthless. You're asking us to trust that YOU will find the holes and fix them before the spammers find them and exploit them. No deal; I don't care how smart your friends are, a botnet getting updated with an exploit for your private project would be a nightmare and I can't fix it if it happens while you're in bed or on holiday.

TWW

Re:Yet another wrong answer... (Score:3, Informative)

by nuzak ( 959558 ) writes: on Tuesday December 04, 2007 @12:42PM (#21573007) Journal

> I'm sorry, but I really fail to appreciate the harm done to me by receiving a handful of viagra emails every now and then.

Do you know how much it costs your ISP to run the mail infrastructure for your legitimate mail?

Triple it. That's the cost of spam.

Re:You are also totally wrong (Score:4, Informative)

by Jonathan_S ( 25407 ) writes: on Tuesday December 04, 2007 @12:58PM (#21573293)

> But doesn't this assume that the spam is addressed to multiple recipients? 99% of the spam I get is addressed only to me

I think the confusion here is that you (and many other posters) are trying to evaluate this as a personal anti-spam product.

But it's really designed to be a corporate product. So even if each spam email contains only one recipient, they all go through the corporate email server, allowing it to see all the various recipients a given sender is emailing.

And there were even hints that the software stored on your corporate mail server might be sharing some information with a central data store, allowing it to get the score of all the recipients that the sender is sending to on any network that is a customer of this product. (So it doesn't matter so much if your company only has 10 people to average across because it is somehow cross checking against the global dataset which is tens of thousands of recipients.)

Re:Is linux for homos? (Score:4, Informative)

by myowntrueself ( 607117 ) writes: on Tuesday December 04, 2007 @03:11PM (#21575491)

Linux is not gay, homosexuals are gay.

Not all homosexuals are happy, cheerful people either.
