Plan for Spam, Version 2

bugbear writes "I just posted a new version of the Plan for Spam Bayesian filtering algorithm. The big change is to mark tokens by context. The new version decreases spams missed by 50%, to 2.5 per 1000, even though spam has gotten harder to filter since the summer. I also talk about how spam will evolve, and what to do about it."
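
For readers who want a concrete picture of "marking tokens by context", here is a minimal Python sketch. It is not the author's code: the `Header*token` naming, the list of marked headers, and the token regex are all assumptions made for illustration. The idea is only that a word seen in the Subject line is counted separately from the same word seen in the body.

```python
import re

# Assumption: tokens found in these headers get the header name as a prefix,
# so "free" in the Subject line and "free" in the body are different tokens.
MARKED_HEADERS = ("subject", "from", "to", "return-path")

# Assumption: letters, digits, and a few punctuation marks form a token;
# case is preserved.
TOKEN_RE = re.compile(r"[A-Za-z0-9$'!-]+")


def tokenize(headers, body):
    """Return a list of tokens, marking those that come from selected headers."""
    tokens = []
    for name, value in headers.items():
        if name.lower() in MARKED_HEADERS:
            tokens += [f"{name}*{tok}" for tok in TOKEN_RE.findall(value)]
    tokens += TOKEN_RE.findall(body)
    return tokens


tokens = tokenize({"Subject": "FREE offer!!"}, "Feel free to call.")
```

With this scheme the filter can learn that `Subject*FREE` is a strong spam signal while a plain `free` in the body stays close to neutral.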
This discussion has been archived. No new comments can be posted.

  • by Anonymous Coward on Tuesday January 21, 2003 @02:22PM (#5128106)
    Simply use a free account for any registration-required sites / internet posting and only check it when necessary to confirm registration.

    Use another account for regular everyday things, and make sure it isn't something simple like abc123@hotmail.com. I do that and never get spam to my real accounts. This whole spam thing is way overblown.
  • hopeless (Score:-1, Informative)

    by tps12 ( 105590 ) on Tuesday January 21, 2003 @02:27PM (#5128148) Homepage Journal
    I'm sort of impressed to see people still plugging away at the Bayesian spam filter problem. It's admirable to see that kind of perseverance in programmers.

    For those coming late to the story, Joel Spolsky demonstrated in his well-known column [joelonsoftware.com] recently that Bayesian filtering of spam is an intractable problem. Until we have quantum computers, we're stuck with black lists, which work pretty well anyway.

    But keep plugging away guys. Who knows, maybe Joel's wrong.
  • Spam of the Future! (Score:2, Informative)

    by zulux ( 112259 ) on Tuesday January 21, 2003 @02:30PM (#5128165) Homepage Journal
    The really scary part of the article is about what he calls "Spam of the Future". It's really interesting. Basically, it is a spam message that has a lot of seemingly normal text that won't get caught in the spam filter, because it IS normal text. It's then followed by a link, usually to a porn site.

    Here is your opt-in FREE! porn! [goatse.cx]
  • by ajs ( 35943 ) <ajsNO@SPAMajs.com> on Tuesday January 21, 2003 @02:32PM (#5128179) Homepage Journal
    The latest development version of SpamAssassin has an interesting application of Bayesian filtering. Basically, it takes all of SA's existing heuristics, uses them to develop a sense of what is and is not spam, and then pumps the results through a Bayesian filter that learns from these messages.

    As with any other SA test, no single element of the chain is trusted enough to definitively call something spam, but if a message would have squeaked through before, this new filter can put the final nail in its coffin through word analysis against previous spam.

    So, why did I use a subject about "ENDING spam"? Because one of the tools that spammers have is SA itself. They can use it to score their messages and determine how "spamish" it is. The problem now is that each SA installation will have subtly different scoring, and the message may be "ok" according to the spammer's version, but my version has a better sense of the mail that *I* get.

    SpamAssassin is definitely a tool worth checking out if you have not already. Install it in daemon mode (spamd) and then use "spamc -f" in your procmailrc or the equivalent for your MTA.

    Very nice tool, and a real time-saver for me.
  • by Silas ( 35023 ) on Tuesday January 21, 2003 @02:41PM (#5128252) Homepage
    I'm really excited about all of the neat stuff happening with Bayesian filtering and related technologies, but I just wanted to put in a plug for TMDA [tmda.net], Tagged Message Delivery Agent, which uses a whitelist-centric strategy. Since I began using it, the amount of spam I have to look at is virtually at zero. If you haven't read about it yet, check it out.
  • Re:Stop spam? (Score:2, Informative)

    by Mournblade ( 72705 ) on Tuesday January 21, 2003 @02:41PM (#5128253) Homepage
    Just curious - did you follow up w/ him to see *why* he thought you signed up to receive the spam? Is it possible that you inadvertently allowed them to send you spam the last time you renewed your driver's license? I ask because most of the spams I get say "you signed up with one of our partner sites" and I've always wanted to (but have been too lazy to) go back and see how far up the chain I could get.
  • Spam Archive (Score:4, Informative)

    by Doctor Beavis ( 571080 ) on Tuesday January 21, 2003 @02:42PM (#5128264)
    The article mentions compiling a vast collection of spam. Such a project is already underway at SpamArchive [spamarchive.org].
  • by GGardner ( 97375 ) on Tuesday January 21, 2003 @02:46PM (#5128297)
    A common thing that spammers do to try to trick filters is to use

    Content-Type: text/html (or text/plain)
    Content-Transfer-Encoding: base64

    Because a lot of filters don't know how to decipher this. For me, this makes it a lot easier to filter, though. I get no legitimate e-mail encoded this way, so I just have procmail dump any e-mail encoded this way. Problem solved, and without the CPU burden of decoding or running expensive spam filters.
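
The check described above can be sketched in Python with the standard `email` package. This only illustrates the commenter's heuristic (flag any text part whose transfer encoding is base64); the function name and the sample messages are made up, and the rule is one person's 2003-era shortcut rather than a general recommendation, since base64 is a legitimate encoding.

```python
from email import message_from_string


def has_base64_text_part(raw_message):
    """True if any text/* part declares Content-Transfer-Encoding: base64."""
    msg = message_from_string(raw_message)
    for part in msg.walk():
        encoding = (part.get("Content-Transfer-Encoding") or "").strip().lower()
        if part.get_content_maintype() == "text" and encoding == "base64":
            return True
    return False


# Hypothetical sample messages, for illustration only.
sample = (
    "From: someone@example.com\n"
    "Content-Type: text/html\n"
    "Content-Transfer-Encoding: base64\n"
    "\n"
    "PGI+aGk8L2I+\n"
)
plain = "From: someone@example.com\nContent-Type: text/plain\n\nhello\n"
```

Because only the headers are inspected, the body never has to be decoded, which is the CPU saving the comment mentions.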

  • popfile URL (Score:5, Informative)

    by roalt ( 534265 ) <slashdot DOT org AT roalt DOT com> on Tuesday January 21, 2003 @02:46PM (#5128300) Homepage Journal
    Popfile can be installed as an intermediary between your mail server and your mail program, and you can add tags to your mail to decide which 'bucket' your mail belongs in.

    The url for the project is popfile.sourceforge.net [sourceforge.net]

    I haven't tried it yet, but I will try it really soon now!

  • by minas-beede ( 561803 ) on Tuesday January 21, 2003 @02:49PM (#5128318)
    OK, signal and noise. What if the signal was all in one frequency band and the noise all in another? Problem separating them? No.

    What if, in effect, a similar distinction held for spam in the transmission channel - that spam by itself selected a pathway to the recipient that was never used by the signal? Block that pathway and the spam never gets through.

    Spam doesn't select a pathway but spammers do. If you could block relay spam at the open relays it would be dead. You can't, of course - the open relays are controlled by people who don't know the need to block spam. You know that, I know that. If you can't change the people then change the open relays (from the spammers' points of view.) Set up a system that looks like an open relay and stop the spam. An open relay honeypot.

    I asked an operator of such a honeypot how he did last year:

    > How did 2002 end?

    From March 7 to December 26 2002, the total was:

    235,624,232

    Using one Pentium 90 he stopped spam to 235 million recipients. Think about that number when you see filter people reporting what they stop just for their own domains. This was spam to recipients all over, not simply to the honeypot operator's domain: he operates at the relay level. He stopped 100% of the spam, no deception deceived him, no tuning was needed, no valid email was caught - it is perfect filtering. Perfect filtering - who else has that?

    And you can do it at home on your DSL or cable connection (the guy above uses sendmail -bd, but Windows users have a program they can use):

    http://jackpot.uk.net/

    Yeah, I know, spammers are switching to open proxies. So, write an open proxy honeypot. That, too, will be 100% efficient. In addition you now are giving spammers reason to fear every open relay and every open proxy they detect. FEAR. The SPAMMERS have to scramble. They have to scramble and they have to show everything they do to overcome the technique - there is no stealth way to look for open relays and open proxies.

    The problem is solved, it is a matter of implementation and of getting active systems everywhere in the net space (so there's no safe IP space for the spammers anywhere.)

    Remember: A single Pentium 90, 235 million spam messages stopped in 10 months.
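
To put the quoted total in perspective, the arithmetic below turns it into an average rate. It uses only the figures given in the comment (the total and the two dates); the rates are simple averages, not measurements from the honeypot itself.

```python
from datetime import date

stopped = 235_624_232                  # total quoted by the honeypot operator
start, end = date(2002, 3, 7), date(2002, 12, 26)

days = (end - start).days              # elapsed days between the two dates
per_day = stopped / days               # average recipients per day
per_second = per_day / 86_400          # average recipients per second

print(f"{days} days, about {per_day:,.0f} recipients/day, {per_second:.1f}/second")
```

That works out to roughly 800,000 recipients a day on average.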
  • by tbmaddux ( 145207 ) on Tuesday January 21, 2003 @02:52PM (#5128346) Homepage Journal
    This is all quite interesting from a technical standpoint, but what can I gain as a user of Mail.app in MacOS X 10.2 (Jaguar) from this? My Junk filter catches spam and tosses it into a separate folder. I occasionally go through it and send the spam off to SpamCop. What I like about Mail.app is that it's easy to keep training by marking as Junk (for spam it failed to identify) or Not Junk (for occasional false positives). It seems to work well and doesn't require a lot of interaction from me except for interacting with SpamCop (my choice).

    It doesn't catch all the spam, and it occasionally has a false positive. This will be true of any spam filter we implement, because spam continues to change. SpamAssassin runs on some of the mailservers I connect to, but it tends to perform worse than Mail.app. So until we can get each user's spam filter customized at the server, spam identification is going to have to stay client-based. It sounds like Paul Graham's tools are getting a little more efficient, but does any of this make a big difference for the end user?

  • by siskbc ( 598067 ) on Tuesday January 21, 2003 @02:53PM (#5128350) Homepage
    ...until the email server at work got hacked and someone stole the entire address list. Since then, all of us have been getting spam by the bucketloads. And since I depend on people being able to get my current work address, I can't change it. Thank God for SpamAssassin!
  • by Anonymous Coward on Tuesday January 21, 2003 @02:53PM (#5128352)
    It's all fine and dandy to have a spamtrap account if you never plan to read it, but what if you want to get online bank statement notifications or other important notices? I just noticed my friendly credit card company (Capital One) took it upon themselves to introduce my previously spam-free e-mail account to their business partners so they could introduce me to the wonderful world of buying fucking flowers for Valentine's Day. Thanks a lot, assholes. And no, they have NO option to opt out of this fucking crap. The spam is posted from the same address as the statement notifications with a friendly disclaimer saying they're not in any way affiliated. Nice.
  • by wessto ( 469499 ) on Tuesday January 21, 2003 @02:58PM (#5128383) Homepage
    I host several domains as a hobby for my family. Recently my IP address made it into a listing on spews.org. Am I a spammer? By no means. Am I screwed? Absolutely. After reading spamming newsgroups I found that I am not alone. At first I was just getting blocked because I was sending mail (from my own SMTP server) from a "known" spamming source when in fact I'm not a source of spam. My IP happens to fall into a larger block of IPs that my ISP owns, some of which are sources of spam.

    This was a minor setback, but now other services are starting to use bulk email sources as deny lists for their offerings. My free DNS provider, zoneedit [zonedit.com], now prohibits me from adding / modifying any of my zones. This is simply not acceptable to me. The way spews is set up, it is not easy for my IP to get off the list. My ISP cannot just call them up and take me off. There has to be a way to avoid this, and eliminating spam at a higher level would be a good start.
  • sneakemail.com (Score:2, Informative)

    by Stalemate ( 105992 ) on Tuesday January 21, 2003 @03:01PM (#5128398)
    sneakemail.com is my new way of eliminating spam.
  • ifile (Score:1, Informative)

    by Eunuchswear ( 210685 ) on Tuesday January 21, 2003 @03:03PM (#5128419) Journal
    Why does no-one ever mention ifile [nongnu.org]? They seem to have been doing this for quite some time (since 1996?) and have a neat trick for avoiding all those boring "training" steps (you tell ifile how to classify messages by moving them into the folder you think they should be in).
  • by Silas ( 35023 ) on Tuesday January 21, 2003 @03:06PM (#5128432) Homepage
    It sounds like you're using TMDA [tmda.net]. Or, if you're not, you should be. :) Check out my related post on this story [slashdot.org].
  • by AssFace ( 118098 ) <stenz77@gmail. c o m> on Tuesday January 21, 2003 @03:15PM (#5128485) Homepage Journal
    I went from over 500 spam a day down to about 3 or so, and I figured out that those last 3 get through because they bypass the filter. (I have a bunch of different URLs, and the server that hosts them all also has its own name, so mail sent to my username at that host doesn't go through any filters, and the way the filters are set up there, at pair.com, I can't trap that particular server name.)

    I have been very impressed with SA and am writing scripts to track the stats even better (I love seeing what it has pulled out every day).
    So far I have had zero false positives out of about 1-2 megs of mail being filtered every day for nearly a month now.

    SA has multiple different ways of searching the mail - any one of them can be easily bypassed by any given e-mail - but all of them together are really damn good at getting rid of spam.
    I'm very impressed with it and how well it learns (although straight "out of the box", or perhaps I should say "straight out of the tar.gz", it brought me down from 500+ spam to 5-10 a day, and then I tweaked how my accounts were filtering into SA and that fixed the rest).
  • by blakestah ( 91866 ) <blakestah@gmail.com> on Tuesday January 21, 2003 @03:25PM (#5128576) Homepage
    I think you misunderstand how easy bogofilter is.

    I initially trained on about 200 emails. At first, I got 1 spam per day, or so. There have not yet been any false positives (good mail classified as spam).

    A week later, I get 1 spam in my inbox every 3-4 days, and no good mail has been classified as spam. All I need to do is take the false identifications and re-classify them. That means, every 3-4 days I take the spam in my inbox and re-scan it through bogofilter (cat SPAM | bogofilter -S). That is all. It is not any effort, really, after the initial training. Then, the filter does all the work, and you don't need to worry about blacklisting or whitelisting or anything.

    The really important thing is that the filter statistically optimizes YOUR manual email classification. The best source of email classifying is YOU looking at an email, and Bayesian filtering is the only method that is optimized to do that.
  • by bugbear ( 448726 ) on Tuesday January 21, 2003 @03:44PM (#5128733) Homepage
    Wouldn't work. The algorithm only cares about the most statistically significant 15 words. TEENS easily beat yams.
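
The "most statistically significant 15 words" rule can be illustrated with a short Python sketch. It assumes the usual naive-Bayes combination of per-token spam probabilities and picks the tokens farthest from a neutral 0.5; the function name and the sample probabilities are hypothetical, not taken from the filter itself.

```python
from math import prod


def spam_probability(token_probs, n=15):
    """Combine per-token spam probabilities the naive-Bayes way.

    token_probs: probabilities strictly between 0 and 1, one per token.
    Only the n values farthest from the neutral 0.5 are used.
    """
    interesting = sorted(token_probs, key=lambda p: abs(p - 0.5), reverse=True)[:n]
    spammy = prod(interesting)
    hammy = prod(1.0 - p for p in interesting)
    return spammy / (spammy + hammy)


# Hypothetical numbers: a few strongly spammy tokens among many mild ones.
mild = [0.45] * 40            # lots of innocent-looking filler words
strong = [0.99, 0.99, 0.98]   # a few tokens seen almost only in spam

score = spam_probability(mild + strong)
filler_only = spam_probability(mild)
```

Padding a message with mildly innocent words barely moves the result, because the few extreme tokens are always selected first and dominate the product.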
  • by Anonymous Coward on Tuesday January 21, 2003 @04:00PM (#5128864)
    All you need to block spam:
  • by uncleFester ( 29998 ) on Tuesday January 21, 2003 @04:01PM (#5128871) Homepage Journal
    Another program is SpamPal [spampal.org.uk], which also acts as a pop proxy. It also has a plugin structure, and one of the plugins is a Bayesian filter. This is in addition to included support for using available spam blacklist stuff like SPEWS, ORDB, SpamCop and a whole bunch of other DNSBL lists (even the ability to block entire domains like .kr, .ch and so on). It's a rather cool piece of software.
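
For anyone unfamiliar with how a DNSBL lookup works, here is a minimal Python sketch of the query name a client builds: the IPv4 octets are reversed and the list's zone is appended. The zone `dnsbl.example.org` is a placeholder, not a real list, and the function name is made up for illustration.

```python
import ipaddress


def dnsbl_query_name(ip, zone):
    """Build the hostname a DNSBL client looks up for an IPv4 address."""
    octets = str(ipaddress.IPv4Address(ip)).split(".")
    return ".".join(reversed(octets)) + "." + zone


# "dnsbl.example.org" is a placeholder zone, used only for illustration.
name = dnsbl_query_name("192.0.2.99", "dnsbl.example.org")
```

If that name resolves to an address, the sender is listed; if the lookup fails with "no such name", it is not. A real client would pass the name to a resolver such as `socket.gethostbyname`.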
  • Re:popfile URL (Score:2, Informative)

    by joeldg ( 518249 ) on Tuesday January 21, 2003 @04:14PM (#5128969) Homepage
    Popfile rocks.. used it for a while, 89% accuracy.. but the 11% is actually relatives/friends sending me stupid forwards, so in reality it is about 99% accurate.. nice..
  • by jellomizer ( 103300 ) on Tuesday January 21, 2003 @04:43PM (#5129221)
    Without using filtering software.

    1. Change your e-mail address and drop the old one. (This way you are starting off with a clean slate and not on any mailing lists.)

    2. Make sure your ISP doesn't post or sell your e-mail address.

    3. Make your e-mail address simple for people to remember but hard for a computer to guess, for example m1nam3@isp.com. Use similar methods as you would in making a password. That avoids having a common-name e-mail address.

    4. On your webpage make a CGI/PHP/ASP/whatever form to send you an e-mail. When you want people to e-mail you, give them the link to that page. Make sure that there are no parameters that can make your program e-mail others, and also that your e-mail address is not listed in any of the source that is visible to the web user.

    5. Only give your e-mail to people you can relatively trust. If you can't trust them then give them a link to your webpage.

    6. When filling out forms on the network asking for your e-mail, either use an alternate e-mail or read the company's privacy claims and make sure that you do not check or uncheck something stating that they will send you e-mail or ads.

    7. Use SpamAssassin or other email filtering on your system.

    8. Forward all spam to ucs@ftc.gov with all the headers.

    9. See if your email client has an automatic bounce back. If so, bounce the message back to sender.

    10. If you want to post your e-mail address then I would make a graphic (JPG, PNG) of your e-mail address. That way it is harder for most computers to read it.
  • by billstewart ( 78916 ) on Tuesday January 21, 2003 @04:45PM (#5129238) Journal
    Vipul's Razor [sourceforge.net] on SourceForge is the canonical collaborative spam filter network. These things really do make a dent in spammers constructing not-very-spam-looking messages that sneak through filters, because to get around them, they need to send sufficiently different messages to each target, though the openness of the matching algorithm means they do have the tools to try it.

    The SpamAssassin implementation at one of my ISPs seems to be using it as part of its rating heuristic.
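
The collaborative idea, and why per-recipient variation defeats the simplest form of it, can be sketched in a few lines of Python. This is not Razor's actual signature scheme: a plain SHA-1 digest of the case-folded, whitespace-normalized body and an in-memory set standing in for the shared catalogue are assumptions chosen to keep the example small.

```python
import hashlib


def signature(body):
    """Digest of the body with case folded and whitespace runs collapsed."""
    normalized = " ".join(body.lower().split())
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()


reported = set()   # stands in for the catalogue shared between users


def report(body):
    """Record a message that some user has flagged as spam."""
    reported.add(signature(body))


def is_known_spam(body):
    """Check a message against everything reported so far."""
    return signature(body) in reported


report("Buy  cheap pills NOW")
same = is_known_spam("buy cheap pills now")          # reformatted copy: caught
varied = is_known_spam("buy cheap pills now, Bob")   # per-recipient text: missed
```

An exact digest catches identical bulk copies but misses any message with even one changed word, which is the evasion route the comment describes.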

  • by ajs ( 35943 ) <ajsNO@SPAMajs.com> on Tuesday January 21, 2003 @04:59PM (#5129360) Homepage Journal
    Incorrect. SA is using that technique (and has for a fairly long time now) centrally to generate their score lists. That's important, and it's a very strong part of SA.

    However, in the next release of SA (and I'm currently running it out of CVS, so it's hardly vapor), they will *also* be using full word scoring heuristics. That scoring will result in a boolean "spamishness" which will in turn be assigned a score centrally (which users can override, of course).

    By way of example, here's a recent summary of one of my pieces of spam:

    Content analysis details: (12.50 points, 4 required)
    NO_REAL_NAME (1.3 points) From: does not include a real name
    INVALID_DATE (1.6 points) Invalid Date: header (not RFC 2822)
    BAYES_90 (2.0 points) BODY: Bayesian classifier says spam probability is 90 to 99%
    [score: 0.9645]
    RAZOR2_CF_RANGE_91_100 (0.0 points) BODY: Razor2 gives a spam confidence level between 91 and 100
    [cf: 100]
    RAZOR2_CHECK (3.9 points) Listed in Razor2, see http://razor.sf.net/
    DATE_IN_PAST_03_06 (0.2 points) Date: is 3 to 6 hours before Received: date
    MSG_ID_ADDED_BY_MTA_3 (2.0 points) 'Message-Id' was added by a relay (3)
    FORGED_MUA_OUTLOOK (1.0 points) Forged mail pretending to be from MS Outlook
    MISSING_MIMEOLE (0.5 points) Message has X-MSMail-Priority, but no X-MimeOLE

    As I said previously, the interesting part here is not the word-analysis, but the fact that the database for that word analysis is generated dynamically by looking at your mail, and applying SA's other rules. Self-training of this sort has proven highly successful in tests, and may yield the next quantum of spam-filtering effectiveness.

    Notice also that while that 2.0 points from Bayes is a big push to this spam's score, it's not enough to mark it as spam on its own. This is the power of SpamAssassin. No one test says, "this is spam", and so no one test is trusted on its own.
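
The additive scoring in that report is easy to check by hand or in a few lines of Python. The rule names and scores below are copied from the report above; treating "required" as a greater-than-or-equal threshold is an assumption made for this sketch.

```python
# Rule scores copied from the report above.
rule_scores = {
    "NO_REAL_NAME": 1.3,
    "INVALID_DATE": 1.6,
    "BAYES_90": 2.0,
    "RAZOR2_CF_RANGE_91_100": 0.0,
    "RAZOR2_CHECK": 3.9,
    "DATE_IN_PAST_03_06": 0.2,
    "MSG_ID_ADDED_BY_MTA_3": 2.0,
    "FORGED_MUA_OUTLOOK": 1.0,
    "MISSING_MIMEOLE": 0.5,
}
REQUIRED = 4.0   # "4 required" in the report

# Round to hide floating-point noise from adding decimal scores.
total = round(sum(rule_scores.values()), 2)

is_spam = total >= REQUIRED                           # assumed comparison
bayes_alone = rule_scores["BAYES_90"] >= REQUIRED     # would Bayes suffice alone?
```

The rules add up to the 12.50 points shown in the report, while the Bayes rule by itself stays below the threshold.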
  • Please take a look at my notes on last week's spam conference [slashdot.org], and in particular the Jon Praed notes (near the end; two speakers came after him).

    Praed argued, very eloquently & persuasively (hey, he's a lawyer :) that there are laws on the books banning spam in nearly every state. All you have to do is find a way to bring those laws to your assistance. In particular, note that:

    • Ever have a hard time tracking down a spammer? Ever have one that spoofed message headers? Gee, that sounds like fraud, doesn't it? Indeed it does -- much or even all spam can be considered as fraud, and as such you can attack it from that angle anywhere in the country.
    • Laws are pending in various jurisdictions to outlaw spammers' bulk mail software. The catch here is that there is a lot of legitimate bulk mail software that can be abused -- think majordomo, MailMan, etc -- so any laws crafted will have to include clauses that protect legitimate use of such software while banning UCE somehow. Watch for this to develop over time.
    • Suggestion: if you get spam that mentions a trademarked product (Viagra, pirated copies of well known software, etc), forward the message to the holder of that trademark. They will almost always be keenly interested in this abuse of their trade name, and will take it upon themselves to go after the spammer.
    • If you are in the habit of reporting spam to an organization like SpamCop [spamcop.net], do so as quickly as possible: spammers are getting in the habit of leaving their ads up long enough for recipients to respond to, but pulling them down before investigators get a chance to scrutinize anything. The faster these groups can analyze the sources of spam, the better the chances of getting all the way back to the source.
    • Final and most important point: the precedent set by the Verizon vs. Ralsky case was very valuable to anti-spam efforts. First, spam prosecution can be carried out in the jurisdiction where the harm occurred, not where the person doing harm was when causing it. So if California has anti-spam laws, they can potentially be used no matter where the spammer lives. Praed practices law in Virginia, so I'm assuming that their laws are amenable to this kind of application. Second point: ignorance of an ISP's acceptable use policy (AUP) is no defence in court -- certain etiquette standards have emerged over time, and it is assumed that the sender of UCE has to be aware of these standards. As a result, if your ISP has an AUP that forbids UCE, this can be a tangible protection for you in court. This is very good news!

    As a lawyer who has successfully prosecuted a number of spammers, Praed was able to talk about all of this with some authority. He cautioned everyone though that laws will never eradicate spam -- as he put it, "people still rob banks since that's where the money is". But legislation & prosecution can still be a very valuable tool in fighting spam, and an important supplement to things like better mail filters. This is a big problem, and is going to need a variety of tiered solutions to control it.

  • by Thing 1 ( 178996 ) on Tuesday January 21, 2003 @05:50PM (#5129846) Journal
    On a different subject, in a story about a week ago, someone posted a link to a peer-peer network of spam emails for MS Outlook available at http://www.cloudmark.com that will trap a significant amount of emails based on (and this is overly simplified, of course) users' votes. Does such a solution exist in the open source world?

    Hi, that was me [slashdot.org]. Unfortunately this only works for Outlook (not even Outlook Express), but it's been working great for me.

    As others have pointed out, Vipul's Razor [sourceforge.net] is a great open-source solution.

    Checking SourceForge [sourceforge.net], I found the following additional packages:

    BogoFilter [sourceforge.net]

    SpamAssassin [sourceforge.net]

    JoeEmail [sourceforge.net]

    Bayesian anti-spam classifier [sourceforge.net]

    Anti-Spam SMTP Proxy Server [sourceforge.net]

    Bayesian Mail Filter [sourceforge.net]

    JunkFilter [sourceforge.net]

    SpamProbe - fast bayesian spam filter [sourceforge.net]

    Mailfilter [sourceforge.net]

    IMAPAssassin [sourceforge.net]

    That's just from the first page of search results. If you'd like to see all the results (I did a search for "spam" from their search box), click here [sourceforge.net].

  • by Vainglorious Coward ( 267452 ) on Tuesday January 21, 2003 @05:52PM (#5129873) Journal

    Without using filtering software.

    1. Change your e-mail address and drop the old one.

    Off to an ugly start. Joe Average will abort on your list before he's even begun

    2. Make sure your ISP doesn't post or sell your e-mail address.

    I'd love to know how you're going to ensure this

    5. Only give your e-mail to people you can relatively trust. If you can't trust them then give them a link to your webpage.

    "No mom, you can't have my email address. You just use it to send me e-greetings and I hate getting those from you..."

    6. When filling out forms on the network asking for your e-mail ... read the company's privacy claims and make sure that you do not check or uncheck something stating that they will send you e-mail or ads.

    Spammers lie. We wouldn't have all these problems if spammers were truthful

    7. Use SpamAssassin or other email filtering on your system

    How do I do that "without using filtering software" ?

    8. Forward all spam to ucs@ftc.gov with all the headers.

    You mean uce@ftc.gov. Also note that (depending on the email client) just forwarding a message usually destroys the headers of interest.

    9. See if your email client has an automatic bounce back. If so, bounce the message back to sender.

    How exactly does sending a response to an address that either (a) doesn't exist, (b) exists, but is irrelevant (joe-job), or (c) is an address-validation mechanism, help anything?

    10. If you want to post your e-mail address then I would make a graphic (JPG, PNG) of your e-mail address. That way it is harder for most computers to read it

    This one I can't find fault with :) (but note there will be some people who get confused/annoyed when they can't just click on a mailto: link, I'm just not one of them).

  • by Jadrano ( 641713 ) on Tuesday January 21, 2003 @08:17PM (#5131067)
    It's not just about the US: in many European countries spam is already illegal (clear cases are Norway and Austria), and the European Union as a whole has decided to outlaw spam, which should be implemented this year. I don't know exactly about the situation in East Asia, but I don't think the Chinese and Koreans like it too much that their resources are misused for sending spam all over the world, so they could follow soon. Yes, there certainly will be some smaller countries where spam is still legal, but once spam is illegal in the European Union, the United States, China and many other big countries, no one who has sent thousands of spam mails to harvested addresses can reasonably claim that he or she believed that all the addresses were only of people in a few offshore countries.
    Furthermore, US law has, as far as I know, a principle of extraterritorial application, which is in general quite controversial, but could be useful here: it would probably be possible to forbid any companies that do business in the US from sending spam, even if the spam is only sent from other countries and only to people living outside the United States.
