New Kind of Spam 'Un-Training' Filters?

Posted by ScuttleMonkey on Wed Aug 09, 2006 12:53 PM
from the battle-lines-being-drawn dept.
Zaphod2016 writes to tell us the Wall Street Journal is reporting that email in-boxes are under a new kind of spam attack. This new spam has confused many people due to its lack of advertising, viruses, or request for personal information. One popular theory is that these innocuous blocks of text, often drawn from popular literature, are being used to "un-train" spam filters to allow more malicious spam through in the future.

Related Stories

[+] Backslash: Wireless, Gaming Addiction, Spam, and More 45 comments
Of the thousands of comments on yesterday's Slashdot page, gathered below are some of the ones that defined the conversations on the rise of wireless peripherals, the meaning of content-free spam, whether one can be truly addicted to online gaming, and Intel's move to open source some of its graphics adapter drivers. Read on for the Backslash summary.
This discussion has been archived. No new comments can be posted.
The Fine Print: The following comments are owned by whoever posted them. We are not responsible for them in any way.
  • Other way around? (Score:5, Insightful)

    by Sepodati (746220) on Wednesday August 09 2006, @12:58PM (#15874827) Homepage
    Wouldn't it work the other way around? I still flag crap like this as spam, so it seems like it'd train my spam filter to have more false positives, no?

    ---John Holmes...
    • Re:Other way around? (Score:5, Interesting)

      by pe1chl (90186) on Wednesday August 09 2006, @01:01PM (#15874855)
      At work our SpamAssassin Bayes filter has been heavily trained to treat English text as spam.
      This is because English is not our local language, so almost no business communication is in English and most of the spam is.
      This does sometimes cause false positives when English-language mail has other spam-like properties as well, and the added 3.5 points from the Bayes filter push it above the limit.

      This again shows that you should not rely solely on a Bayes filter as a spam blocker.
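To make the scoring point above concrete, here is a minimal sketch of additive rule scoring. Only the 3.5-point Bayes contribution comes from the comment; the threshold of 5.0 and the rule names and their scores are assumptions made up for illustration, not a real site configuration.

```python
# Sketch of additive spam scoring. Only the 3.5-point Bayes
# contribution is taken from the comment above; everything else is hypothetical.
REQUIRED_SCORE = 5.0  # assumed threshold, not taken from the comment


def is_spam(rule_scores, required=REQUIRED_SCORE):
    """Sum the per-rule scores and compare the total against the threshold."""
    return sum(rule_scores.values()) >= required


# Made-up rule names and scores for a legitimate English-language mail
# that happens to trip two minor rules.
english_mail = {"RULE_A": 1.0, "RULE_B": 0.8}

# The same mail once the Bayes filter adds its 3.5 points for "looks English".
with_bayes = dict(english_mail, BAYES=3.5)

print(is_spam(english_mail))  # total 1.8, below the threshold
print(is_spam(with_bayes))    # total 5.3, above the threshold
```

The mail itself has not changed between the two calls; the Bayes contribution alone moves it over the limit, which is the false-positive case described above.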
      • Re:Other way around? (Score:5, Informative)

        by TubeSteak (669689) on Wednesday August 09 2006, @01:24PM (#15875072) Journal
        My limited experience is that whatever filtering Hotmail uses has been allowing lots of Spam to slip through in the last few weeks.

        Anyone else?
        How's Yahoo & G-Mail been doing?
        • Re:Other way around? (Score:5, Interesting)

          by Skynyrd (25155) on Wednesday August 09 2006, @01:40PM (#15875183) Homepage
          My limited experience is that whatever filtering Hotmail uses has been allowing lots of Spam to slip through in the last few weeks.

          Anyone else?
          How's Yahoo & G-Mail been doing?


          I use Gmail, and although it's let one or two pieces of spam through in the last week, it's always been near 100%.

          I get 50-100 emails a day on Gmail.
      • Re:Other way around? (Score:5, Interesting)

        by ericlondaits (32714) on Wednesday August 09 2006, @01:30PM (#15875121) Homepage
        I recommend that you subscribe to a couple of English-language mailing lists (or Yahoo Groups), which you can then easily filter into a mail subfolder of their own by Subject line or From address. That way you have good English non-spam mail going through your Bayes filter daily.
    • by khasim (1285) <brandioch.conner@gmail.com> on Wednesday August 09 2006, @01:13PM (#15874973)
      I still flag crap like this as spam, so it seems like it'd train my spam filter to have more false positives, no?
      No, unless the people you usually correspond with also include blocks of the same text.

      The only way to increase the false positives is to get the spam filter to learn the words that usually appear in your legitimate messages.

      Since the spammers have no way of knowing what those words are, there is no way they can bypass your filters ... and still be effective in getting through anyone else's filters.
  • Vectorspaces (Score:5, Interesting)

    by bigattichouse (527527) on Wednesday August 09 2006, @12:59PM (#15874839) Homepage
    As a hobby, I play around with ways to classify spam. Not much of a hobby, but I find the problem interesting.

    Lately, I've also been trying to use my vectorspace engine to classify spam, so these sorts of things might get in, but only because they fall into the general category of readable text...

    I've also been thinking about building a GPL tool to provide "sound-based" classification sort of like a "one second orchestra" playing in harmony/disharmony based on the content.

    Regardless of the engine I use, I still have to dig through my trash bin every few days to make sure nothing good slipped through.
  • by Scutter (18425) on Wednesday August 09 2006, @01:00PM (#15874842) Journal
    It is such animportant element, you see, that duration
    of time. I consider twelve hours a substantial measure. So I ran along
    the drive and upthe steps and into the house, but did not see either
    Mrs. Iobserved:Your Excellency is not easily satisfied. And I marvelled,
    and said:How comes it that I have hitherto been deaf to these
    distressfultones? Il passe sur la route, mais toujours en sens inverse.
    For a mental state such astheirs, appetency rather than instability is
    the right word. Which reminds me that the old adage about let us eat and
    drink, forto-morrow, etc. Mais odonc est la vie, sinon dans le peuple?
    They lamented dismally among themselves in many tongues:How I suffer!
    Take that little one on Lzards, for instance;or, in the other volume,
    the bizarre Joies Noires.
  • by sotweed (118223) on Wednesday August 09 2006, @01:00PM (#15874848)
    I've been getting 3 or 4 of these a day for at least a month now. The text can
    always be found in some file of an old book provided by the Gutenberg
    Project, which is making non-copyright texts available through volunteer
    effort.

    I think the theory about using this stuff to untrain spam filters is very plausible.
    But it's difficult to see how it will work. There's no common text among these
    e-mails; in order to send effective spam, there'll have to be at least some text which
    is the same across multiple mails, and that will tend to expose it.
  • by nweaver (113078) on Wednesday August 09 2006, @01:01PM (#15874858) Homepage
    The text block spam is very common WITH images. I suspect that what happened is some lame spammer got a BIG botnet contract, sent out his spam, and forgot to include the image.
  • Un-training? Hardly. (Score:5, Informative)

    by pclminion (145572) on Wednesday August 09 2006, @01:03PM (#15874879)

    Bayesian and other filters do not rely on "spammy" words alone -- they also rely on "unspammy" words, and spammers have no idea what those words are because each person receives different email.

    A scenario, with made up (but plausible) numbers: Suppose you're a developer of a Linux driver for the Bozodrive 1000. The majority of your legitimate email comes from Linux driver development mailing lists. A full 50% of those emails contain the word "IRQ." 99% of the emails contain the word "driver," and 15% contain the word "Johannsen" which is in the signature of one of your friends. And precisely 0% of the emails containing any of these terms have ever been found to be spam.

    Any decent spam filter will give a huge weight to the presence of these "unspammy" words, because of the extremely high probability of emails containing them to be non-spam. The presence of randomly selected confusion words in empty spams is not going to affect these frequency counts.

    In order to defeat a filter by confusing it, the spammer must guess what the SPECIFIC non-spam words for that PARTICULAR email user are, and then produce bogus, spam messages containing those words in the appropriate frequencies. This will cause the classification counts for those words to become more equalized, and the value of those words in determining spammyness to be greatly reduced. However, this is an impossible task unless the spammer has access to the actual emails of the target.

    Perhaps the intent of the empty spams is to confuse the filters, but whoever devised the method has no understanding of how these things actually work, whatsoever.
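The argument above can be sketched in a few lines. This is a deliberately simplified per-token estimate, not what any particular filter does: the class name, the training messages, and the message counts are all invented for illustration. It shows that flooding the spam side with literature leaves the score of a personal "unspammy" word such as "irq" untouched, while spam that actually contains that word does move it.

```python
from collections import Counter


class TokenStats:
    """Toy per-token spam estimate (hypothetical, heavily simplified)."""

    def __init__(self):
        self.msgs = {True: 0, False: 0}                  # messages seen per class
        self.counts = {True: Counter(), False: Counter()}  # messages containing token

    def train(self, text, is_spam):
        # Count each token at most once per message.
        self.msgs[is_spam] += 1
        self.counts[is_spam].update(set(text.lower().split()))

    def spam_probability(self, token):
        """Return s / (s + h), where s and h are the fractions of spam and
        non-spam messages containing the token; 0.5 if never seen."""
        token = token.lower()
        s = self.counts[True][token] / self.msgs[True] if self.msgs[True] else 0.0
        h = self.counts[False][token] / self.msgs[False] if self.msgs[False] else 0.0
        return 0.5 if s + h == 0 else s / (s + h)


stats = TokenStats()
for _ in range(50):
    stats.train("patch for the bozodrive driver irq handling", is_spam=False)
for _ in range(50):
    stats.train("cheap pills buy now", is_spam=True)
before = stats.spam_probability("irq")

# "Poison": 500 content-free literary spams, all flagged as spam by the user.
for _ in range(500):
    stats.train("call me ishmael some years ago never mind how long", is_spam=True)
after_random = stats.spam_probability("irq")

# Targeted poison: spam that actually contains the user's own vocabulary.
for _ in range(50):
    stats.train("irq driver special offer", is_spam=True)
after_targeted = stats.spam_probability("irq")

print(before, after_random)  # unchanged by the random poison
print(after_targeted)        # now above zero
```

Only the last step needs knowledge of the target's mail, which is the part the comment calls impossible without access to the actual messages.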

  • by nuzak (959558) on Wednesday August 09 2006, @01:08PM (#15874918) Journal
    The WSJ article also gives due time to the theory that the spamware is simply broken and that the spam is being delivered with the padding but not the payload. Since I've previously seen plenty of Gutenspam (my name for this spam that contains snips from Gutenberg texts) with an image payload attached, I'm definitely leaning toward the notion that they slipped somewhere and are now not delivering the image.

    Woe betide literature discussion groups now that filters are trained on the classics.

    • by Richard_at_work (517087) <richardprice&gmail,com> on Wednesday August 09 2006, @01:26PM (#15875093)
      I don't think this is the case, as I've been getting these sorts of emails for at least 3 years (looking back at the spam archive I keep to train from): random blocks of legible text, blocks of pseudo-English (the words are correct but there's no effort at sentence structure), even jokes on their own. I got intrigued by this about 6 months ago and wrote a few scripts to see if it was just a broken spam client forgetting to add the payload, but your average 'with payload' spam doesn't seem to match these emails; there are practically no similar 'with payload' spams in my archive with these blocks of text.

      I always wrote it off as Bayesian filter poisoning.
  • by OwlWhacker (758974) on Wednesday August 09 2006, @01:10PM (#15874942) Homepage Journal
    I have seen quite a number of corrupt e-mails coming from spammers. Occasionally you find the subject is merely %%SUBJECT%%, or an e-mail has entered your system consisting of just the headers and no body.

    My theory is that there are more people attempting to use spamming applications, and many of these people don't have a clue what they're doing. You'll probably find that they've forgotten to add their text to the e-mails, or are just not reading the documentation on how to successfully send their spam.
  • by patio11 (857072) on Wednesday August 09 2006, @01:10PM (#15874952)
    The term of art within the anti-spam community is "Bayes Poison". Generally it's appended to an actual spammy offer, but some spammers have in the past used the technique with web bugs to determine whether they are able to deliver to particular boxes with non-spammy content, so that they can evaluate whether their later more-spammy content was excessively spammy or whether it hit the sweet spot on the blocked vs. effective-sales-pitch continuum. Most people in the anti-spam community report that garden-variety Bayes Poison is ineffective at either de-spamming spammy messages or skewing your corpora to the point that they are unusable. One major reason for this is that corpora are so specific to individual users. For example, poisoning my inbox with copies of Huckleberry Finn is rather ineffective because nobody I talk with on a regular basis writes like Mark Twain. For you to do actual damage, you would have to know enough about my habits to guess subjects and words which appear very commonly in legitimate mail -- for example, the names of my family members, keywords relating to my job or extracurricular interests, etc. It is very difficult for spammers to get this information, but some academics have reported that it is theoretically possible, although in practical terms very difficult, to use web bugs to extract the "secret sauce" needed to land in one particular inbox. http://www.jgc.org/SpamConference011604.pps [jgc.org]
    • by seanyboy (587819) * on Wednesday August 09 2006, @01:25PM (#15875078)
      Verily, I understand thy point, but for all the sense thine words make to mine ears, I still cannot understand what villainous treachery it is that makes spam filters reject my own missives out of hand. It is a mystery, and one I feel even the local constabulary could not crack.
  • by nasor (690345) on Wednesday August 09 2006, @01:26PM (#15875094)
    For a while now I've been getting spam for various products or services where the spammers purposely misspell words, spell words with a mix of letters and numbers "l33t" style, or spell words phonetically. I assume that this is to get past spam filters, and I imagine it works to some extent. The question is, do they honestly think anyone would ever buy something from a company that advertises "ch3@p nonperscrip70n med1ca7ion" or "lo morgage rates"? Who the hell would ever do business with a company that can't even seem to spell properly?
    • by Coventry (3779) * on Wednesday August 09 2006, @01:12PM (#15874964) Journal
      Just like the cryptic number sequence radio/voip 'stations', this could be a method of communication.

      We see so much spam every day that everyone takes it for granted, and everyone runs 'filters'. If I wanted to secretly inform agents to begin operations, a select quote from a book sent as spam to hundreds of thousands of people would be perfect. Everyone ends up on spam lists, and receiving spam is a passive process, so it's even more anonymous than public web forums.

    • by pclminion (145572) on Wednesday August 09 2006, @01:12PM (#15874970)

      By having a Bayesian filter forget over time, it also helps shrink down the database and helps it adapt as the contents of spam change over time.

      Having the filter forget is the ONLY effective policy. In statistical filtering, it is certainly NOT true that more data == better results. You want a sample of data that most accurately represents the sort of content you are receiving RIGHT NOW. I completely purge my Firefox Bayesian database every couple of months and retrain on recent emails only. The result is ALWAYS an increase in accuracy, particularly a reduction in false positives.
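One way to get the "forgetting" described above without a manual purge is a sliding window over the training messages. To be clear, this is a substitution of mine for illustration: the comment describes wiping the database and retraining every couple of months, not a rolling window, and the class name and window size below are hypothetical.

```python
from collections import Counter, deque


class WindowedCounts:
    """Token counts over only the most recent `max_messages` training
    messages (hypothetical sketch of a filter that forgets)."""

    def __init__(self, max_messages):
        self.max_messages = max_messages
        self.window = deque()  # (tokens, is_spam) pairs, oldest first
        self.counts = {True: Counter(), False: Counter()}

    def train(self, text, is_spam):
        tokens = frozenset(text.lower().split())
        self.window.append((tokens, is_spam))
        self.counts[is_spam].update(tokens)
        if len(self.window) > self.max_messages:
            # Forget the oldest message and undo its contribution.
            old_tokens, old_label = self.window.popleft()
            self.counts[old_label].subtract(old_tokens)
            for token in old_tokens:
                if self.counts[old_label][token] <= 0:
                    # Drop dead entries so the database actually shrinks.
                    del self.counts[old_label][token]


wc = WindowedCounts(max_messages=3)
wc.train("old pharmacy offer", True)
wc.train("driver irq patch", False)
wc.train("stock tip today", True)
wc.train("new stock tip", True)  # pushes "old pharmacy offer" out of the window

print("pharmacy" in wc.counts[True])  # the stale token is gone
print(wc.counts[True]["stock"])       # recent tokens are still counted
```

Either approach keeps the counts representative of what is arriving now; the window just does it continuously instead of in one periodic reset.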

      • Re:Ditto. (Score:5, Funny)

        by bunions (970377) on Wednesday August 09 2006, @01:13PM (#15874976)
        The only news is they're now calling it Spam 2.0


        that's probably because they're spamming Ajax-enabled sites in the blogosphere about linkrolling the mashups.
    • by blueZ3 (744446) on Wednesday August 09 2006, @01:13PM (#15874981) Homepage
      Spam and anti-virus are good examples of fields where the "solution" is reactive to the problem.

      1. Spammers and malicious code writers come up with something annoying.
      2. Anti-spam and anti-virus software reacts with a method to prevent the annoyance.
      3. Spammers and virus writers implement new tactics.
      4. Repeat steps 2 and 3 ad infinitum.
      (The "Profit!" step is probably at 1a and 3b, but that's another issue.)

      It's not that the spammers are "beating" the spam filters, it's that they are using new tactics and it takes a certain amount of reaction time for the filters to be updated to fight the newly evolved threat. This is why spam filters aren't the ultimate solution to spam, though they are a useful stop-gap.
    • by CohibaVancouver (864662) on Wednesday August 09 2006, @02:00PM (#15875368)
      be interested to know how many people put up money for products / services they were spammed with.

      Quite a few, apparently.

      I read one article which claimed that one spammer in particular "received 10,000 credit card orders in one month [snip] each for $39.95 US."

      So that's nearly $400,000 per month. Nice work if you can get it.

      Source:

      http://www.cbc.ca/story/business/national/2005/04/08/spam-050408.html [www.cbc.ca]

      • by bunions (970377) on Wednesday August 09 2006, @01:20PM (#15875031)
        I swear I hit the 'preview' button and not 'submit.' I blame the soviet mind-control lasers. Here is my post as it should have been:

        my favorites are the ones that put the filter poison into bogus html tags that aren't rendered by Outlook. So I'd get something like

        <oodles> <mycotoxin> <greengrocer> <chubby> <kazoo>
        Buy my shit
        <snappy> <bundle> <chaff> <glum>

        the <greengrocer> tag was my favorite. I sent an RFE to the W3C people, but I haven't heard back yet :mad:
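The bogus-tag trick above works because a mail client silently skips tags it does not recognize, while a naive tokenizer still sees the words inside them. A rough sketch of how a filter could separate the two, using Python's standard `html.parser`; the `KNOWN_TAGS` set is a short illustrative list and nowhere near the full HTML vocabulary, and the class name is made up.

```python
from html.parser import HTMLParser

# Illustrative and incomplete; a real check would need the full tag list.
KNOWN_TAGS = {"html", "head", "body", "p", "a", "b", "i", "br",
              "div", "span", "img", "table", "tr", "td"}


class TagAudit(HTMLParser):
    """Collect unrecognized tag names separately from the visible text."""

    def __init__(self):
        super().__init__()
        self.unknown = []  # tag names a client would not render
        self.text = []     # what the reader actually sees

    def handle_starttag(self, tag, attrs):
        if tag not in KNOWN_TAGS:
            self.unknown.append(tag)

    def handle_data(self, data):
        if data.strip():
            self.text.append(data.strip())


audit = TagAudit()
audit.feed("<oodles> <mycotoxin> <greengrocer> <chubby> <kazoo>\n"
           "Buy my shit\n"
           "<snappy> <bundle> <chaff> <glum>")
audit.close()

print(audit.unknown)  # the poison words, which could be scored as a spam signal
print(audit.text)     # the visible pitch, which is what should be tokenized
```

A filter could then tokenize only `audit.text` and treat a high count in `audit.unknown` as evidence of spam in its own right.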
    • by quokkapox (847798) <quokkapox@gmail.com> on Wednesday August 09 2006, @01:30PM (#15875120)
      where it's not even worth filling this out anymore...

      You advocate a

      ( ) technical ( ) legislative ( ) market-based ( ) vigilante

      approach to fighting spam. Your idea will not work. Here is why it won't work. (One or more of the following may apply to your particular idea, and it may have other flaws which used to vary from state to state before a bad federal law was passed.)

      ( ) Spammers can easily use it to harvest email addresses
      ( ) Mailing lists and other legitimate email uses would be affected
      ( ) No one will be able to find the guy or collect the money
      ( ) It is defenseless against brute force attacks
      ( ) It will stop spam for two weeks and then we'll be stuck with it
      ( ) Users of email will not put up with it
      ( ) Microsoft will not put up with it
      ( ) The police will not put up with it
      ( ) Requires too much cooperation from spammers
      ( ) Requires immediate total cooperation from everybody at once
      ( ) Many email users cannot afford to lose business or alienate potential employers
      ( ) Spammers don't care about invalid addresses in their lists
      ( ) Anyone could anonymously destroy anyone else's career or business

      Specifically, your plan fails to account for

      ( ) Laws expressly prohibiting it
      ( ) Lack of centrally controlling authority for email
      ( ) Open relays in foreign countries
      ( ) Ease of searching tiny alphanumeric address space of all email addresses
      ( ) Asshats
      ( ) Jurisdictional problems
      ( ) Unpopularity of weird new taxes
      ( ) Public reluctance to accept weird new forms of money
      ( ) Huge existing software investment in SMTP
      ( ) Susceptibility of protocols other than SMTP to attack
      ( ) Willingness of users to install OS patches received by email
      ( ) Armies of worm riddled broadband-connected Windows boxes
      ( ) Eternal arms race involved in all filtering approaches
      ( ) Extreme profitability of spam
      ( ) Joe jobs and/or identity theft
      ( ) Technically illiterate politicians
      ( ) Extreme stupidity on the part of people who do business with spammers
      ( ) Extreme stupidity on the part of people who do business with Microsoft
      ( ) Extreme stupidity on the part of people who do business with Yahoo
      ( ) Dishonesty on the part of spammers themselves
      ( ) Bandwidth costs that are unaffected by client filtering
      ( ) Outlook

      and the following philosophical objections may also apply:

      ( ) Ideas similar to yours are easy to come up with, yet none have ever been shown practical
      ( ) Any scheme based on opt-out is unacceptable
      ( ) SMTP headers should not be the subject of legislation
      ( ) Blacklists suck
      ( ) Whitelists suck
      ( ) We should be able to talk about Viagra without being censored
      ( ) Countermeasures should not involve wire fraud or credit card fraud
      ( ) Countermeasures should not involve sabotage of public networks
      ( ) Countermeasures must work if phased in gradually
      ( ) Sending email should be free
      ( ) Why should we have to trust you and your servers?
      ( ) Incompatibility with open source or open source licenses
      ( ) Feel-good measures do nothing to solve the problem
      ( ) Temporary/one-time email addresses are cumbersome
      ( ) I don't want the government reading my email
      ( ) Killing them that way is not slow and painful enough

      Furthermore, this is what I think about you:

      ( ) Sorry dude, but I don't think it would work.
      ( ) This is a stupid idea, and you're a stupid company for suggesting it.
      ( ) Nice try, assh0le! I'm going to find out where you live and burn your house down!