New Kind of Spam 'Un-Training' Filters?

Posted by ScuttleMonkey on Wed Aug 09, 2006 11:53 AM
from the battle-lines-being-drawn dept.
Zaphod2016 writes to tell us the Wall Street Journal is reporting that email in-boxes are under a new kind of spam attack. This new spam has confused many people due to its lack of advertising, viruses, or requests for personal information. One popular theory is that these innocuous blocks of text, often drawn from popular literature, are being used to "un-train" spam filters to allow more malicious spam through in the future.

Related Stories

[+] Backslash: Wireless, Gaming Addiction, Spam, and More 45 comments
Of the thousands of comments on yesterday's Slashdot page, gathered below are some of the ones that defined the conversations on the rise of wireless peripherals, the meaning of content-free spam, whether one can be truly addicted to online gaming, and Intel's move to open source some of its graphics adapter drivers. Read on for the Backslash summary.
This discussion has been archived. No new comments can be posted.
The Fine Print: The following comments are owned by whoever posted them. We are not responsible for them in any way.
  • Other way around? (Score:5, Insightful)

    by Sepodati (746220) on Wednesday August 09 2006, @11:58AM (#15874827)
    (http://www.bigredspark.com/)
    Wouldn't it work the other way around? I still flag crap like this as spam, so it seems like it'd train my spam filter to have more false positives, no?

    ---John Holmes...
    • Re:Other way around? (Score:5, Interesting)

      by pe1chl (90186) on Wednesday August 09 2006, @12:01PM (#15874855)
      At work, our SpamAssassin Bayes filter has been heavily trained to treat English text as spam.
      This is because English is not our local language, so almost no business communication is in English, while most of the spam is.
      This does sometimes cause false positives when English-language mail has other spam-like properties as well, and the added 3.5 points from the Bayes filter push it above the limit.

      This again shows that you should not rely solely on a Bayes filter as a spam blocker.
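
A rough sketch of the additive scoring described above, to show how a 3.5-point Bayes contribution can tip an otherwise borderline message over the limit. The rule names, the two non-Bayes weights, and the 5.0 threshold are assumptions made for illustration; only the 3.5 figure comes from the comment.

```python
# Hypothetical rule names and weights; not a real SpamAssassin rule set.
REQUIRED_SCORE = 5.0  # assumed threshold for this sketch


def total_score(rule_hits):
    """Sum the points contributed by every rule that fired."""
    return sum(rule_hits.values())


def is_spam(rule_hits, required=REQUIRED_SCORE):
    """A message is flagged once its summed score reaches the threshold."""
    return total_score(rule_hits) >= required


# An English-language business mail that trips two mild heuristics.
hits = {"HTML_ONLY": 1.1, "SUBJECT_HAS_EXCLAIM": 0.9}
print(is_spam(hits))

# The same mail once a Bayes rule trained on "English means spam" adds 3.5 points.
hits_with_bayes = dict(hits, BAYES_HIGH=3.5)
print(is_spam(hits_with_bayes))
```

Because the scores simply add up, a Bayes rule trained on a skewed corpus does not need to be decisive on its own; it only needs to cover the gap left by the other rules.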
      [ Parent ]
      • Re:Other way around? (Score:5, Informative)

        by TubeSteak (669689) on Wednesday August 09 2006, @12:24PM (#15875072)
        (Last Journal: Saturday February 25 2006, @11:02PM)
        My limited experience is that whatever filtering Hotmail uses has been allowing lots of Spam to slip through in the last few weeks.

        Anyone else?
        How's Yahoo & G-Mail been doing?
        [ Parent ]
        • Re:Other way around? (Score:5, Interesting)

          by Skynyrd (25155) on Wednesday August 09 2006, @12:40PM (#15875183)
          (http://liberalredneck.org/)
          My limited experience is that whatever filtering Hotmail uses has been allowing lots of Spam to slip through in the last few weeks.

          Anyone else?
          How's Yahoo & G-Mail been doing?


          I use Gmail, and although it's let one or two pieces of spam through in the last week, its catch rate has always been near 100%.

          I get 50-100 emails a day on Gmail.
          [ Parent ]
        • Re:Other way around? by Andrew Kismet (Score:2) Wednesday August 09 2006, @12:55PM
        • Re:Other way around? by AuMatar (Score:2) Wednesday August 09 2006, @12:55PM
        • Re:Other way around? (Score:5, Informative)

          How's Yahoo & G-Mail been doing?

          Here are actual samples of emails that Gmail and Yahoo have let through to my inbox over the past couple days. First, Gmail:

          Wells, who has had a rather similar historyand who obviously owes something to Dickens as novelist. In some ways his outlook is verysimilar to Dickenss. No one who is really involved in the landscape ever sees thelandscape. To Chesterton the poor means small shopkeepers andservants. There is nothing psychologically false in this, either. No one who is really involved in the landscape ever sees thelandscape. It is easy to imagine what the young woman would have said to this inreal life. And given the FACT ofservitude, the feudal relationship is the only tolerable one. Theother point is that Dickenss early experiences have given him a horrorof proletarian roughness. They, and the men, always spoke of me as the younggentleman. It is one of the stockjokes of English literature, from Malvolio onwards. Buthe is remarkably free from the idiocy of regarding nations asindividuals. So were all the characteristic English novelists of thenineteenth century. The last thing anyone ever remembers about the books is theircentral story. Nevertheless hislist of most hated types is like enough to Wellss for the similarity tobe striking. A change of heart is in fact THE alibi of peoplewho do not wish to endanger the STATUS QUO. There is nothing psychologically false in this, either. Pickwick and the servant should be Sam Weller. It is noticeable thatDickens hardly writes of war, even to denounce it. Therewere no labour-saving devices, and there was huge inequality of wealth. In Dickenss novels anything in the nature of work happens off-stage. And, on the whole, his attacks on good society are ratherperfunctory. But byorigins and upbringing Thackeray happens to be somewhat nearer to theclass he is satirizing. Here perhaps Gissing is influenced by his own love of classical learning. In a rather different sense his attitude to life is extremely unphysical. It is usual to claim him as a popularwriter, a champion of the oppressed masses. Dickens would be quite incapable of this. 
Compare any lawsuit in Dickens with the lawsuit inORLEY FARM, for instance. I do consider the young ooman, sir, said Sam. Here the contrast between Dickens and, say, Trollopeis startling. It is true that not all his novelsare alike in this. He getshimself arrested in order to follow Mr. Progressis not an illusion, it happens, but it is slow and invariablydisappointing. If his palms are hard from work, they let him in; if his palms aresoft, out he goes. It is perhaps more significant that he shows noprejudice against Jews. At first sight this statement looks flatly untrueand it needs some qualification. A modern manservant would neverthink of doing either. There arepractically no friendly pictures of the landowning class, for instance. If one wants a modern equivalent,the nearest would be H.


          Attached to the above was an image file that contained an obvious ad. So to Gmail, this apparently looks like a regular text email that happens to have an attached image.

          (You can argue about how effective this is, since Gmail thumbnails all images, meaning you'd need to click a separate link to open it and read it.)

          Now Yahoo, where I get approximately 1,000 messages to my bulk folder per day - this is the only one that's gotten through to my inbox in the last day:

          FROM THE DESK OF Mrs Queen Adams
          BANK OF AFRICA [BOA]
          OUAGADOUGOU, BURKINA FASO.

          DEAR FRIEND,

          I AM HOPEFUL THAT THIS MAIL WILL REACH YOU IN GOOD CONDITION OF
          HEALTH.I AM MRS QUEEN ADAMS A STAFF OF BANK OF AFRICA AND A BURKINABE RESIDENT
          IN BURKINA FASO ALSO.IN THE BANK WHERE I WORK AS AN AUDITOR,I
          DISCOVERED AN ABANDONED SUM OF MONEY AMOUNTING TO 15.2MILLION DOLLARS BELONGING
          TO DR GEORGE BRUMLEY WHO UNFORTUNATELY DIED IN THE PLANE CRASH OF UNION
          TRANSPORT AFRICAN FLIGHT BOEING 727 IN KENYA, EAST AFRICA ON SUNDAY
          [ Parent ]
        • I get Gmail false-negatives in clumps by wsanders (Score:2) Wednesday August 09 2006, @01:21PM
        • Re:Other way around? by zip_000 (Score:1) Wednesday August 09 2006, @01:38PM
        • Re:Other way around? by Given M. Sur (Score:1) Wednesday August 09 2006, @01:41PM
        • Re:Other way around? (Score:4, Interesting)

          by porcupine8 (816071) on Wednesday August 09 2006, @02:07PM (#15875878)
          (Last Journal: Monday November 07 2005, @10:05AM)
          Actually, you haven't noticed any legitimate emails from Yahoo getting tossed as spam, have you? (Just curious, I've emailed my dad three times in a row with no response, even though he's forwarded me stuff in between, and he's usually quick to respond, so I'm worried Hotmail is tagging emails from Yahoo addresses or something.)

          I think I've confused Yahoo by applying for a mortgage. So I've been getting lots of legitimate mortgage and real estate-related emails, and it's been starting to let through a few related spams as well.

          Other than that, I haven't been getting any more stray spam than usual. Maybe once a week I'll get one (that's not mortgage-related) that the filter misses.

          Then there are the ones that go to email lists that I have filtered to other boxes besides Inbox... Since you can't pick when the spam filter works, it always works AFTER all your others, and so I get all of these. *sigh*

          [ Parent ]
        • Re:Other way around? by kirun (Score:1) Wednesday August 09 2006, @02:12PM
        • Re:Other way around? by shoolz (Score:2) Wednesday August 09 2006, @04:56PM
        • Re:Other way around? by eison (Score:2) Wednesday August 09 2006, @05:17PM
        • Re:Other way around? by BootNinja (Score:1) Wednesday August 09 2006, @05:33PM
        • Re:Other way around? by rvqbl (Score:1) Wednesday August 09 2006, @07:37PM
        • Re:Other way around? by freedom_india (Score:2) Wednesday August 09 2006, @08:25PM
        • Re:Other way around? by Tim C (Score:2) Thursday August 10 2006, @04:17AM
        • Re:Other way around? by Anivair (Score:1) Thursday August 10 2006, @06:38AM
        • Re:Other way around? by 11_biznatch_11 (Score:1) Thursday August 10 2006, @10:53AM
        • Re:Other way around? by kbahey (Score:2) Thursday August 10 2006, @11:29PM
      • Re:Other way around? (Score:5, Interesting)

        by ericlondaits (32714) on Wednesday August 09 2006, @12:30PM (#15875121)
        (http://www.derol.com.ar/)
        I recommend that you subscribe to a couple of English-language mailing lists (or Yahoo Groups), which you can then easily filter into a mail subfolder of their own by Subject line or From address. That way you have good English non-spam mail going through your Bayes filter daily.
        [ Parent ]
      • Re:Other way around? by no1nose (Score:1) Wednesday August 09 2006, @02:08PM
      • Filtering based on Language or Country by billstewart (Score:2) Wednesday August 09 2006, @08:21PM
    • Re:Other way around? by 0racle (Score:2) Wednesday August 09 2006, @12:02PM
    • Re:Other way around? by TheOrangeMan (Score:1) Wednesday August 09 2006, @12:02PM
    • Re:Other way around? by John Hasler (Score:3) Wednesday August 09 2006, @12:09PM
    • by khasim (1285) <brandioch.conner@gmail.com> on Wednesday August 09 2006, @12:13PM (#15874973)
      I still flag crap like this as spam, so it seems like it'd train my spam filter to have more false positives, no?
      No, unless the people you usually correspond with also include blocks of the same text.

      The only way to increase the false positives is to get the spam filter to learn, as spam, the words that usually appear in your legitimate messages.

      Since the spammers have no way of knowing what those words are, there is no way they can bypass your filters ... and still be effective in getting through anyone else's filters.
      [ Parent ]
    • Re:Other way around? by Kaffien (Score:1) Wednesday August 09 2006, @02:54PM
    • Yes, I think some of it is a censorship attempt. by twitter (Score:2) Wednesday August 09 2006, @03:08PM
    • Re:Other way around? by thoughtlover (Score:1) Thursday August 10 2006, @01:18AM
  • This isn't new by bunions (Score:2) Wednesday August 09 2006, @11:58AM
  • I got some. by Anonymous Coward (Score:1) Wednesday August 09 2006, @11:58AM
  • This is news? by mrxak (Score:2) Wednesday August 09 2006, @11:59AM
  • Vectorspaces (Score:5, Interesting)

    by bigattichouse (527527) on Wednesday August 09 2006, @11:59AM (#15874839)
    (http://www.bigattichouse.com/)
    As a hobby, I play around with ways to classify spam. Not much of a hobby, but I find the problem interesting.

    Lately, I've also been trying to use my vectorspace engine to classify spam, so these sorts of things might get in, but only because they fall into the general category of readable text...

    I've also been thinking about building a GPL tool to provide "sound-based" classification sort of like a "one second orchestra" playing in harmony/disharmony based on the content.

    Regardless of the engine I use, I still have to dig through my trash bin every few days to make sure nothing good slipped through.
  • by Scutter (18425) on Wednesday August 09 2006, @12:00PM (#15874842)
    (Last Journal: Wednesday January 15 2003, @08:09AM)
    It is such animportant element, you see, that duration
    of time. I consider twelve hours a substantial measure. So I ran along
    the drive and upthe steps and into the house, but did not see either
    Mrs. Iobserved:Your Excellency is not easily satisfied. And I marvelled,
    and said:How comes it that I have hitherto been deaf to these
    distressfultones? Il passe sur la route, mais toujours en sens inverse.
    For a mental state such astheirs, appetency rather than instability is
    the right word. Which reminds me that the old adage about let us eat and
    drink, forto-morrow, etc. Mais odonc est la vie, sinon dans le peuple?
    They lamented dismally among themselves in many tongues:How I suffer!
    Take that little one on Lzards, for instance;or, in the other volume,
    the bizarre Joies Noires.
  • What delayed stories by wbtittle (Score:1) Wednesday August 09 2006, @12:00PM
  • I just thought they were weird. by celardore (Score:2) Wednesday August 09 2006, @12:00PM
  • by sotweed (118223) on Wednesday August 09 2006, @12:00PM (#15874848)
    I've been getting 3 or 4 of these a day for at least a month now. The text can
    always be found in some file of an old book provided by the Gutenberg
    Project, which is making non-copyright texts available through volunteer
    effort.

    I think the theory about using this stuff to untrain spam filters is very plausible.
    But it's difficult to see how it will work. There's no common text among these
    e-mails; in order to send effective spam, there'll have to be at least some text which
    is the same across multiple mails, and that will tend to expose it.
    • Re:The text comes from the Gutenberg Project by John Hasler (Score:2) Wednesday August 09 2006, @12:13PM
    • by misleb (129952) on Wednesday August 09 2006, @12:23PM (#15875062)
      . There's no common text among these
      e-mails;


      I think that is the point. They want to either poison those words so you get more false positives or they want to push other REAL spam related words out of the "this is spam" dictionaries. Maybe both. If these messages had some common theme, they would all get blocked and would have no net effect. They need you to click "this is spam" to poison your filters.

      Question is, does it work? I don't know. Seems to be highly dependent on the nature of your spam filter. Maybe they are only targeting a specific, popular filtering system.

      To me it seems like an act of desperation. I think filters are finally catching up with spammers. It is getting more and more difficult to get spam through a halfway decent filter, and there are a lot of decent filters out there.

      -matthew
      [ Parent ]
      • by letxa2000 (215841) on Wednesday August 09 2006, @01:02PM (#15875377)
        I think that is the point. They want to either poison those words so you get more false positives or they want to push other REAL spam related words out of the "this is spam" dictionaries. Maybe both. If these messages had some common theme, they would all get blocked and would have no net effect. They need you to click "this is spam" to poison your filters. Question is, does it work?


        Answer is: no, it won't, at least not with a Bayesian filter. The only way to mess up a Bayesian filter is to send you messages that are heavy in words/terms that often appear in your good email. And that's going to vary from user to user. Unless you're sending me the exact words that I use in my daily emails, adding a plethora of other words is not going to make my filter any less accurate or create more false positives. It will either let my filter recognize your "poison" as spam itself or, at worst, be neutral.

        My Bayesian filter, among other things, considers an excessive number of infrequently/never used terms as a characteristic that is itself subject to Bayesian classification. So while the "poison words" have no statistical effect on my filter, the fact that a bunch of unusual words are found in a message is going to increase the chance that my filter correctly recognizes the message as spam.
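
One way the "too many unfamiliar terms" characteristic could be fed into a token-based filter is to append a marker token that then gets classified like any other word. This is only a sketch under assumptions: the function name, the marker string, and the 0.5 ratio are invented here, and the commenter's actual filter is not described in enough detail to reproduce.

```python
def add_rarity_feature(tokens, known_tokens, max_unknown_ratio=0.5):
    """Append a marker token when most of a message is unfamiliar vocabulary.

    `known_tokens` is the set of words the filter has already seen in training.
    The marker is an ordinary token, so the filter can learn how spammy it is.
    """
    features = list(tokens)
    if not features:
        return features
    unknown = sum(1 for t in features if t not in known_tokens)
    if unknown / len(features) > max_unknown_ratio:
        features.append("*MANY_UNKNOWN_TOKENS*")  # hypothetical marker name
    return features


known = {"driver", "irq", "patch", "meeting"}
normal = ["driver", "patch", "meeting", "tomorrow"]          # 1 of 4 unknown
padded = ["malvolio", "chesterton", "servitude", "driver"]   # 3 of 4 unknown

print(add_rarity_feature(normal, known))
print(add_rarity_feature(padded, known))
```

Under this scheme the padding text works against the sender: the individual literary words carry little weight, but their sheer unfamiliarity becomes a feature the filter can learn.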

        My spam was constantly growing through about December of last year. This year, it seems to have leveled off. Sure, I'm still getting just under 20,000 per month which sucks, but I see almost none of them and according to my spam stats, the spam has leveled off. Hopefully this is the plateau before it falls. :)

        I still want to know: Who are the idiots who BUY spammed products???


        [ Parent ]
      • Re:The text comes from the Gutenberg Project by Ecks (Score:2) Wednesday August 09 2006, @01:46PM
    • Re:The text comes from the Gutenberg Project by Tremor (APi) (Score:1) Wednesday August 09 2006, @12:30PM
    • by Ed Avis (5917) <ed@membled.com> on Wednesday August 09 2006, @01:02PM (#15875376)
      (http://membled.com/)
      If the spammers are now sending round Gutenberg texts, this is entirely appropriate. Project Gutenberg caused probably the first ever spam, when Michael Hart launched the project by trying to mail everyone on ARPANET with the U.S. Declaration of Independence. (source [lwn.net])
      [ Parent ]
      • Re:The text comes from the Gutenberg Project by mdielmann (Score:3) Wednesday August 09 2006, @03:15PM
      • by crabpeople (720852) on Wednesday August 09 2006, @04:30PM (#15876822)
        (Last Journal: Friday January 30 2004, @06:40PM)
        "Project Gutenberg caused probably the first ever spam,"

        Close but incorrect. I believe it was an ad for some kind of seminar a guy was giving on the west coast. He was from the east coast and had no contacts to sell this product in the west, so he manually typed in hundreds of addresses. I don't know if I can find a link, but I remember reading about it.

        OK, apparently googling for "first spam ever" yields this article [templetons.com]:

        "The sender is identified as Gary Thuerk, an aggressive DEC marketer who thought Arpanet users would find it cool that DEC had integrated Arpanet protocol support directly into the new DEC-20 and TOPS-20 OS. I spoke with him to get his reflections on the event.

        DEC was mostly an east coast company, and he had lots of contacts on the east coast to push the new Dec-20 to customers there. But with less presence on the west coast, he wanted to hold some open houses and reach all the people there. In those days, there was a printed directory of all people on the Arpanet. Gary spoke to his technical associate, and arranged to have all the addresses in the directory on the west coast typed in, and then added some customer contacts in other locations, including people at ARPA headquarters who did not, according to Thuerk, complain.

        The engineer, Carl Gartley, was an early employee at DEC who had been called in to help with promoting the new Decsystem-20. They worked on the message for a few days, going through a few rewrites. Finally, on May 3, Gartley logged on to Gary's account to send the mail. "

        So there you go: first spam, May 3, 1978. There's a reply to it from RMS too (his initial reaction was pro-spam, heh).

        [ Parent ]
    • Wrong... They are using all types of books by technoextreme (Score:2) Wednesday August 09 2006, @01:09PM
    • Re:The text comes from the Gutenberg Project by nutsy (Score:2) Thursday August 10 2006, @02:36PM
  • Spammers beating academy? by UbuntuDupe (Score:1) Wednesday August 09 2006, @12:01PM
  • Not to me... by GmAz (Score:2) Wednesday August 09 2006, @12:01PM
  • specious defillibrator by kimvette (Score:2) Wednesday August 09 2006, @12:01PM
  • My uninformed hunch: screwup... (Score:5, Interesting)

    by nweaver (113078) on Wednesday August 09 2006, @12:01PM (#15874858)
    (http://www.icsi.berkeley.edu/~nweaver/)
    The text block spam is very common WITH images. I suspect that what happened is some lame spammer got a BIG botnet contract, sent out his spam, and forgot to include the image.
  • Whatever it does, it sure is bizarre by Guanine (Score:2) Wednesday August 09 2006, @12:01PM
  • NPR article by Anonymous Coward (Score:2) Wednesday August 09 2006, @12:02PM
  • Spam fell? by Gary W. Longsine (Score:2) Wednesday August 09 2006, @12:03PM
    • Re:Spam fell? by Zenaku (Score:1) Wednesday August 09 2006, @12:13PM
    • Re:Spam fell? by eddy (Score:1) Wednesday August 09 2006, @12:16PM
    • Re:Spam fell? by misleb (Score:2) Wednesday August 09 2006, @01:15PM
  • Un-training? Hardly. (Score:5, Informative)

    by pclminion (145572) on Wednesday August 09 2006, @12:03PM (#15874879)

    Bayesian and other filters do not rely on "spammy" words alone -- they also rely on "unspammy" words, and spammers have no idea what those words are because each person receives different email.

    A scenario, with made-up (but plausible) numbers: Suppose you're a developer of a Linux driver for the Bozodrive 1000. The majority of your legitimate email comes from Linux driver development mailing lists. A full 50% of those emails contain the word "IRQ." 99% of the emails contain the word "driver," and 15% contain the word "Johannsen" which is in the signature of one of your friends. And precisely 0% of the emails containing any of these terms have ever been found to be spam.

    Any decent spam filter will give a huge weight to the presence of these "unspammy" words, because emails containing them are extremely likely to be non-spam. The presence of randomly selected confusion words in empty spams is not going to affect these frequency counts.

    In order to defeat a filter by confusing it, the spammer must guess what the SPECIFIC non-spam words for that PARTICULAR email user are, and then produce bogus spam messages containing those words in the appropriate frequencies. This will cause the classification counts for those words to become more equalized, and the value of those words in determining spamminess to be greatly reduced. However, this is an impossible task unless the spammer has access to the actual emails of the target.

    Perhaps the intent of the empty spams is to confuse the filters, but whoever devised the method has no understanding of how these things actually work, whatsoever.
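
The scenario above can be tried out with a toy per-token counter. This is a minimal sketch, assuming equal priors and a simple clamped ratio of per-class frequencies; the class name, the clamping bounds, and the message counts are illustrative choices, not how any particular filter is implemented.

```python
from collections import Counter


class TokenStats:
    """Per-token message counts for a toy Bayesian-style filter."""

    def __init__(self):
        self.spam, self.ham = Counter(), Counter()
        self.n_spam = self.n_ham = 0

    def train(self, tokens, is_spam):
        # Count each token at most once per message.
        (self.spam if is_spam else self.ham).update(set(tokens))
        if is_spam:
            self.n_spam += 1
        else:
            self.n_ham += 1

    def spamminess(self, token):
        """Estimate P(spam | token) under equal priors, clamped to [0.01, 0.99]."""
        s = self.spam[token] / max(self.n_spam, 1)
        h = self.ham[token] / max(self.n_ham, 1)
        if s + h == 0:
            return 0.5  # never-seen token: no evidence either way
        return min(0.99, max(0.01, s / (s + h)))


stats = TokenStats()
for i in range(100):  # legitimate mail, using the made-up rates from the scenario
    msg = []
    if i < 99:
        msg.append("driver")
    if i < 50:
        msg.append("irq")
    if i < 15:
        msg.append("johannsen")
    stats.train(msg, is_spam=False)
for _ in range(50):  # ordinary spam
    stats.train(["cheap", "pills"], is_spam=True)

before = {w: stats.spamminess(w) for w in ("irq", "driver", "johannsen")}

for _ in range(200):  # literature-padding spam that the user flags as spam
    stats.train(["dickens", "landscape", "servitude"], is_spam=True)
after = {w: stats.spamminess(w) for w in before}

for _ in range(50):  # spam that happens to guess a real ham word
    stats.train(["irq", "cheap"], is_spam=True)
targeted = stats.spamminess("irq")

print(before == after, before["irq"], targeted)
```

In this toy model the padding spam leaves the scores of "irq", "driver", and "johannsen" untouched, because none of those words appear in it. Only the last batch, which guesses an actual ham word, moves that word's score toward neutral.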

  • Other possibilities by Red Flayer (Score:2) Wednesday August 09 2006, @12:05PM
    • Re:Other possibilities (Score:5, Interesting)

      by Coventry (3779) * on Wednesday August 09 2006, @12:12PM (#15874964)
      (Last Journal: Monday July 25 2005, @01:50PM)
      Just like the cryptic number sequence radio/voip 'stations', this could be a method of communication.

      We see so much spam every day that everyone takes it for granted, and everyone runs 'filters'. If I wanted to secretly inform agents to begin operations, a select quote from a book sent as spam to hundreds of thousands of people would be perfect. Everyone ends up on spam lists, and receiving spam is a passive process, so it's even more anonymous than public web forums.

      [ Parent ]
  • Obligatory by gatkinso (Score:1) Wednesday August 09 2006, @12:05PM
  • Weasels abound (Score:3, Interesting)

    by Bullfish (858648) on Wednesday August 09 2006, @12:06PM (#15874899)
    I have seen some of these slip through for a while. I think the only purpose for them is to get some neophyte who is confused by them to send back a "WTF?" response, thereby confirming a "live one". I suspect that after that the floodgates open. I am sure that we will see many more attempts to circumvent filters. After all, weasels abound.
  • I buy the "broken spamware" angle (Score:5, Insightful)

    by nuzak (959558) on Wednesday August 09 2006, @12:08PM (#15874918)
    The WSJ article also gives due time to the theory that the spamware is simply broken and that the spam is being delivered with the padding but not the payload. Since I've previously seen plenty of Gutenspam (my name for this spam that contains snips from Gutenberg texts) with an image payload attached, I'm definitely leaning toward the notion that they slipped up somewhere and are now not delivering the image.

    Woe betide literature discussion groups now that filters are trained on the classics.

    • Re:I buy the "broken spamware" angle (Score:5, Interesting)

      by Richard_at_work (517087) <richardprice.gmail@com> on Wednesday August 09 2006, @12:26PM (#15875093)
      I don't think this is the case, as I've been getting these sorts of emails for at least 3 years (looking back at the spam archive I keep to train from): random blocks of legible text, blocks of pseudo-English (the words are correct but there's no effort at sentence structure), even jokes on their own. I got intrigued by this about 6 months ago and wrote a few scripts to see if it was just a broken spam client forgetting to add the payload, but your average 'with payload' spam doesn't seem to match these emails; there are practically no similar 'with payload' spams in my archive with these blocks of text.

      I always wrote it off as Bayesian filter poisoning.
      [ Parent ]
  • My home spam filter does not seem to be affected much. I run dspam [nuclearelephant.com], which has a feature whereby it forgets words over time if they are not used in spam. Since the text is usually different or random, it does not have any significant effect on generating false positives. In the years I have been running dspam with tens of thousands of emails, I have only gotten 3-4 false positives.

    By having a Bayesian filter forget over time, it also helps shrink down the database and helps it adapt as the contents of spam change over time.
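
A minimal sketch of the forgetting idea, assuming each token simply records when it was last seen in training and is dropped after a fixed age. The class name, the 90-day window, and the purge policy are assumptions for illustration and are not taken from dspam's actual implementation.

```python
from datetime import datetime, timedelta


class AgingTokenDB:
    """Toy token database that drops tokens not seen in training for a while."""

    def __init__(self, max_age=timedelta(days=90)):  # assumed retention window
        self.max_age = max_age
        self.counts = {}      # token -> [spam_count, ham_count]
        self.last_seen = {}   # token -> datetime of most recent training hit

    def train(self, tokens, is_spam, now):
        for t in set(tokens):
            c = self.counts.setdefault(t, [0, 0])
            c[0 if is_spam else 1] += 1
            self.last_seen[t] = now

    def purge(self, now):
        """Remove stale tokens and return how many were dropped."""
        stale = [t for t, seen in self.last_seen.items() if now - seen > self.max_age]
        for t in stale:
            del self.counts[t]
            del self.last_seen[t]
        return len(stale)


db = AgingTokenDB()
t0 = datetime(2006, 1, 1)
db.train(["dickens", "landscape"], is_spam=True, now=t0)   # one-off padding words
db.train(["cheap", "pills"], is_spam=True, now=t0)

later = t0 + timedelta(days=120)
db.train(["cheap", "pills"], is_spam=True, now=later)      # recurring spam words
removed = db.purge(later)
print(removed, sorted(db.counts))
```

Random padding words tend to appear once and never again, so an age-based purge clears them out while recurring spam vocabulary keeps getting refreshed.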

    Of course I also use other spam blocking techniques, like using realtime black lists (RBLs) and blocking a number of Chinese subnets... I should add tpnet.pl and Verizon as well.
    • by pclminion (145572) on Wednesday August 09 2006, @12:12PM (#15874970)

      By having a Bayesian filter forget over time, it also helps shrink down the database and helps it adapt as the contents of spam change over time.

      Having the filter forget is the ONLY effective policy. In statistical filtering, it is certainly NOT true that more data == better results. You want a sample of data that m