Spam The Internet

How to Get Rid of Referrer Spam?

wikinerd asks: "I have recently opened my own community website. Everything was fine until spammers found it, which happened quite quickly. As usual they filled up my mailboxes, but SpamAssassin can take care of that when it is needed. Then they discovered my blog and my wikis and employed their bots to fill them up with spam comments. I solved this problem by moderating all comments. Now, however, they have employed another evil trick: referrer spam. They caused my webserver statistics to grow by orders of magnitude by making their stupid websites show up on my referrer lists. Unfortunately, my webserver usage statistics are now full of viagra, poker, casino, porn, spyware, and pharmacy sites. I am afraid that this is a problem I cannot solve with the knowledge and the tools I have at the moment. So, I came here to ask Slashdot readers: How can I fight referrer spam and what tools are available in a GNU/Linux environment to ensure clean and spam-free usage statistics?"

  • by Anonymous Coward on Friday February 04, 2005 @10:56AM (#11572618)
    I'll assume you're using Apache and have access to the .conf, or someone that does.

    First, you need to set up the log you'll use for statistics to exclude requests marked with a "badreferer" environment variable.

    CustomLog logs/access_log-www.example.com combined env=!badreferer

    The following requires Apache's SetEnvIf module. You can put these lines in .conf, or even in .htaccess so you can change them without a restart. If you don't have/want SetEnvIf, you can also use mod_rewrite (E=badreferer:1 at the end of your RewriteRule) to do the same thing.

    #Blacklist (adjust as you need)
    SetEnvIfNoCase Referer ".*(credit|hold-em|holdem|mortgage|money|cash|gb.com|4free|teen|pussy|discount|inkjet|fuck|hasfun|casino|gambling|poker|porn|sex|paris|nude|xxx|hilton|adminshop|devaddict|iaea|peng|just-deals|pisx|tecrep-inc|learnhow|phentermine|terashells|psxtreme|freakycheats).*" badreferer
    #Whitelist (optional)
    SetEnvIfNoCase Referer ".*(google|yahoo|alltheweb|search|excite|aol.com|lycos|msn|altavista|XXXX).*" !badreferer

    Additionally, you can use the same blocks to deny them access to your site:

    <Limit GET HEAD POST>
    Order Allow,Deny
    Allow from All
    Deny from env=badreferer
    </Limit>

    <LimitExcept GET HEAD POST>
    Order Deny,Allow
    Deny from All
    </LimitExcept>
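
    If you can't change the server configuration, the same blacklist-then-whitelist logic can be applied to an existing log before it is fed to a statistics tool. What follows is only a rough sketch in Python, not part of the Apache setup above: the pattern lists are shortened, hypothetical stand-ins for the SetEnvIfNoCase lines, and the parser assumes the standard "combined" log format with no escaped quotes inside the request field.

```python
import re

# Hypothetical, shortened pattern lists; adjust them like the SetEnvIfNoCase lines.
BLACKLIST = re.compile(
    r"credit|hold-?em|mortgage|casino|gambling|poker|porn|phentermine",
    re.IGNORECASE,
)
WHITELIST = re.compile(r"google|yahoo|alltheweb|altavista", re.IGNORECASE)

# Combined Log Format:
# host ident user [time] "request" status bytes "referer" "user-agent"
COMBINED = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[[^\]]*\] "[^"]*" \d{3} \S+ '
    r'"(?P<referer>[^"]*)" "[^"]*"'
)


def is_bad_referer(referer):
    """Mirror the two-step logic: the blacklist sets the flag, the whitelist clears it."""
    bad = bool(BLACKLIST.search(referer))
    if WHITELIST.search(referer):
        bad = False
    return bad


def clean_log(lines):
    """Yield only the log lines whose referrer is not flagged."""
    for line in lines:
        match = COMBINED.match(line)
        if match and is_bad_referer(match.group("referer")):
            continue
        yield line
```

    Lines that do not parse are passed through unchanged, so a malformed entry is never silently dropped.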
    • by wowbagger ( 69688 ) on Friday February 04, 2005 @11:25AM (#11572971) Homepage Journal
      I'd take it one step further - log the IP addresses of the machines denied by the bad referrer, and report them to their ISP, and to some of the open relay/trojan blacklists.

      You could even try configuring your software to use such blacklists to deny trojaned machines access completely.

      Additionally, if you wanted, you could then add those IP addresses to your firewall rules to drop the requests at the firewall.

      Lastly, you could teergrube them - set things up to...

      Respond...

      Very...

      Slowly...

      To...

      Their...

      Request...
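
      For the logging step, here is a rough Python sketch that tallies which client addresses sent blacklisted referrers, so the list can go into an abuse report or a firewall rule. The pattern list is a hypothetical, shortened one and the parser assumes Apache's "combined" log format; neither comes from a real setup.

```python
import re
from collections import Counter

# Hypothetical, shortened blacklist; extend it to match your own rules.
BAD_REFERER = re.compile(r"casino|poker|porn|phentermine", re.IGNORECASE)

# Leading part of a combined-format line, up to and including the referrer.
COMBINED = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[[^\]]*\] "[^"]*" \d{3} \S+ "(?P<referer>[^"]*)"'
)


def offending_hosts(lines):
    """Count requests per client address that carried a blacklisted referrer."""
    counts = Counter()
    for line in lines:
        match = COMBINED.match(line)
        if match and BAD_REFERER.search(match.group("referer")):
            counts[match.group("host")] += 1
    return counts
```

      Calling offending_hosts(open("access_log")).most_common(20) would list the twenty busiest offenders. Keep in mind that the referrer is forged but the connecting address is often just a compromised machine.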

  • Could Wikinerd or Cliff post an example of how these appear in Wikinerd's blog? I have a guestbook myself that gets filled with things that say "great site" from some dumb address like cara@aol.com, and then it is filled with a bunch of keyword HTML links to randomly-generated .info sites (5544f45.info, etc) that all go to one of those useless spammy search engines.
    • That's comment spam. Referrer spam targets (I think) sites that have a viewable list of top referrers.

      Here's a series of posts [littlegreenfootballs.com] dealing with the issue on LGF. (Note: I'm posting this link in the context of referrer spamming -- no political statement is intended, and no political arguing over it is desired.)

  • by j-turkey ( 187775 ) on Friday February 04, 2005 @11:08AM (#11572740) Homepage

    I hope I'm not being too rude, but seriously, I googled for referrer spam [google.com] and bam...first result had some decent advice [spywareinfo.com]. This was just the first thing that came up. Add the word "apache" to your query and you will get some very helpful results [google.com]. Besides, this is Slashdot...not a trove of reliable information/advice. Just start using Apache to start blocking the Mallorys. Also, if you're still posting any kind of statistics or referrers publicly, stop. Spammers wouldn't do this if Bloggers didn't publish that kind of abusable data.

    • by bill_mcgonigle ( 4333 ) * on Friday February 04, 2005 @11:17AM (#11572866) Homepage Journal
      Should you decide to move two centimeters towards rude, slashdot plays nicely with these links [justfuckinggoogleit.com].
    • by BoomerSooner ( 308737 ) on Friday February 04, 2005 @11:49AM (#11573267) Homepage Journal
      For those of you out there that still cannot figure it out. Ask slashdot is for the poster but also can provide relevant information to other people that didn't think of the problem in the same way. For example, I do not host any blogs at my company but if I decided to I would have this question and answer set as a good reference (in addition to googling).

      Googling isn't always the best; people frequently contribute things to this blog that you cannot duplicate by a simple query on google.

      And last but not least you can always turn ask slashdot off in your preferences....

      So for the last fucking time: YES HE CAN GOOGLE IT BUT SHE DECIDED TO ASK SLASHDOT INSTEAD. Move on.
      • So for the last fucking time: YES HE CAN GOOGLE IT BUT SHE DECIDED TO ASK SLASHDOT INSTEAD. Move on.

        Hey, be nice. Was I really impolite (kinda like you're being right now)? Did I, or did I not provide helpful information to the poster?

        Lighten up, Francis.

      • by yog ( 19073 ) on Friday February 04, 2005 @12:47PM (#11573962) Homepage Journal
        Yeah, I find ask slashdot useful too. When you filter out the "Why didn't you just google it, moron?" type comments and the "why would you want to do that anyway" trolls, you sometimes get some useful information and discussion regarding the various ways to solve the O.P.'s problem.

        I see Ask Slashdot not as a substitute for a simple keyword search but rather a supplemental verification process. I have found that keyword searches don't necessarily reveal best practices; you get unedited, unrefuted claims that you have to sift through. In a reasonably informed techie discussion forum like Slashdot (sometimes), you can get some interesting debate and comparisons on various approaches and methodologies.

        And, as you noted, it's a way to be exposed to problems which I don't currently have but might someday. Then when I encounter the problem, I hope a little fragment of memory in my aging brain will bubble to the surface to remind me that it's been discussed on Slashdot.

        For researching technical problems, the best thing is to combine Google, Slashdot, Usenet newsgroups, and specialty forums such as (in the O.P.'s case) webhostingtalk.com, spend a little time in each place and take notes. From amongst voluminous chaff generally there's a bit of wheat to be harvested. ;-)

        At the risk of belaboring the obvious, it should also be noted that the way to put useful information out there in the first place so that googlers can find it is precisely this sort of forum. Google is only your friend if there's something out there worth searching for.

      • I don't blame people for naively posting lame "I'm stuck" questions. I do blame editors for being too lazy to filter them out. And (not for the last time, alas): IT DOESN'T MAKE SENSE TO POST A QUESTION ON SLASHDOT UNLESS IT WILL LEAD TO AN INTERESTING DISCUSSION. A question that can be answered by a simple google is not very interesting.
      • When I google for something, it's often difficult to get any useful search results besides, "why don't you google for it." While I can appreciate finding information without involvement of a forum directly, searching for information sometimes turns into a recursive black hole.

        What I have seen here is a better compilation of information than I have seen yet. So I thank the person for asking.
    • by Jerf ( 17166 ) on Friday February 04, 2005 @12:43PM (#11573903) Journal
      if you're still posting any kind of statistics or referrers publicly, stop. Spammers wouldn't do this if Bloggers didn't publish that kind of abusable data.

      They don't bother checking to see if your site publishes their referrers publicly. I don't and I have it anyhow, of course. Also note my site uses a fairly obscure weblogging platform (PyDS), and that I've also customized the templates until there's no recognizable signature of any platform on my site, and I was still getting hammered.

      I've gone with an .htaccess solution. Here's what I'm currently using, updated just today, based on this [yarinareth.net]:
      RewriteEngine On
      RewriteBase /
      RewriteCond %{HTTP_HOST} !^(www\.)?jerf\.org$ [NC]
      RewriteCond %{HTTP_REFERER} ^(.*)$ [NC]
      RewriteRule ^(.*)$ %1 [R=301,L]
      SetEnvIfNoCase Referer ".*(crescentarian|xanax|datashaping|psxtr|phente|terash|1stchoic|learnhowtoplay|1stchoice|pharmacy|profitbook|auction|cialis|stories-on|levitra|roulette|prozac|debt|discount|\.biz|alumni|cheat|loan|diet|tax\.|exams|krantas|atlanta|paramountseed|web4u|mcdortablar|reservedi|credit|canadianlabels|8gold|texas-hold|hold-em|holdem|fidelityfunding|condo|sportsparent|mortgage|spoodles|money|cash|hotel|houseofseven|stmaryonline|newtruths|popwow|oiline|flafeber|thatwhichis|tmsathai|pisoc|crepesuzette|mediavisor|commerce|easymoney|911|.vi|\.gb\.|gb\.com|4free|macsurfer|teen|pussy|discount|blogincome|lillystar|aizzo|webdevsquare|laser-eye|escal8|xopy|vixen1|linkerdome|youradulthosting|fick|inkjet-toner|fuck|ime.nu|perfume-cologne|italiancharmsbracelets|shoesdiscount|psnarones|hasfun|casino|gambling|poker|porn|sex|paris|gabriola|nude|xxx|hilton|pics|video|adminshop|devaddict|iaea|empathica|insuranceinfo|atelebanon|handy-sms|peng|just-deals|pisx|rimpim).*" BadReferrer
      order deny,allow
      deny from env=BadReferrer
      Slashdot inserts stray spaces into long lines like that one, so make sure the pattern you paste contains none, or grab it here [jerf.org]. (That's a symlink to the real thing, so it includes a couple of things you don't need; if you understand Apache enough to use this, it should be obvious which those are.)

      Don't forget to update the first RewriteCond line to match your server name.

      Unfortunately, this has known false positives [jerf.org], but nothing too bad for me yet. But this approach won't scale; we'll either need something more sophisticated, or to make referrer spam less useful for spammers until they stop doing it. (The recent "nofollow" tag is a good start, since referrer spam is Yet Another way to try to steal Google Juice.)
  • PHP bayesian filter. (Score:3, Informative)

    by HansF ( 700676 ) on Friday February 04, 2005 @11:09AM (#11572764) Journal
    You could write a module that would check entries from your referrer log.
    The best way to check if it's spam would be with a bayesian filter [phpgeek.com].
    Sure, it will take some coding / training the filter but this seems to me like the best option.
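
    To make the idea concrete, here is a very small naive Bayes classifier over the tokens of a referrer URL. It is a Python sketch written for illustration, not the PHP module linked above; the class name, the tokenizer and the add-one smoothing are all assumptions, and whether a bare URL carries enough signal for this to work in practice is an open question.

```python
import math
import re
from collections import Counter


def tokenize(url):
    """Split a referrer URL into lowercase alphanumeric tokens."""
    return [t for t in re.split(r"[^a-z0-9]+", url.lower()) if t]


class ReferrerBayes:
    """Hypothetical two-class (spam/ham) naive Bayes filter for referrer URLs."""

    def __init__(self):
        self.tokens = {"spam": Counter(), "ham": Counter()}
        self.docs = {"spam": 0, "ham": 0}

    def train(self, url, label):
        """Record one hand-labelled referrer; label is 'spam' or 'ham'."""
        self.docs[label] += 1
        self.tokens[label].update(tokenize(url))

    def spam_probability(self, url):
        """Return P(spam | tokens) using add-one smoothing."""
        vocab = len(set(self.tokens["spam"]) | set(self.tokens["ham"]))
        total_docs = self.docs["spam"] + self.docs["ham"]
        scores = {}
        for label in ("spam", "ham"):
            total = sum(self.tokens[label].values())
            # Smoothed class prior.
            score = math.log((self.docs[label] + 1) / (total_docs + 2))
            for token in tokenize(url):
                # Smoothed token likelihood; unseen tokens get a small weight.
                count = self.tokens[label][token]
                score += math.log((count + 1) / (total + vocab + 1))
            scores[label] = score
        # Turn the two log scores into a probability, clamped to avoid overflow.
        diff = max(min(scores["ham"] - scores["spam"], 700.0), -700.0)
        return 1.0 / (1.0 + math.exp(diff))
```

    You would train it on referrers you have already sorted by hand and then flag anything scoring above a threshold of your choosing, say 0.9.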
    • It seems extremely unlikely to me that a Bayesian filter could work. There isn't enough for it to get a hold of. Plus, too much of the referrer spam is entirely new sites, which can be made up arbitrarily.

      Bayesian filters are cool and all, but they aren't magic. If you don't understand them, then when you're wondering "why hasn't somebody tried using a Bayesian filter for this problem?", the answer is probably "because it isn't an appropriate solution". After they got popular for spam, there was a mini-ren
      • Well, I must admit I haven't had any personal experience with this specific referrer-spam problem. But I've tried the PHP module and think it learns pretty fast. Maybe I should experiment with the referred script and some URLs a little later.
        Personally I think it's as good as any solution because it will be smarter and more adaptive than most word-filter ideas mentioned in this thread.
        Furthermore, you raised a valid point. A url is quite limited to filter. But maybe the script could get the referring page
        • That might work. You're getting into arms race mode, though; it'd be easy to lie to the server you just spammed with any nice page, even including a nice link to the site you just spammed, while being a spam page for everybody else. The text your Bayes filter receives no longer necessarily matches what is being sent out. That wouldn't be perfect, but that won't bother the spammers.

          Remember, in general, against an intelligent human attacker, only intelligent human vigilance can win. You can "what if, what i
  • by rudy_wayne ( 414635 ) on Friday February 04, 2005 @11:17AM (#11572859)

    Take off and nuke 'em from orbit.

    Just to be sure.

    • Just want to echo parent's comments - it's a losing battle ... and if you publish 'em, they will come.

      I have the web analysis program (Analog) generate reports privately with the referrers, but anything I put out does NOT show that. For those interested, I have a page about referrer log spamming. [komar.org]

    • Reminds me of the part of HGTG when Ford and Arthur stumble upon the crashed ship full of telephone booth sanitisers and advertising agency workers, etc.

      Perhaps we should launch all these questionable people into orbit and crash them into the nearest star?
  • by IO ERROR ( 128968 ) * <error.ioerror@us> on Friday February 04, 2005 @11:17AM (#11572873) Homepage Journal
    Here's some handy Apache rules I've collected in my .htaccess file while fighting comment spammers:
    <IfModule mod_rewrite.c>
    RewriteEngine On
    # Many robots do not handle SGML or HTML correctly. These rules catch them and
    # punish them:
    RewriteRule &amp; - [NC,F,L]
    # Active exploits out in the wild
    RewriteCond %{HTTP_USER_AGENT} ^(LWP) [NC,OR]
    # Comment spammer software
    RewriteCond %{HTTP_USER_AGENT} ^(.*MSIE.*Win.9x.4.90|8484.Boston.Project|grub.crawler|Indy.Library|Java.1|MSIE.*Windows.XP) [NC,OR]
    # Miscellaneous suspicious software
    RewriteCond %{HTTP_USER_AGENT} ^(.*DTS.Agent|libwww-perl|POE-Component-Client|WISEbot|.*WISEnutbot) [NC,OR]
    # Last condition of the OR chain: no OR flag here, or the rule would match every request
    RewriteCond %{HTTP_USER_AGENT} ^(Mozilla...0)$ [NC]
    RewriteRule .* - [F,L]

    # Blank user agents, not a trackback
    # Needed because WP before 1.5-beta doesn't include a user-agent
    RewriteCond %{HTTP_USER_AGENT} ^(-?)$
    RewriteCond %{REQUEST_URI} !^(.*trackback) [OR]
    RewriteCond %{REQUEST_METHOD} !^POST$
    RewriteRule .* - [F,L]
    </IfModule>
    Also consider the SpamAssassin plugin for WordPress [ioerror.us] which has also been ported to MovableType [kahunaburger.com].
  • At least for WordPress. It's called Spam Karma. I'm lazy, Google for it.

    If Spam Karma finds questionable words in comments -- it's configurable, and it comes with a good default list -- it sends users to a captcha. If they fail at the captcha -- and they're not on a strongbad keyword list like "viagra" and "vegas poker" -- the comments are sent for moderation.

    Works great for me. Nope, the URL in my profile is not my blog anymore, it's on my own server, it's in Portuguese and I ain't gonna expose my serve
  • Password (Score:2, Interesting)

    by rehannan ( 98364 )
    I just password protect the directory with the server stats.
    • That really works. I was getting a pile of spam hits, but I put a password on the log stats directory and it's dropped off a bit.
      • That's a great idea, but if you don't link to it in the first place, the search engines won't know it's there (I accept the fact that users might do this for you or you might have done it long ago and can't reverse it). I would also suggest a trick on the spam spiders like this:

        1. Set up your robots.txt to disallow random directory.
        2. Put an index file in there that will add any ip that visits the page to your firewall blocklist.

        What happens is that ALL good spiders obey the robots.txt and the bad ones u
  • If you use AWSTATS (Score:2, Informative)

    by hairtrigger ( 736904 )
    There is a patch you can apply, available here [sourceforge.net] that will prevent referer spam from showing up in reports.
  • by Anonymous Coward
    http://www.google.com/googleblog/2005/01/preventing-comment-spam.html

    per googleblog:

    Q: How does a link change?
    A: Any link that a user can create on your site automatically gets a new "nofollow" attribute. So if a blog spammer previously added a comment like

    Visit my <a href="http://www.example.com/">discount pharmaceuticals</a> site.

    That comment would be transformed to

    Visit my <a href="http://www.example.com/" rel="nofollow">discount pharmaceuticals</a> site.

    --

    just add this f
  • At my own homepage (codesweep.com):

    A) The code for it is homemade, would be a pain in the butt to re-tool a bot for little old me vs. all the livejournal, blogger, etc. sites out there...
    B) I'm so insignificant out there with such low traffic the spammers probably wouldn't care anyways
    C) If the spammers do start caring, I can code my blog around them to defeat them. So far it hasn't created a problem, but the stronger the problem the stronger my response will be...
  • by aberson ( 461047 ) on Friday February 04, 2005 @12:45PM (#11573930) Homepage
    Comment spam can be easily stopped by requiring a password - you can even publish the password right on the website so humans see it and bots don't. I did it for Movable Type and it was pretty easy [dumbengineer.com]. As for referrer spam... it seems to me that the only way referrer spam is fruitful is if your log files are publicly visible and if they are parsed by Google (etc.), unless I don't understand referrer spam. So why not just remove all links to your logfiles, add a robots.txt file, and maybe even password protect where your logfiles are stored? I would assume that a referrer spambot wouldn't even try to target your page unless it knew your referrer logs were linked off your page...
  • I've taken to filtering my e-mail with whois and by protocol deviations. I can see how I could be wrong, but I'm guessing that the same approach can be thrown at the referrer spammers, that:
    1> The headers their clients send are different than those of ordinary clients.
    2> That the properties revealed by whois are different for refer spammer clients than for ordinary clients.
    3> That the whois properties for the spam refer sites are different than those of legitimate sites.

    I'll bet that ignoring input f
  • Protect your stats (Score:1, Informative)

    by Anonymous Coward
    If you protect your stats with apache/whatever authentication then robots can't find your stats via google/whatever search engines, and they will probably stop spamming you. I find that every time I unprotect the stats for openphoto.net I get referer spammed to death.

    $0.02,

    _Michael.
    • Nah. My stats page has never been visible without a password, and I get referrer spam all the time. But frankly, I don't care, because it isn't doing them any good.

      It's the comment/trackback spam that bugs me, and like another poster said, Spam Karma (on Wordpress, anyway) seems to be working wonders. (This is after trying built-in moderation, three strikes, stopgap, and several other methods)
  • Captcha any referral that's not white-listed.
    Captcha access to the referral log.
  • I added /stats/ to my robots.txt.

    The stats pages no longer show up on any search engines, so a) The spammers get no 'pagerank' from those links (which is what they do it for) and b) they can't find the stats pages.

    I was getting shitloads of referer spam; within a week (as soon as google updated) it dropped to nothing. I've had no referer spam AT ALL since then.

    Perhaps they'll start just crawling the entire web, but it appears that at the moment they do a google search to find pages that post their refere
  • by chongo ( 113839 ) * on Friday February 04, 2005 @03:02PM (#11575548) Homepage Journal
    We started seeing this type of spam back in June of 2004. In our case the referrer spam was attempting to get webalizer to create links in the "top N referrer" table back to their pron sites.

    Our initial attempt to solve this was to complain to the ISP of the referrer spammers. That did no good. The ISP was willing to listen, but not to act.

    We did manage to actually track down the jerks who were doing the referrer spam. They told us that they were attempting to create links back to their sites for better search engine placement.

    Our workaround was twofold. For various reasons we wanted to keep our webalizer [mrunix.net] stats externally accessible. So we asked bots (the ones that follow the rules, at least) not to index our external stats, and we modified webalizer to not form links back to the referrers.

    We edited our robots.txt file to exclude legit bots from our stats:

    User-agent: *
    Disallow: /stats
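
    A quick way to check that such a rule covers what you expect is Python's standard urllib.robotparser module. This is only a sanity-check sketch; the helper name is_crawlable and the sample paths are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# The same two rules as in the robots.txt shown above.
ROBOTS_TXT = """\
User-agent: *
Disallow: /stats
"""


def is_crawlable(path, user_agent="*"):
    """Return True if a rule-abiding bot may fetch the given path."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, path)
```

    With these rules, is_crawlable("/stats/index.html") is False while is_crawlable("/index.html") is True. It only tells you what well-behaved bots will do, of course.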

    We also patched webalizer v2.01-10 [isthe.com] to no longer form URLs to referrers. Now only a plain text line without the leading http:// shows up in the table. The original referrer spammers gave up when they lost the links back to their sites.

    The bottom of the 0.basic.patch prevents webalizer [mrunix.net] from forming links back to referrers. See README-FIRST [isthe.com] for details on this patch set.
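
    For anyone generating their own stats pages, the same idea (show the referrer as plain text, minus the scheme, instead of as a link) might look like the sketch below. This is Python written for illustration, not the webalizer patch itself, and the function name is made up.

```python
import html
import re


def referrer_as_text(referrer):
    """Return the referrer without its scheme, escaped for use as plain text in HTML."""
    # Drop a leading "http://", "https://" or any other scheme prefix.
    without_scheme = re.sub(
        r"^[a-z][a-z0-9+.-]*://", "", referrer, flags=re.IGNORECASE
    )
    # Escape so that a crafted referrer cannot inject markup into the report.
    return html.escape(without_scheme)
```

    The escaping matters as much as dropping the link: the referrer is attacker-controlled input that ends up in your HTML.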

  • by JoeD ( 12073 ) on Friday February 04, 2005 @03:32PM (#11575849) Homepage
    My first suggestion would be to stop publishing the referrer links.

    But if you have to, then put "rel=nofollow" in the link itself. This makes Google (and other search engines) discard the link when calculating search rankings.

    Go here [google.com] for more info.
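
    If you build the referrer list yourself, adding the attribute is a one-liner. A minimal Python sketch, with a made-up function name; it assumes you only ever want to link http and https referrers and shows anything else as escaped text.

```python
import html


def referrer_link(url):
    """Build a rel="nofollow" anchor for a referrer URL, or plain text if the scheme is odd."""
    safe = html.escape(url, quote=True)
    # Only link ordinary web URLs; everything else is shown as escaped text.
    if not url.lower().startswith(("http://", "https://")):
        return safe
    return '<a href="{0}" rel="nofollow">{0}</a>'.format(safe)
```

    For example, referrer_link("http://www.example.com/") produces a link in the same form as the one Google describes, with rel="nofollow" on the anchor.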
  • It was originally intended for comment spam, but just add the same rel="nofollow" to your referrer lists. Read about it [google.com]. Granted, this won't prevent it, but if everyone starts doing this, this technique will become useless for spammers.
  • mod_security (Score:3, Informative)

    by Imabug ( 2259 ) on Friday February 04, 2005 @07:51PM (#11578850) Homepage Journal
    I installed mod_security [modsecurity.org] on my server a few weeks ago with a few simple regexes to cover the more prolific referrer spammers recorded by awstats. Set the mod_security default action to deny,status:412. Then in httpd.conf I set the ErrorDocument for the 412 code to an empty file.

    Now when the referer spammer hits my site, they get denied and get nothing back. Bandwidth wasted serving up pages to referer spammers is cut to virtually nil. The spammers are still there banging away and a few still get by though. The list of referrers needs to be monitored so that new mod_security rules can be added as required. That's no different than using mod_rewrite to deny the referrer spammers though.
  • I believe the problem with spam lies with the stupid lusers who actually click on the links and purchase stuff from them. Let's take a look at some of the latest spam...

    Porn: anybody that wants good porn knows to look at p2p solutions (just look in the right spots, it's all there for free)
    viagra, etc: if you don't know that it doesn't work, you're an idiot
    free stuff: nothing in life is free
    special service: there are always strings attached
    correct your account information: if you get your identity "stolen"
