Microsoft Tracks Down Mass Fake Web Pages

Posted by Zonk on Tue Mar 20, 2007 08:38 AM
from the going-from-point-a-to-z74 dept.
An anonymous reader writes "According to an article in The New York Times, Microsoft researchers have discovered tens of thousands of junk Web pages, created only to lure search-engine users to advertisements. While most of us have run across them from time to time, the company researchers have found the pages are deliberately generated in vast numbers by a small group of shadowy operators. By following the money trail, Microsoft researchers were able to track the flow from big-name advertisers to search engine spammers. Many use Google's blogspot.com to set up spam doorway pages. 'The practice has proved to be a vexing problem for the major search companies, which struggle to prevent both spammers and companies specializing in improving legitimate clients' Web traffic -- a field known as search-engine optimization -- from undermining their page-ranking systems. Surprisingly, the researchers noted that the vast bulk of the junk listings was created from just two Web hosting companies and that as many as 68 percent of the advertisements sampled were placed by just three advertising syndicators.' The report is available at Microsoft Strider Search Ranger project page."
This discussion has been archived. No new comments can be posted.
The Fine Print: The following comments are owned by whoever posted them. We are not responsible for them in any way.
  • The easy way (Score:5, Interesting)

    by truthsearch (249536) on Tuesday March 20 2007, @08:43AM (#18413315) Homepage Journal
    They could have saved a lot of time and money by just visiting forums like DigitalPoint. These doorways and other spammy sites are for sale every day. It's no secret.
  • by gEvil (beta) (945888) on Tuesday March 20 2007, @08:46AM (#18413353)
    Man. This Microsoft project is just a ripoff of Google's Gandalf Search Wizard project...
    • The report is available at Microsoft Strider Search Ranger project page.

      Man. This Microsoft project is just a ripoff of Google's Gandalf Search Wizard project...
      Yeah, but let's not forget that even before that was AOL's Smeagol Browser Gollum project ...
      • The report is available at Microsoft Strider Search Ranger project page.

        Man. This Microsoft project is just a ripoff of Google's Gandalf Search Wizard project...

        Yeah, but let's not forget that even before that was AOL's Smeagol Browser Gollum project
        When I was a kid, all we had was the U of Minnesota's Sauron Gopher Overlord project...
    • Register.com among the Businesses, Melbourne IT to the Australians; Tucows I was in my youth that is forgotten, in the South ENom, in the North GoDaddy, to the East I go not...
  • Why? (Score:5, Interesting)

    by Herkum01 (592704) on Tuesday March 20 2007, @08:48AM (#18413379)
    Is it really cheaper to use Page Ranking companies instead of just well, PAYING for an advertisement on Google or MSN or something?
    • Re:Why? (Score:5, Insightful)

      by Frosty Piss (770223) on Tuesday March 20 2007, @08:52AM (#18413433)

      Is it really cheaper to use Page Ranking companies instead of just well, PAYING for an advertisement on Google or MSN or something?
      Yes, or they wouldn't do it.
      • Re:Why? (Score:5, Informative)

        by fruey (563914) on Tuesday March 20 2007, @09:11AM (#18413683) Homepage Journal
        The average return on investment on Search Engine Optimisation (generally: increasing your search position on specific keywords relevant to your business) can be about 10x more than the return on keyword purchasing, which can cost from 30 cents to several dollars. Every click costs money.

        Once you've optimised to your keywords in "natural search" e.g. *free* results, then your investment keeps paying (you need to maintain positions of course, but this is lower cost, especially if you're in a niche) whereas in paid advertising you have to keep giving money to Google and, in competitive industries, your cost per click will be subject to significant inflation...
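To put rough numbers on the comparison above, here is a minimal sketch of the arithmetic. Every figure in it is hypothetical and none comes from the thread: the click volume, the SEO fees, and the assumption that organic placement delivers the same number of clicks as the paid campaign are all placeholders to be replaced with real data.

```python
# Hypothetical model: compares cumulative paid-click spend with a one-off
# SEO cost plus flat maintenance. All inputs are illustrative assumptions.

def paid_cost(months, clicks_per_month, cost_per_click, monthly_cpc_inflation=0.0):
    """Total spend on paid clicks, with the per-click price compounding monthly."""
    total = 0.0
    cpc = cost_per_click
    for _ in range(months):
        total += clicks_per_month * cpc
        cpc *= 1.0 + monthly_cpc_inflation
    return total


def seo_cost(months, upfront, monthly_maintenance):
    """One-off optimisation cost plus a flat monthly maintenance fee."""
    return upfront + months * monthly_maintenance


def break_even_month(clicks_per_month, cost_per_click, upfront,
                     monthly_maintenance, monthly_cpc_inflation=0.0, horizon=120):
    """First month in which cumulative paid spend reaches cumulative SEO spend.

    Returns None if that never happens within `horizon` months.
    """
    for month in range(1, horizon + 1):
        paid = paid_cost(month, clicks_per_month, cost_per_click, monthly_cpc_inflation)
        if paid >= seo_cost(month, upfront, monthly_maintenance):
            return month
    return None


# Example with made-up inputs: 1000 clicks a month at 30 cents each,
# against 1900 up front and 100 a month for upkeep.
print(break_even_month(1000, 0.30, upfront=1900, monthly_maintenance=100))
```

Raising `monthly_cpc_inflation` pulls the break-even point earlier, which is the effect the comment describes for competitive keywords.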
      • Sometimes businesses do stuff that doesn't work out -- they go bankrupt every day.
    • it may not be cheaper, but it may be more effective. search engines generally identify results that were purchased, and i'm sure a user is less likely to click on it if they see that. the clients of these companies are buying their way into the results without having to be in that section.
    • Re:Why? (Score:4, Insightful)

      by terraformer (617565) <terraformer@terranovum.com> on Tuesday March 20 2007, @09:42AM (#18414093) Homepage Journal
      It is also more effective. How many times do you click on ads? Now how many times do you click on search results? 'nough said...
  • "time to time"? (Score:5, Insightful)

    by Frosty Piss (770223) on Tuesday March 20 2007, @08:49AM (#18413391)

    While most of us have run across them from time to time...

    Time to time? For me it seems like more than 50% when I scan the search results. Maybe less, maybe more, but certainly more than "time to time". For many of my searches, I may not find anything truly relevant until the second and third page. People have learned how to play Google to the point where more and more Windows Live is starting to give better results (scary!).

    • Maybe the best thing to do is to automatically skip to the 2nd page of results and write off the first page as search engine spam.
      • Re: (Score:3, Insightful)

        I have never seen results that bad. You must be searching for porn, where spam is to be expected.

        I beg your pardon... "Erotica" is a perfectly legitimate subject.

  • they harvested most of their results from Google.
  • Nice work (Score:5, Informative)

    by MysteriousPreacher (702266) on Tuesday March 20 2007, @08:53AM (#18413449) Homepage Journal
    There's actually some pretty decent research here. The site cloning report is a good read.

    http://research.microsoft.com/SearchRanger/Spam_Attack_by_Website_Clones.htm [microsoft.com]

    The cloning of popular blogs has been a scourge for a while now, both for manipulating search engines and good old fashioned advertising - using someone else's content to draw visitors in
    • Nice work (Score:3, Interesting)

      Thanks for an informative post. Beats the typical whiny M$ iz S4T4|\| crap.

      Google does keep up, but quietly. Anecdotally, last week I was searching for a certain spec ARM9 dev board (the VULCAN-Lite) with USD also as a search term, and all kinds of fake keyword sites and Eastern Bloc bride services were in the top 20 results.

      I sent Google feedback with my search terms (VULCAN-Lite +USD), explained what spam was popping up, and as I write this comment a few days later-- the Google search comes back clean (emp
      • Re: (Score:2, Interesting)

        You are 100% correct that Google does help clean up its searches. I do about 100 web searches a day to learn stuff, every time I come across spammy results I send Google a note. I think it's working, because the next week when I want to learn more on a topic it's much improved
  • by physicsboy500 (645835) on Tuesday March 20 2007, @08:55AM (#18413465)

    It's coming from inside the building!!!

  • PageRank is designed to be resistant to exactly this sort of attack. The amount of Google karma you get is proportional to the karma of the pages that link to you. Creating lots of pages with no karma that link to you therefore shouldn't do you any good at all. Why do they bother?

    Theories:

    (1) There's a subtle way that it helps I haven't spotted yet, perhaps to do with non-PageRank elements of Google's search ordering

    (2) This is all done by a very few companies because they are the few that don't understa
    • by jandrese (485) <kensama@vt.edu> on Tuesday March 20 2007, @09:02AM (#18413581) Homepage Journal
      It works because you don't realize the size of this thing. They're talking about millions of fake pages here, lots of them pointing at other fake pages to raise their pagerank so they can in turn point at yet more pages. You would think Google would have someone seeking these kinds of sites out and applying a discount on their domain though (although when that happens the spammers just move on anyway).
      • Re: (Score:3, Interesting)

        Er, that sounds like the old saw "we lose a penny on each one sold, but we make it up in volume".

        If there's only so much karma going into your pages, there's only so much karma they have to give, no matter how huge it is. A trillion pages pointing at my page won't increase its karma, if those trillion have no karma to give.
        • Re: (Score:3, Insightful)

          Presumably some of these trillion pages have a karma greater than or equal to epsilon.

          The scummiest part of it all is that some of the pages in question will be on domains that someone let expire and someone else immediately snatched up. They get their PageRank from the sites that linked to the formerly legitimate domain. And if that was your domain name, and you only let it expire accidentally, well, sucks to be you. :(

    • Every page has to start with some small, intrinsic amount of karma, otherwise there would be none to pass around. By creating enough bogus pages, you can aggregate some amount of link karma to bestow on the site of your choosing. In principle, I guess this would devalue everyone's PageRank too (kind of like printing money), but for a while it could be profitable.

      The second hole is the popularity of websites with user-generated content. Lots of highly ranked websites (like /. in fact) allow anyone, or

      • Every page has to start with some small, intrinsic amount of karma, otherwise there would be none to pass around.

        There has to be a "root set", but that root set doesn't have to consist of all pages. There's some evidence that it includes all top-level pages, because the Scientologists experimented with creating zillions of top-level domains to increase their Google ranking. But ordinary pages, as I understand it, have no intrinsic karma at all.

        Yeah, blog SEO spam is a great evil irritant. I do understand
    • Creating lots of pages with no karma that link to you therefore shouldn't do you any good at all

      That's not how it works. You assume it's a zero-sum game, but it's not. Every page gets some weight even if no one links to it. It's small, but it's positive. When one page links to another, the weight of the source page is reduced less than the target page gains. So, here is the business plan:
      1. Make a lot of unique pages (G in the PR calculation joins identical or nearly identical pages)
      2. Crosslink them in a n
      • Every page gets some weight even if no one links to it. It's small, but it's positive.

        That's not the impression I'm under - I thought that most pages were not part of Google's "root set". See my reply here:

        http://slashdot.org/comments.pl?sid=227331&threshold=1&commentsort=3&mode=thread&pid=18413697#18413787 [slashdot.org]
        • That's not the impression I'm under - I thought that most pages were not part of Google's "root set"

          I understand that you have such an impression, but that's a wrong impression. Every page gets a non-zero weight by default. If you think about it you will see that your scheme just would not work: emerging subjects/sites would stay with zero PR for a long long time until links to them propagate all the way to the "roots".
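The disagreement above can at least be checked against the published PageRank recurrence, PR(p) = (1 - d)/N + d * sum(PR(q)/L(q)) over the pages q that link to p. The sketch below is a plain power iteration of that textbook formula, not Google's production ranking, which the thread itself notes is secret. The page names, the damping factor 0.85, and the 50-page farm are illustrative assumptions.

```python
# Textbook PageRank by power iteration (an illustration, not Google's
# actual system). Each page receives a "teleport" share of (1 - d) / N,
# so a page with no inbound links still has a small positive score.

def pagerank(links, damping=0.85, iterations=100):
    """links maps each page to the list of pages it links to.

    Pages without outlinks spread their score evenly over all pages.
    Returns a dict of scores that sum to 1.
    """
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    pages = sorted(pages)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        dangling = sum(rank[p] for p in pages if not links.get(p))
        new = {p: (1.0 - damping) / n + damping * dangling / n for p in pages}
        for p in pages:
            outs = links.get(p) or []
            for q in outs:
                new[q] += damping * rank[p] / len(outs)
        rank = new
    return rank


# Hypothetical web: three pages in a ring, plus a target nobody links to.
web = {"a": ["b"], "b": ["c"], "c": ["a"], "target": ["a"]}
# Hypothetical link farm: 50 pages with no inbound links, all linking to target.
farm = {f"farm{i}": ["target"] for i in range(50)}

plain = pagerank(web)
farmed = pagerank({**web, **farm})
print(f"target share without farm: {plain['target']:.4f}")
print(f"target share with farm:    {farmed['target']:.4f}")
```

Under this formula the farm pages have nothing but their teleport share to pass on, yet fifty of them together raise the target's share. Whether a real search engine grants that baseline to every page, or only to some root set, is exactly the point the commenters dispute.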
          • Since the answer is a closely guarded secret within Google, it's always fun to be contradicted by someone speaking in authoritative tone of voice who knows as little about this as I do :-)

            You're mistaken about your argument against, in any case; PageRank itself is public information, so I can tell you that it does not have the property you assign to it. There's a delay between a link being made and Google spidering and discovering it, but the eigenvector calculation at the heart of PageRank will propagate
            • who knows as little about this as I do

              How do you know that?

              I can tell you that it does not have the property you assign to it

              The delay I mentioned is due to links being made, not links being discovered. Think about some small community of scientists making an almost closed cluster of sites about their niche research subject.

              • Oooh, hints of dark and secret knowledge! Those are always very impressive.

                The delay I mentioned is due to links being made, not links being discovered. Think about some small community of scientists making an almost closed cluster of sites about their niche research subject.

                There is simply no way for Google to know that those pages are any good until people start linking to them. Fortunately it doesn't take long - for example, the scientists will get karma from the links from their institution front page
    • our site is actually working with one of these companies (on the receiving end of the paycheck, though). they want to put "ads" on our site that link to other sites. they don't care at all what the ads look like or where they are on the page, but just that there's a link to another site. and the link has to be search-indexable (no javascript). all they care about is boosting the rank of their clients, not the number of clicks.
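The "no javascript" requirement makes sense if you picture a crawler that reads the raw HTML without executing scripts. The sketch below is one way to illustrate that, using Python's standard `html.parser`. It is an assumption about how a simple crawler might behave, not a description of any search engine's spider, and the `example.com` URLs are placeholders.

```python
from html.parser import HTMLParser


class StaticLinkExtractor(HTMLParser):
    """Collects href values from plain <a> tags in the raw markup.

    Links that only exist after a script runs are never seen, because
    the parser treats the contents of <script> as opaque text.
    """

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        # Skip anchors whose "link" is really a script call.
        if href and not href.lower().startswith("javascript:"):
            self.links.append(href)


def extract_static_links(html):
    parser = StaticLinkExtractor()
    parser.feed(html)
    parser.close()
    return parser.links


# Hypothetical page with one static link and two script-dependent ones.
SAMPLE = r"""
<p><a href="http://example.com/plain">static link</a></p>
<p><a href="javascript:go('http://example.com/scripted')">scripted link</a></p>
<script>document.write('<a href="http://example.com/written">x<\/a>');</script>
"""

print(extract_static_links(SAMPLE))
```

Only the first anchor is reported, which is why a link buyer who cares about ranking rather than clicks would insist on static markup.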
  • And? (Score:3, Interesting)

    by jafiwam (310805) on Tuesday March 20 2007, @08:58AM (#18413523) Homepage Journal
    Ok. Forgive me if MS just discovering this makes me think they just entered 2002. That crap is _not_ new folks.

    On the other hand, what idiot spouts off about two hosting companies being responsible without naming them? Seriously. This isn't Fark, you can't get kicked off for calling some asshole out.
    • Re:And? (Score:4, Insightful)

      by Sirch (82595) on Tuesday March 20 2007, @09:05AM (#18413627) Homepage
      ... but you can get sued for libel if you're wrong.
    • but the best bit: Phillip Rosenthal, chief technology officer of one of the companies, ISPrime, an Internet services company based in New York, said the activity had been traced to a single customer and violated the company's acceptable-use policy. He said the company's relationship with the customer, whom he would not identify, had been severed

      so, one down, one to go. It's still a shame the offending company was not named, but I imagine it doesn't exist anymore, wound up and is now reborn as a differently n
  • by sconeu (64226) on Tuesday March 20 2007, @09:00AM (#18413553) Homepage Journal
    Microsoft researchers have discovered tens of thousands of junk Web pages, created only to lure search-engine users to advertisements.

    In other news, Microsoft researchers have discovered that the sky is blue and that water is wet.
    • Microsoft researchers have discovered that the sky is blue

      I live in London, you insensitive clod!
  • by Thaelon (250687) on Tuesday March 20 2007, @09:11AM (#18413685)
    Obligatory Bill Hicks...

    If you work in advertising, kill yourself.
    --Bill Hicks - Another Dead Hero
  • Bad neighborhoods (Score:3, Interesting)

    by condour75 (452029) on Tuesday March 20 2007, @09:13AM (#18413699) Homepage
    Google is already developing methods to deal with clusters of these fakes. Usually they're scraping web directories and databases. I've seen a lot of this lately, searching for dental hygiene schools for my girlfriend. Usually they're linking to each other, even if they're huge clusters. Legit SEO guys (yes, there are consultants who actually try to get your site linked legitimately and by hand) call these areas "bad neighborhoods". Whatever Google's doing, though, clearly isn't enough, and a lot of these guys are using adsense to make money. Martinibuster's [martinibuster.net] got a few good links on the subject.
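One simple way to picture a "bad neighborhood" is a group of pages whose inbound links come almost entirely from the group itself. The sketch below computes that fraction for a candidate cluster. It is an illustrative heuristic only: the comment does not say how Google detects these clusters, and the function name, the page names, and the toy graphs are all hypothetical.

```python
def inlink_insularity(links, cluster):
    """Fraction of links pointing into `cluster` that start inside it.

    links maps each page to the list of pages it links to.
    A value near 1.0 means the cluster mostly vouches for itself.
    """
    cluster = set(cluster)
    internal = external = 0
    for src, targets in links.items():
        for dst in targets:
            if dst not in cluster:
                continue
            if src in cluster:
                internal += 1
            else:
                external += 1
    total = internal + external
    return internal / total if total else 0.0


# Hypothetical link farm: three pages fully interlinked, one outside inlink.
farm_graph = {
    "f1": ["f2", "f3"],
    "f2": ["f1", "f3"],
    "f3": ["f1", "f2"],
    "blog": ["f1"],
}
# Hypothetical ordinary site: two pages linking to each other,
# with three independent pages linking in.
site_graph = {
    "a": ["b"],
    "b": ["a"],
    "x": ["a"],
    "y": ["a"],
    "z": ["b"],
}

print(inlink_insularity(farm_graph, ["f1", "f2", "f3"]))
print(inlink_insularity(site_graph, ["a", "b"]))
```

A real detector would need far more than this one ratio, since plenty of legitimate sites are heavily interlinked internally, but the farm scores noticeably higher than the ordinary site in this toy case.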
  • A few years ago... (Score:4, Interesting)

    by AliasTheRoot (171859) on Tuesday March 20 2007, @09:22AM (#18413815)
    ...a friend of mine figured he could get great Google listings by autogenerating trashy link farm pages. He had the top 1000 porn search terms all cunningly misspelled, e.g. "Brittney Spares", and hundreds of thousands of static pages all linking into each other across a bunch of subdomains. For about a year we reckoned he had some stupid percentage of all porn listings in Google, and in that time he made around $1,000,000 from banner clicks. Eventually Google caught on to it and blocked his sites en masse, but he'd made enough to buy some property by then.
  • I just finished reading how much the Strider group at M$ has accomplished and how, and it is rather amazing. They lifted the covers off typo-domain squatters exploiting Google's programs, built a progressive honeypot setup that detects which levels of XP are attackable by different malware attacks (up to and including reporting zero-day exploits if the latest "patch hardened" machine is exploited), and now this project. Even better, they are publishing the "how", and any OS (AKA Mac OS or any of the Linux distros) could benefit by using similar approaches on even more machines.

    So -- from an admitted open source advocate -- here's a rare kudo to the giant in Redmond for keeping a "white hat" and his group -- and letting them work.
    • I agree. Whatever else you say about MS, and there's lots to say, they seem to have given their security researchers a lot of freedom and because of their size and power have the resources and brainpower to tackle these problems in pretty cool ways. The sad thing, as with much of what comes out from MS, is that you see these really smart, awesome people doing great work, but when it comes to taking their own advice, you can see quite directly the way that the vast bureaucracy and Microsoft's avaricious co
  • I read the research paper a couple days ago after reading about it in the NY Times. Seeing how this research is Microsoft funded and implicates Google, claiming their syndicators are in cooperation with the spammers, one has to question researcher bias. I'd like to see a peer-reviewed and independently verified article before accepting these outrageous claims. Note that the researchers focused on a few keywords and strictly limited the scope of their efforts. This doesn't mean the findings are untrue, it
  • Firefox has an extension called customizegoogle [customizegoogle.com]. It adds a 'filter' option to a google results page. Allows one to filter out the sneaky pages that hi-jack your search query.
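The general idea of filtering a results page can be sketched as dropping any result whose host falls under a blocked domain. This is not how CustomizeGoogle is implemented (the comment gives no detail on that); it is a minimal stand-alone illustration, and the function names and the `.example` domains are hypothetical.

```python
from urllib.parse import urlsplit


def host_matches(host, blocked):
    """True if host equals a blocked domain or is a subdomain of one."""
    host = host.lower().rstrip(".")
    return any(host == domain or host.endswith("." + domain) for domain in blocked)


def filter_results(urls, blocked_domains):
    """Return the URLs whose host is not under any blocked domain."""
    blocked = {domain.lower() for domain in blocked_domains}
    kept = []
    for url in urls:
        host = urlsplit(url).hostname or ""
        if not host_matches(host, blocked):
            kept.append(url)
    return kept


# Hypothetical result list and blocklist.
results = [
    "http://spam.example/a",
    "http://sub.spam.example/b",
    "http://notspam.example/c",
    "https://example.org/",
]
print(filter_results(results, ["spam.example"]))
```

Matching on a leading dot keeps `notspam.example` from being caught by a block on `spam.example`, a mistake a plain suffix check would make.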