Microsoft Tracks Down Mass Fake Web Pages
An anonymous reader writes "According to an article in the New York Times, Microsoft researchers have discovered tens of thousands of junk Web pages, created only to lure search-engine users to advertisements. While most of us have run across them from time to time, the company's researchers have found that the pages are deliberately generated in vast numbers by a small group of shadowy operators. By following the money trail, Microsoft researchers were able to track the flow from big-name advertisers to search engine spammers. Many use Google's blogspot.com to set up spam doorway pages. 'The practice has proved to be a vexing problem for the major search companies, which struggle to prevent both spammers and companies specializing in improving legitimate clients' Web traffic -- a field known as search-engine optimization -- from undermining their page-ranking systems. Surprisingly, the researchers noted that the vast bulk of the junk listings was created from just two Web hosting companies and that as many as 68 percent of the advertisements sampled were placed by just three advertising syndicators.' The report is available at the Microsoft Strider Search Ranger project page."
The easy way (Score:5, Interesting)
Re:The easy way (Score:5, Funny)
Re:The easiest way (Score:4, Funny)
Re: (Score:1)
Bring out the torches and pitchforks! (Score:1, Interesting)
another ripoff (Score:4, Funny)
Re:another ripoff (Score:5, Funny)
Re: (Score:3, Funny)
Re: (Score:2, Funny)
You had Gopher as a kid!? Man, we were stuck with local BBS Sam & Frodo's ASCII Express Second Breakfast project.
Re: (Score:2, Funny)
Re: (Score:1)
Re: (Score:2)
Great (Score:1)
Re: (Score:1)
Why? (Score:5, Interesting)
Re:Why? (Score:5, Insightful)
Re: (Score:1, Insightful)
Re:Why? (Score:5, Informative)
Once you've optimised for your keywords in "natural search", i.e. the *free* results, your investment keeps paying (you need to maintain positions of course, but this is lower cost, especially if you're in a niche), whereas in paid advertising you have to keep giving money to Google and, in competitive industries, your cost per click will be subject to significant inflation...
Re: (Score:2)
Re: (Score:1)
People trust organic search results more, so even if they were more expensive to buy than paid-for adverts, you'd get more bang for your buck.
People who click on adverts are less likely to 'convert' (buy an item, sign up for a newsletter etc.) than people who click on a natural search result.
Spam sucks bad, but if you can get into the top 20 of Google's natural search, you have hit gold.
monk.e.boy
Re: (Score:2)
Re:Why? (Score:4, Insightful)
"time to time"? (Score:5, Insightful)
Time to time? For me it seems like more than 50% when I scan the search results. Maybe less, maybe more, but certainly more than "time to time". For many of my searches, I may not find anything truly relevant until the second or third page. People have learned how to play Google to the point where, more and more, Windows Live is starting to give better results (scary!).
Re: (Score:2)
Re: (Score:1)
I hate those spammy web sites (the top 4 other sites) because they keep people away from my site and a few others that have vacation rentals here in Miami.
On
Re:"time to time"? (Score:5, Funny)
Re: (Score:2)
Re: (Score:1)
True enough. I recently switched back to Yahoo! search after about five years of nothing but Google. I don't know if the results are any better, but it sure is a good change of pace.
Re: (Score:2)
Re: (Score:3, Insightful)
I beg your pardon... "Erotica" is a perfectly legitimate subject.
Re: (Score:1)
Ironically (Score:2)
Re: (Score:2)
Nice work (Score:5, Informative)
http://research.microsoft.com/SearchRanger/Spam_A
The cloning of popular blogs has been a scourge for a while now, both for manipulating search engines and good old-fashioned advertising - using someone else's content to draw visitors in
Nice work (Score:3, Interesting)
Google does keep up, but quietly. Anecdotally, last week I was searching for a certain spec ARM9 dev board (the VULCAN-Lite) with USD also as a search term, and all kinds of fake keyword sites and Eastern Bloc bride services were in the top 20 results.
I sent Google feedback with my search terms (VULCAN-Lite +USD), explained what spam was popping up, and as I write this comment a few days later-- the Google search comes back clean (emp
Re: (Score:2, Interesting)
Re: (Score:2)
Hmm, I always had the impression that they use the feedback to seed a database of pages to test their spam-removal algorithms on. They claim that they "prefer automated solutions rather than manual removal".
One of my big annoyances is sites that are spidered by Google but require mere mortal visitors to purchase a subscription. For example, searches on certain technical subjects often return pages with IEEE publications - purchase this art
Re: (Score:1)
You have to understand that his servers were constantly being spidered and his bandwidth costs were way too high. Killing all spiders was his first step; then he made special changes.
Onepoint.
Re: (Score:2)
That's all fine with me, but then block Googlebot as well. Allowing Googlebot and not allowing 80% of the world population is called cloaking in my dictionary and Google should have removed the whole site from the index for that reason.
Re: (Score:1)
>>Allowing Google bot and not allowing 80% of the world population is called cloaking in my dictionary
No, if you read up on all the issues, most people could see his site; very few could not, because scrapers were coming from those IPs. and 80%
Anyway, here is the viewpoint from Brett: http://blog.searchenginewatch.com/blog/051128-161606 [searchenginewatch.com]
Brett Tabke is a liar (Score:2)
I have read the stories about "we have a long list of blocked IP addresses and all the horrible bots are using my bandwidth". Brett Tabke is a liar. I have tried accessing his site from many different (static!) IPs in different /16 blocks and they were all blocked. Tabke's business model is to have an ad-free website and charge $180 per year for access to the site. He wants to att
Re: (Score:2)
As I said: fine with me if you do that, but the search engines should not be indexing you. And you should not be lying in public about the real reasons for your policy.
Re: (Score:2)
Why? If I place my content to view, the engines that have clean IPs will clear my systems but those that are from other locations won't. So Google in Asia won't see me but Google USA will (and that's been tested already with Google, but not with Yahoo).
If you choose to use Google USA to search but not your local brand, it's not my problem; Google makes it easy.
currently most of Asia does not see certain sites that I manage ( my personal sites are wor
Re: (Score:2)
Provide me a reference or example that country-blocking websites will not show up in nationalized versions. Google.nl, google.fr, google.sv, etc. only give a slightly different ordering of the search results (slightly preferring certain TLDs and pages written in the local language).
Re: (Score:2)
Why not discard hidden links? (Score:2)
Of course, the scammers would just try some other tactic -- perhaps hiding links in Z-layers behind opaque graphic
Re: (Score:2)
I bet the spammers would just start using really obfuscated javascript to set the style = display:none. So, you'd be starting an arms race where search spiders would have to start processing javascript and then the spammers would just come up with something else (maybe set the z-index low so that the links can't be seen). It just doesn't seem like it's worth the effort.
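To make the "discard hidden links" idea concrete, here is a minimal sketch of what the naive version might look like on the spider side. It is an illustration only, not how any real search engine works: the `VisibleLinkParser` class and `visible_links` helper are hypothetical names, the markup is assumed to be well-formed, and only inline `style="display:none"` is detected.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class VisibleLinkParser(HTMLParser):
    """Collects hrefs of <a> tags that are not inside an inline display:none element."""

    def __init__(self):
        super().__init__()
        self.hidden_depth = 0   # > 0 while inside a hidden subtree
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        attrs = dict(attrs)
        style = (attrs.get("style") or "").replace(" ", "").lower()
        if self.hidden_depth or "display:none" in style:
            # Either this element starts a hidden subtree or we are already in one.
            self.hidden_depth += 1
            return
        if tag == "a" and attrs.get("href"):
            self.links.append(attrs["href"])

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.hidden_depth:
            self.hidden_depth -= 1


def visible_links(html):
    """Hypothetical helper: return hrefs that are not hidden by an inline style."""
    parser = VisibleLinkParser()
    parser.feed(html)
    return parser.links


page = (
    '<div><a href="/real">ok</a>'
    '<div style="display: none"><a href="/spam1">x</a>'
    '<span><a href="/spam2">y</a></span></div>'
    '<a href="/also-real">ok</a></div>'
)
found = visible_links(page)
```

Note what this sketch cannot see: styles set from an external stylesheet, styles set by script, or links buried under a low z-index. That gap is exactly the arms race described above.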
I use display:none all the time by the way. The left column of slashdot has th
Re: (Score:2)
Then Microsoft realized... (Score:5, Funny)
It's coming from inside the building!!!
Re: (Score:1)
How does this help them? (Score:2)
Theories:
(1) There's a subtle way that it helps I haven't spotted yet, perhaps to do with non-PageRank elements of Google's search ordering
(2) This is all done by a very few companies because they are the few that don't understa
Re:How does this help them? (Score:4, Insightful)
Re: (Score:3, Interesting)
If there's only so much karma going into your pages, there's only so much karma they have to give, no matter how huge it is. A trillion pages pointing at my page won't increase its karma, if those trillion have no karma to give.
Re: (Score:3, Insightful)
The scummiest part of it all is that some of the pages in question will be on domains that someone let expire and someone else immediately snatched up. They get their PageRank from the sites that linked to the formerly legitimate domain. And if that was your domain name, and you only let it expire accidentally, well, sucks to be you. :(
Re: (Score:2)
Re: (Score:2)
Re: (Score:2)
Re: (Score:2)
Re: (Score:2)
Every page has to start with some small, intrinsic amount of karma, otherwise there would be none to pass around. By creating enough bogus pages, you can aggregate some amount of link karma to bestow on the site of your choosing. In principle, I guess this would devalue everyone's PageRank too (kind of like printing money), but for a while it could be profitable.
The second hole is the popularity of websites with user-generated content. Lots of highly ranked websites (like /. in fact) allow anyone, or
Re: (Score:2)
There has to be a "root set", but that root set doesn't have to consist of all pages. There's some evidence that it includes all top-level pages, because the Scientologists experimented with creating zillions of top-level domains to increase their Google ranking. But ordinary pages, as I understand it, have no intrinsic karma at all.
Yeah, blog SEO spam is a great evil irritant. I do understand
Re: (Score:2)
That's not how it works. You assume it's a zero-sum game, but it's not. Every page gets some weight even if no one links to it. It's small, but it's positive. When one page links to another, the weight of the source page is reduced less than the target page gains. So, here is the business plan:
1. Make a lot of unique pages (G in the PR calculation joins identical or nearly identical pages)
2. Crosslink them in a n
Re: (Score:2)
That's not the impression I'm under - I thought that most pages were not part of Google's "root set". See my reply here:
http://slashdot.org/comments.pl?sid=227331&threshold=1&commentsort=3&mode=thread&pid=18413697#18413787 [slashdot.org]
Re: (Score:2)
I understand that you have such an impression, but that's a wrong impression. Every page gets a non-zero weight by default. If you think about it you will see that your scheme just would not work: emerging subjects/sites would stay with zero PR for a long long time until links to them propagate all the way to the "roots".
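For anyone who wants to poke at this argument, here is a minimal sketch of the PageRank iteration as described in the original published paper, with a damping factor. It is not Google's production ranking, which is not public; the function name, the toy graphs, and the 0.85 damping value are assumptions for illustration. In this model every page receives a baseline share of (1 - damping) / N whether or not anything links to it, which is the "non-zero weight by default" being argued about.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links maps each page to the list of pages it links to."""
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Baseline share every page gets regardless of inbound links.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its damped rank evenly among its outbound links.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Dangling page: spread its damped rank evenly over all pages.
                share = damping * rank[page] / n
                for p in pages:
                    new_rank[p] += share
        rank = new_rank
    return rank


# Toy web: "shop" has no inbound links at all.
base = {"a": ["b"], "b": ["a"], "shop": ["a"]}

# Same web plus 50 bogus pages that nobody links to, each linking to "shop".
farm = dict(base)
for i in range(50):
    farm[f"farm{i}"] = ["shop"]

before = pagerank(base)["shop"]
after = pagerank(farm)["shop"]
```

In this toy model `after` comes out larger than `before`, even though none of the farm pages has a single inbound link. Total rank still sums to 1, so the gain comes out of everyone else's share.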
Re: (Score:2)
You're mistaken about your argument against, in any case; PageRank itself is public information, so I can tell you that it does not have the property you assign to it. There's a delay between a link being made and Google spidering and discovering it, but the eigenvector calculation at the heart of PageRank will propagate
Re: (Score:2)
How do you know that?
I can tell you that it does not have the property you assign to it
The delay I mentioned is due to links being made, not links being discovered. Think about some small community of scientists making an almost closed cluster of sites about their niche research subject.
Re: (Score:2)
The delay I mentioned is due to links being made, not links being discovered. Think about some small community of scientists making an almost closed cluster of sites about their niche research subject.
There is simply no way for Google to know that those pages are any good until people start linking to them. Fortunately it doesn't take long - for example, the scientists will get karma from the links from their institution front page
Re: (Score:2)
It's only dark and secret for a newbie
There is simply no way for Google to know that those pages are any good until people start linking to them.
Exactly, except turned upside down. It's "there is no way for Google to know that those pages are spam", so they get positive weight until proven otherwise.
from the links from their institution front pages
A few links will make the cluster discoverable by crawlers but won't make a difference for PR. It's the cross links withi
Re: (Score:2)
Re: (Score:2)
Re: (Score:1)
And? (Score:3, Interesting)
On the other hand, what idiot spouts off about two hosting companies being responsible without naming them? Seriously. This isn't Fark, you can't get kicked off for calling some asshole out.
Re:And? (Score:4, Insightful)
Re: (Score:2)
so, one down, one to go. It's still a shame the offending company was not named, but I imagine it doesn't exist anymore, wound up and is now reborn as a differently n
And in other news... (Score:4, Funny)
In other news, Microsoft researchers have discovered that the sky is blue and that water is wet.
Re: (Score:3, Funny)
I live in London, you insensitive clod!
Re: (Score:2)
Discovering that the sky is blue is quite a discovery for a company based near Seattle. They should have known about water though, given all the rain they get.
Obligatory Bill Hicks (Score:5, Funny)
Re: (Score:1)
Bad neighborhoods (Score:3, Interesting)
A few years ago... (Score:4, Interesting)
Microsoftie wearing a white hat? (Score:5, Insightful)
So -- from an admitted open source advocate -- here's a rare kudo to the giant in Redmond for keeping a "white hat" and his group -- and letting them work.
Re: (Score:3, Interesting)
Re: (Score:2)
is this research reliable (Score:2)
Firefox is good. (Score:2, Informative)
I often wonder... (Score:1)
What's the point? (Score:1)
This is research? (Score:1, Flamebait)
To play devil's advocate... (Score:1)
Re: (Score:2)
So is Google.
Selah.
wait... (Score:1)
MIA: Marketing dept. (Score:1)
Well, here's 100,000 spam domains (Score:1)
How did they search this out? (Score:2)
Seriously, I have had phishing email for some of these 80.77.x.y websites recently as well. A "Good on ya!" to Microsoft [microsoft.com] & UC Davis [ucdavis.edu]! Root the bastards out and stomp 'em!
Wow (Score:1)
Timing (Score:2, Insightful)
I am very glad I read the detailed report from end to end. We seek value in advertising, not spam, but it is very difficult for well meaning companies to figure out which is which. You shouldn't have to be a rocket scientist to differentiate the deceptive tactics/companies from the valid ones. I guess most forms
What is web spam? Ads from phony businesses. (Score:2)
This is good work by Microsoft. They've tracked down a few big-time web spammers, all the way up the food chain. But there are more.
We've been working on the web spam problem, from a different angle. Our starting point is the legal requirement that a business cannot be anonymous. Every legitimate business must have an identifiable person or corporation behind it. (See CA B&P code sec. 17358 [sitetruth.com], ("disclosure of ... legal name and address information shall appear on ... the first screen displayed ...
One Man's "Junk" Is Another Man's "Diamond" (Score:1)
There is NO SUCH thing as "spamming a Search En
Re: (Score:2)
Re: (Score:1)