The Man Behind Google's Ranking Algorithm
nbauman writes "New York Times interview with Amit Singhal, who is in charge of Google's ranking algorithm. They use 200 "signals" and "classifiers," of which PageRank is only one. "Freshness" defines how many recently changed pages appear in a result. They assumed old pages were better, but when they first introduced Google Finance, the algorithm couldn't find it because it was too new. Some topics are "hot". "When there is a blackout in New York, the first articles appear in 15 minutes; we get queries in two seconds," said Singhal. Classifiers infer information about the type of search, whether it is a product to buy, a place, company or person. One classifier identifies people who aren't famous. Another identifies brand names. A final check encourages "diversity" in the results, for example, a manufacturer's page, a blog review, and a comparison shopping site."
Hrm, and all this time I thought it was... (Score:4, Funny)
Re:Hrm, and all this time I thought it was... (Score:4, Informative)
MoneyRank (Score:1)
For each page in results
    If page.HasAdwords = true And Not page.content = junk
        page.MoneyRank = page.clickthroughrate * page.AdwordsValue
        results.add page
    Else
        Ignore
    End If
Next page
results order by MoneyRank DESC
apple vs Apple (Score:1, Informative)
Well the results for both "apple" and "Apple" are identical for me (apple computer dominated), with the exception of the text in the ads on the right hand side (which are both for apple computers). Maybe they are doing other stuff (Linux users prefer computers over fruit?).
Does anyone see anything different when they search for "apple" versus "Apple"?
Re:apple vs Apple (Score:5, Informative)
Re: (Score:2, Funny)
Re: (Score:2)
Re: (Score:1)
Re: (Score:1)
I'll buy you some tampons and show you how to use them. Meet me after school. I'll be in the parking lot with a grey van.
^_^
Re: (Score:1)
Amit Singhal ... (Score:5, Informative)
Re: (Score:3, Funny)
Re: (Score:1, Troll)
Re: (Score:1)
...only one? (Score:5, Funny)
How many did they expect PageRank to be? In the words of someone immortal, "There can be only one."
Re: (Score:2)
Re: (Score:3, Interesting)
North America Centric (Score:1)
I'll search for a product and the first page of results will all be *.co.uk results.
Not much use to that. Makes me think on how to rephrase the search, which is good.
Re: (Score:2)
Re: (Score:2, Informative)
Re: (Score:1)
Re: (Score:3, Funny)
digestives london
digestives london -inurl:.uk
Feature Request (Score:5, Insightful)
I would love a switch, or even a subscription, that would allow me to filter these usually useless types of pages and instead show me pages with real content.
Re: (Score:3, Funny)
This item you're searching for hasn't been in inventory for 6 years since nobody makes it anymore, would you like to read a review ? : be the first to write one !
Yay.
Re: (Score:2)
I find that most queries give me what I want right away (eg paris hilton), and those that don't (eg lindsay lohan) do give me what I want after narrowing down the sites returned (eg lindsay lohan drunk car -herbie -vomit -intitle:"fan site").
Re: (Score:2)
Re:Feature Request (Score:5, Informative)
http://www.givemebackmygoogle.com/ [givemebackmygoogle.com]
It just negates a whole lot of affiliate sites.
This is part of the query it feeds to Google.
-inurl:(kelkoo|bizrate|pixmania|dealtime|pricerun
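A clause like that is easy to put together from a plain list of names. Rough sketch only: `build_exclusion_query` is a made-up helper, not anything from that site, and the five names are just the ones visible in the fragment above (the real list is longer).

```python
def build_exclusion_query(terms, blocked_sites):
    """Append one -inurl:(a|b|...) clause to the user's search terms."""
    if not blocked_sites:
        return terms
    joined = "|".join(blocked_sites)
    return f"{terms} -inurl:({joined})"

# Only the names visible in the fragment above; the rest of the list is unknown.
blocked = ["kelkoo", "bizrate", "pixmania", "dealtime", "pricerun"]
query = build_exclusion_query("digital camera", blocked)
print(query)
```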
Re: (Score:1)
This (http://www.myserp.com/) probably does it better.
It does this:
Pretty cool huh?
monk.
Re: (Score:1)
Re:Feature Request (Score:5, Informative)
You can filter out Wikipedia mirrors (using that extension) with the list here: http://meta.wikimedia.org/wiki/Mirror_filter [wikimedia.org]
Re: (Score:2)
Ditto for Google News. I'd love to click something and have all the worthless blogs trying to pass for journalism disappear from the results.
Even worse is that Google News gives high rankings to some "news" web sites that merely steal the content of other sites and then re-publish it as their own. I'm not talking about link aggregators like Fark
Re: (Score:2)
So if they have an algorithm to ensure that the results contain a good mix including comparison shopping sites, doesn't that imply that they could technically provide exactly the kind of switch that the parent poster asked for - i.e. to exclude those comp
Re: (Score:2)
Then create your own Google Custom Search Engine [google.com] or use an existing one such as Google Search Excluding Shops [rapla.net], which excludes 700+ hand-picked shopping and spam sites and gives a ranking boost to 160+ websites of IT and other electronics companies.
Re: (Score:1)
Re: (Score:1)
The tinfoil hat
Many other things are goo(gle)d (Score:3, Interesting)
Re: (Score:1)
Re:Many other things are goo(gle)d (Score:4, Insightful)
Re: (Score:2, Interesting)
Re: (Score:1)
Re: (Score:2, Informative)
How does it work (Score:5, Informative)
Google breaks pages into words. Then, for every word it keeps a set which contains all the pages (by hash ID) that contain that word. A set is a data structure with O(1) lookup.
When you search for "linux+kernel" Google just does the set intersection operation on the two sets, which gives the pages that contain both words.
Now a "word" is not just a word. If Google sees that many people use the combination linux+kernel, a new word is created, the linux+kernel word, and it has a set of all the pages that contain it. So when you search for linux+kernel+ppp we find the intersection of the linux+kernel set and the "ppp" set.
So every time you search, you make it better for Google to create new words. And this is part of the power of this search engine. A new search engine will need some time to gather that empirical data.
Of course, there are ranks of sets. For example, for the word "ppp" there are, say, two sets: the pages of high rank that contain the word ppp, and the pages of low rank. When you search for ppp+chap, first you get the intersection of the high rank sets of the two words, etc.
Now page rank has several criteria. Here are some:
well ranked site/domain, linked by well ranked page, document contains relevant words, search term is in the title or url, page rank not lowered by Google employee (level 1), page rank increased, etc.
It is not very difficult actually.
(posting AC for a reason).
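The per-word sets described above amount to a small inverted index. Toy sketch only, not Google's actual code: `build_index`, `search` and the sample pages are made up for illustration. Note that the pages containing every query word are the intersection of the per-word sets.

```python
from collections import defaultdict

def build_index(pages):
    """Map each word to the set of page IDs that contain it."""
    index = defaultdict(set)
    for page_id, text in pages.items():
        for word in text.lower().split():
            index[word].add(page_id)
    return index

def search(index, query):
    """Return the IDs of pages that contain every word of the query."""
    words = query.lower().split()
    if not words:
        return set()
    # Copy the first set so the index itself is never modified.
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result

# Hypothetical pages, keyed by page ID.
pages = {
    1: "linux kernel ppp howto",
    2: "linux desktop themes",
    3: "kernel ppp chap setup",
}
index = build_index(pages)
print(search(index, "linux kernel"))
print(search(index, "kernel ppp"))
```

The "combined word" idea would simply add entries such as "linux kernel" as extra keys to the same index, so a popular pair costs one lookup instead of an intersection.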
Re: (Score:1)
that is cleverly simple actually!
well explained
Thank you!
Now I understand (Score:5, Funny)
Googling Uncommon Characters and Exact Phrases (Score:3, Interesting)
Re:Googling Uncommon Characters and Exact Phrases (Score:4, Informative)
Re:Googling Uncommon Characters and Exact Phrases (Score:4, Informative)
Yes. Try to find information on the web about the language "C+@". It's real, and it was developed at Bell Labs some years ago back in the Plan 9 era, but it's unsearchable.
Re: (Score:1)
Re: (Score:2)
So how does Google know to tailor its results for C, C++, and C#, which all return results specific to the requested language, but not for C+@?
Manually implemented special cases, perhaps. Or Google may not consider the possibility that "@" can be part of a word, which is likely.
Re: (Score:2, Interesting)
Found (Score:1)
Also try calico. (aka)
Re: (Score:2)
True, but I'd hope that at least using quotation marks to search for phrases would also include special characters.
I mean, there can't be any search logic inside quotes anyway; then that would be part of the phrase.
Like "Apples or oranges" won't search for either apples or oranges, but the actual phrase.
Re: (Score:2)
Re: (Score:2, Insightful)
One of the most annoying things about google for me is how it interprets queries with strange characters common to almost all programming languages.
You should try google code search [google.com].
Re: (Score:3, Insightful)
Re: (Score:1)
One search feature (Score:5, Interesting)
This could allow for a better search result when using for example "APPLE NEAR MACINTOSH" or "APPLE NEAR BEATLES"
Ho hum... Times change and not always for the better...
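For what it's worth, a NEAR-style test is not hard to sketch. This is just an illustration of the idea, not how any real engine did it: the `near` function, the `max_gap` of 5 words and the two sample documents are all assumptions.

```python
def near(text, first, second, max_gap=5):
    """True if `first` and `second` occur within `max_gap` words of each other."""
    words = text.lower().split()
    first_pos = [i for i, w in enumerate(words) if w == first.lower()]
    second_pos = [i for i, w in enumerate(words) if w == second.lower()]
    return any(abs(i - j) <= max_gap for i in first_pos for j in second_pos)

# Hypothetical documents.
doc_a = "apple introduced the first macintosh in 1984"
doc_b = "the beatles released it on apple records"

print(near(doc_a, "APPLE", "MACINTOSH"))
print(near(doc_a, "APPLE", "BEATLES"))
print(near(doc_b, "APPLE", "BEATLES"))
```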
Re: (Score:2)
I think NEAR is implied (Score:2)
What I miss from Alta Vista is the ability to use grouping to set precedence, i.e., parentheses. I don't have to do this very often, but when I do, I really miss it. The need generally
Re: (Score:2)
A way to get that (Score:3, Informative)
Single Quotes (Score:1)
Isn't that what the single quote (') construct is for: 'widget offbeat'
Toileat seat (Score:4, Funny)
Google is human too (Score:5, Insightful)
if only... (Score:1)
The most annoying thing about Google's results... (Score:1, Insightful)
Blogs are read only by bloggers and the press, and are of absolutely no interest to normal people (including me). Currently, because of Google's idiotic blog fetish, I have to eliminate 50% of the results just based on URLs, hoping that I won't stumble upon someone's personal ramblings. Blogs became popular only due to Google's absolutely unexplainable love for blog content, and its sticking it into perfectly normal search results, it's like searching in a
Re:The most annoying thing about Google's results. (Score:2)
Considering that you're reading a blog, I think it's pretty fair to say that you're only counting web pages that you think suck as blogs... so of course you don't like the results. Amazingly, no one is willing to tag their blog as "shohat will think this sucks, so please don't search me."
Re: (Score:2, Insightful)
Re: (Score:2)
If blogs didn't exist we'd just have more geocities pages getting lots of links.
Page rank is only a part of the story (Score:2)
Those are the thi
Re:Page rank is only a part of the story (Score:4, Informative)
A classifier is a black box which takes some data as input, and computes one or more scores. The simplest example is a binary classifier, say for spam. You feed some data (eg an email) and you get a score back. If it's a big score say, then the classifier thinks it's spam, and if it's a small score it's not spam. More generally, a classifier could give three scores to represent spam, work, home, and you could pick the best score to get the best choice.
So you should really think of a classifier as a little program that does one thing really well, and only one thing. For example, you can build a small classifier that looks if the input text is english or russian. That's all it does.
Now imagine you have 100 engineers, and each engineer has a specialty, and each builds a really small classifier to do one thing well. The logic of each classifier is black boxed, so from the outside it's just a component, kind of like a lego brick. What happens when you feed the output of one lego brick to the input of another lego brick?
Say you have three classifiers: english spam recognizer, russian spam recognizer, english/russian identifier. You build a harness which uses the english/russian identifier first, and then depending on the output your program connects the english spam recognizer or the russian spam recognizer.
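A toy version of that harness, to make it concrete. Everything here is invented for illustration (the word lists, the 0.3 threshold, the function names); real classifiers would be trained, not hand-written.

```python
def language_identifier(text):
    """Toy english/russian identifier: looks for Cyrillic letters."""
    has_cyrillic = any("\u0400" <= ch <= "\u04ff" for ch in text)
    return "russian" if has_cyrillic else "english"

def word_list_score(text, spam_words):
    """Fraction of the words in `text` that appear in `spam_words`."""
    words = text.lower().split()
    return sum(w in spam_words for w in words) / max(len(words), 1)

def english_spam_score(text):
    return word_list_score(text, {"free", "winner", "viagra"})

def russian_spam_score(text):
    return word_list_score(text, {"бесплатно", "выигрыш"})

def is_spam(text, threshold=0.3):
    """Harness: identify the language first, then run that language's recognizer."""
    recognizers = {"english": english_spam_score, "russian": russian_spam_score}
    score = recognizers[language_identifier(text)](text)
    return score >= threshold

print(is_spam("free viagra winner today"))
print(is_spam("meeting notes for monday"))
print(is_spam("бесплатно выигрыш сегодня"))
```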
Now imagine a huge network with some classifiers in parallel and some classifiers in series. At the top there's the query words, and they travel through the network. One of the classifiers might trigger word completion (ie bio -> biography as in the article), another might toggle the "fresh" flag, or the "wikipedia" flag etc. In the end, your output is a complicated query string which goes looking for the web pages.
The key idea now is to tweak the choice thresholds. To do that, there's no theory. You have to have a set of standard queries with a list of the outputs the algorithm must show. Let's say you have 10,000 of these queries. You run each query through the machine, and you get a yes/no answer for each one, and you try to modify the weights so that you get a good number of correct queries.
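In the simplest case (one classifier, one threshold) that tuning loop looks something like this. The scores and expected answers below are made-up test data, and a brute-force scan over candidate thresholds is just the most naive way to do it.

```python
def count_correct(threshold, scored_cases):
    """Count test cases where (score >= threshold) matches the expected answer."""
    return sum((score >= threshold) == expected for score, expected in scored_cases)

def best_threshold(scored_cases, candidates):
    """Try each candidate threshold and keep the one with the most correct answers."""
    return max(candidates, key=lambda t: count_correct(t, scored_cases))

# Hypothetical (classifier score, expected yes/no) pairs from a standard test set.
test_set = [(0.9, True), (0.7, True), (0.6, False),
            (0.4, True), (0.2, False), (0.1, False)]
candidates = [i / 10 for i in range(11)]

threshold = best_threshold(test_set, candidates)
print(threshold, count_correct(threshold, test_set), "of", len(test_set))
```

With many classifiers wired together you would search over all the thresholds and weights at once, which is where the mathematical tricks come in.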
Of course you want to speed things up as much as possible. You can use mathematical tricks to find the best weights, and you don't need to go get the actual pages: if your output is a query string, you just compare it with the expected query string, etc. But that would depend on your classifiers, the scheme used to evaluate the test results, and how good your engineers are.
The point is that there's no magic ingredient, it's all ad-hoc. Edison tried hundreds of different materials for the filament in his lightbulb. Google is doing the same thing according to the article. What matters for this kind of approach is a huge dataset (ie bigger than any competitors') and a large number of engineers (not just to build enough components, but to deprive its competitors of manpower). The exact details of the classifier components aren't too important if you have a comprehensive way of combining them.
I'm familiar with all this stuff (Score:3, Interesting)
Re:I'm familiar with all this stuff (Score:4, Interesting)
When you say that your system is limited by human involvement, I presume you mean that implementing new features can have serious impact on the overall design (and therefore on testing procedures)? Feel free to not answer if you can't.
One thing I found interesting in the article is that Google's system sounds like it scales well. It reminded me of antispam architectures like Brightmail's (if memory serves), which have large numbers of simple heuristics which are chosen by an evolutionary algorithm. The point is that new heuristics can be added trivially without changing the architecture. I think their system used 10,000 when they described it a few years ago at an MIT spam conference. Adjustments were done nightly by monitoring spam honeypots.
I'd love to see better competition in the search engine space. I hope you succeed at improving your tech.
Page Rank is a HW assignment (Score:2)
Break through! (Score:1)
>>A search-engine tweak gave more weight to pages with phrases like "French Revolution" rather than pages that simply had both words.
So, now search engines are giving more importance to connected words rather than scattered words. How refreshing!
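In code the tweak is about this big. Toy scoring function only: the `score` name, the base score of 1.0 and the `phrase_bonus` of 2.0 are arbitrary assumptions, not anything from the article.

```python
def score(page_text, query, phrase_bonus=2.0):
    """Score 0 if a query word is missing, 1 if all are present, plus a bonus for the exact phrase."""
    words = page_text.lower().split()
    terms = query.lower().split()
    if not terms or not all(t in words for t in terms):
        return 0.0
    n = len(terms)
    # Slide a window of n words over the page and look for the exact phrase.
    has_phrase = any(words[i:i + n] == terms for i in range(len(words) - n + 1))
    return 1.0 + phrase_bonus if has_phrase else 1.0

print(score("the french revolution began in 1789", "French Revolution"))
print(score("a revolution in french cooking", "French Revolution"))
```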
Re: (Score:1)
The Man Behind Google's Ranking Algorithm (Score:2)
http://www.google.com/technology/pigeonrank.html [google.com]
"Millions Of Black Boxes"? (Score:4, Interesting)
"Google rarely allows outsiders to visit the unit, and it has been cautious about allowing Mr. Singhal to speak with the news media about the magical, mathematical brew inside the millions of black boxes that power its search engine."
I could see tens of thousands, maybe hundreds of thousands, but millions?
Re: (Score:1)
IC's, perhaps (Score:2)
It's in Google's interest to have competitors think of it as bigger than it is.
So, if they count each IC on a mobo or drive controller, they probably do have millions of black boxes at Google, literally.
Alternately, they could be talking about algorithms, instances thereof, etc., though I like the black IC's better.
Re: (Score:3, Informative)
This [baselinemag.com] is from a year ago (July 2006):
If this figure is accurate, a million boxen nowadays doesn't seem out of reach.
do no evil? (Score:1)
But then they changed the algorithm and now Google Finance site is at the top.
Re: (Score:2)
Re: (Score:1)
Re: (Score:2)
If they just gave it a few months people would link to it, it would get older etc and its ranking would boost over time. That is the stock response they would give to anyone else that complained. I don't know why they think their algorithm has to list their product as first overnight
Re: (Score:2)
The most informative line in the article... (Score:1)
Old google data (Score:1)
When I search, Google usually gives me information from 2001, 2002, 2003, and it is hard to tell it I want only data from 2006/2007. The problem is that the sites that end up in the search constantly refresh the ads and links around their old stories, which makes Google think it's fresh.
This was not
Re: (Score:3, Insightful)
Re: (Score:1)
Re: (Score:2)
Re:Google sucks. (Score:5, Funny)
Re: (Score:2, Insightful)