Google’s Cutts Explains How Google Search Works

Google’s head of web spam, Matt Cutts, posted an 8-minute video on how Google search works. From crawling and indexing to ranking, he gives a brief overview of how Google’s search engine does its job.

Matt explains how PageRank is used and covers crawling timelines, frequencies and priorities, as well as the indexing and filtering processes within the databases.

Here is the video:

Here is the transcript provided by YouTube:

0:00 MATT CUTTS: Hi, everybody.
0:01 We got a really interesting and very expansive question
0:04 from RobertvH in Munich.
0:06 RobertvH wants to know–
0:09 Hi Matt, could you please explain how Google’s ranking
0:12 and website evaluation process works starting with the
0:14 crawling and analysis of a site, crawling time lines,
0:18 frequencies, priorities, indexing and filtering
0:21 processes within the databases, et cetera?
0:25 OK.
0:25 So that’s basically just like, tell me
0:27 everything about Google.
0:28 Right?
0:29 That’s a really expansive question.
0:30 It covers a lot of different ground.
0:32 And in fact, I have given orientation lectures to
0:35 engineers when they come in.
0:37 And I can talk for an hour about all those different
0:40 topics, and even talk for an hour about a very small subset
0:43 of those topics.
0:45 So let me talk for a while and see how much of a feel I can
0:48 give you for how the Google infrastructure works, how it
0:51 all fits together, how our crawling and indexing and
0:53 serving pipeline works.
0:55 Let’s dive right in.
0:57 So there’s three things that you really want to do well if
0:59 you want to be the world’s best search engine.
1:01 You want to crawl the web comprehensively and deeply.
1:03 You want to index those pages.
1:05 And then you want to rank or serve those pages and return
1:08 the most relevant ones first.
1:10 Crawling is actually more difficult
1:11 than you might think.
1:13 Whenever Google started, whenever I joined back in
1:16 2000, we didn’t manage to crawl the web for something
1:18 like three or four months.
1:20 And we had to have a war room.
1:22 But a good way to think about the mental model is we
1:25 basically take PageRank as the primary determinant.
1:28 And the more PageRank you have– that is, the more
1:31 people who link to you and the more reputable those people
1:34 are– the more likely it is we’re going to discover your
1:37 page relatively early in the crawl.
1:39 In fact, you could imagine crawling in strict PageRank
1:41 order, and you’d get the CNNs of the world and The New York
1:45 Times of the world and really very high PageRank sites.
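
Crawling in strict rank order, as imagined here, amounts to pulling URLs off a priority queue keyed by score. The sketch below assumes the scores are already known; the URLs, the score values and the `crawl_order` helper are hypothetical and only illustrate the ordering, not how Google's crawler is actually built.

```python
import heapq


def crawl_order(page_scores, budget):
    """Return up to `budget` URLs, highest score first."""
    # heapq is a min-heap, so scores are negated to pop the largest first.
    heap = [(-score, url) for url, score in page_scores.items()]
    heapq.heapify(heap)
    order = []
    while heap and len(order) < budget:
        _, url = heapq.heappop(heap)
        order.append(url)
    return order


# Hypothetical scores for illustration only.
scores = {
    "cnn.example": 0.9,
    "nytimes.example": 0.8,
    "small-blog.example": 0.1,
    "new-site.example": 0.05,
}
first_to_crawl = crawl_order(scores, budget=3)
```

With a limited crawl budget, the low-scoring pages are simply reached later or not at all, which matches the point that high-rank pages are discovered early in the crawl.
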
1:49 And if you think about how things used to be, we used to
1:51 crawl for 30 days.
1:53 So we’d crawl for several weeks.
1:56 And then we would index for about a week.
1:59 And then we would push that data out.
2:01 And that would take about a week.
2:04 And so that was what the Google dance was.
2:05 Sometimes you’d hit one data center that had old data.
2:07 And sometimes you’d hit a data center that had new data.
2:10 Now there’s various interesting tricks
2:13 that you can do.
2:13 For example, after you’ve crawled for 30 days, you can
2:16 imagine recrawling the high PageRank guys so you can see
2:19 if there’s anything new or important that’s hit on the
2:21 CNN home page.
2:22 But for the most part, this is not fantastic.
2:25 Right?
2:25 Because if you’re trying to crawl the web and it takes you
2:28 30 days, you’re going to be out-of-date.
2:30 So eventually, in 2003, I believe, we switched as part
2:36 of an update called Update Fritz to crawling a fairly
2:40 interesting significant chunk of the web every day.
2:43 And so if you imagine breaking the web into a certain number
2:47 of segments, you could imagine crawling that part of the web
2:51 and refreshing it every night.
2:53 And so at any given point, your main base index would
2:58 only be so out of date.
3:00 Because then you’d loop back around and you’d refresh that.
3:03 And that works very, very well.
3:04 Instead of waiting for everything to finish, you’re
3:06 incrementally updating your index.
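
One way to picture this incremental scheme is to assign every URL to a fixed segment and refresh one segment per night in rotation. The sketch below is a guess at the general shape only: the hash-based assignment, the segment count and the helper names are assumptions for illustration.

```python
import hashlib


def segment_of(url, num_segments):
    """Assign a URL to a stable segment (hashing is an assumed scheme)."""
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_segments


def urls_to_refresh(urls, night, num_segments):
    """Return the URLs whose segment is due for recrawling on this night."""
    due_segment = night % num_segments
    return [url for url in urls if segment_of(url, num_segments) == due_segment]


# Hypothetical URLs; with 4 segments every page is refreshed once per 4 nights.
urls = [f"site{i}.example/page" for i in range(20)]
num_segments = 4
refreshed = [
    url
    for night in range(num_segments)
    for url in urls_to_refresh(urls, night, num_segments)
]
```

Under these assumptions no page in the index is ever more than `num_segments` nights out of date, because the rotation loops back around to each segment.
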
3:08 And we’ve gotten even better over time.
3:10 So at this point, we can get very, very fresh.
3:14 Any time we see updates, we can usually
3:16 find them very quickly.
3:18 And in the old days, you would have not just a main or a base
3:20 index, but you could have what were called supplemental
3:24 results, or the supplemental index.
3:26 And that was something that we wouldn’t crawl and refresh
3:28 quite as often.
3:29 But it was a lot more documents.
3:31 And so you could almost imagine having really fresh
3:35 content, a layer of our main index, and then more documents
3:40 that are not refreshed quite as often, but there’s a lot
3:42 more of them.
3:43 So that’s just a little bit about the crawl and how to
3:45 crawl comprehensively.
3:47 What you do then is you pass things around.
3:49 And you basically say, OK, I have crawled a large fraction
3:53 of the web.
3:54 And within that web you have, for example, one document.
3:58 And indexing is basically taking things in word order.
4:04 Well, let’s just work through an example.
4:06 Suppose you say Katy Perry.
4:10 In a document, Katy Perry appears right
4:13 next to each other.
4:14 But what you want in an index is which documents does the
4:18 word Katy appear in, and which documents does the word
4:20 Perry appear in?
4:22 So you might say Katy appears in documents 1, and 2, and 89,
4:26 and 555, and 789.
4:32 And Perry might appear in documents number 2, and 8, and
4:37 73, and 555, and 1,000.
4:42 And so the whole process of doing the index is reversing,
4:47 so that instead of having the documents in word order, you
4:50 have the words, and they have it in document order.
4:53 So it’s, OK, these are all the documents that a
4:54 word appears in.
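
The reversal described here is usually called building an inverted index. A minimal sketch, assuming whitespace tokenization and lowercase matching (both simplifications), with made-up document texts:

```python
from collections import defaultdict


def build_inverted_index(documents):
    """Map each word to the sorted list of document IDs that contain it."""
    index = defaultdict(list)
    for doc_id in sorted(documents):
        seen = set()
        for word in documents[doc_id].lower().split():
            if word not in seen:
                # Record each document at most once per word.
                seen.add(word)
                index[word].append(doc_id)
    return dict(index)


# Hypothetical documents, keyed by document number.
documents = {
    1: "Katy went home",
    2: "Katy Perry released a song",
    8: "Perry the platypus",
}
index = build_inverted_index(documents)
```

Here `index["katy"]` lists documents 1 and 2, and `index["perry"]` lists documents 2 and 8, so the words now point to documents instead of the other way around.
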
4:56 Now when someone comes to Google and they type in Katy
4:59 Perry, you want to say, OK, what documents might match
5:02 Katy Perry?
5:03 Well, document one has Katy, but it doesn’t have Perry.
5:06 So it’s out.
5:08 Document number two has both Katy and Perry, so that’s a
5:11 possibility.
5:12 Document eight has Perry but not Katy.
5:15 89 and 73 are out because they don’t have the right
5:18 combination of words.
5:19 555 has both Katy and Perry.
5:22 And then these two are also out.
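
The walk-through above is an intersection of two sorted document lists. It can be sketched as a two-pointer merge over the same document numbers used in the example; the function name `intersect` is an assumption, and real systems use far more elaborate structures.

```python
def intersect(postings_a, postings_b):
    """Return the document IDs present in both sorted lists."""
    i = j = 0
    result = []
    while i < len(postings_a) and j < len(postings_b):
        a, b = postings_a[i], postings_b[j]
        if a == b:
            result.append(a)
            i += 1
            j += 1
        elif a < b:
            # Advance whichever list is behind.
            i += 1
        else:
            j += 1
    return result


# Document lists taken from the example in the transcript.
katy = [1, 2, 89, 555, 789]
perry = [2, 8, 73, 555, 1000]
matches = intersect(katy, perry)  # documents 2 and 555
```
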
5:25 And so when someone comes to Google and they type in
5:27 Chicken Little, Britney Spears, Matt Cutts, Katy
5:29 Perry, whatever it is, we find the documents that we believe
5:32 have those words, either on the page or maybe in back
5:35 links, in anchor text pointing to that document.
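
Matching on anchor text means a document can be selected for words that appear only in links pointing to it. A rough sketch, assuming a simple word-to-documents map; the page texts, the link tuple and the function name are hypothetical:

```python
from collections import defaultdict


def index_with_anchor_text(pages, links):
    """Index words from page text and from anchor text of incoming links.

    `pages` maps document ID to text; `links` is a list of
    (anchor_text, target_doc_id) pairs.
    """
    index = defaultdict(set)
    for doc_id, text in pages.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    for anchor_text, target in links:
        # Words in the link text are credited to the page being linked to.
        for word in anchor_text.lower().split():
            index[word].add(target)
    return index


# Document 1 never mentions the name on the page, but a link to it does.
pages = {1: "official site", 2: "fan page about katy perry"}
links = [("Katy Perry", 1)]
anchor_index = index_with_anchor_text(pages, links)
```
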
5:38 Once you’ve done what’s called document selection, you try to
5:41 figure out, how should you rank those?
5:43 And that’s really tricky.
5:44 We use PageRank as well as over 200 other factors in our
5:49 rankings to try to say, OK, maybe this document is really
5:52 authoritative.
5:53 It has a lot of reputation because it has
5:55 a lot of PageRank.
5:56 But it only has the word Perry once.
5:58 And it just happens to have the word Katy somewhere else
6:01 on the page.
6:02 Whereas here is a document that has the word Katy and
6:04 Perry right next to each other, so there’s proximity.
6:07 And it’s got a lot of reputation.
6:09 It’s got a lot of links pointing to it.
6:12 So we try to balance that off.
6:13 You want to find reputable documents that are also about
6:16 what the user typed in.
6:18 And that’s kind of the secret sauce, trying to figure out a
6:20 way to combine those 200 different ranking signals in
6:23 order to find the most relevant document.
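
The trade-off between reputation and proximity can be illustrated with a two-signal score. The actual combination of signals is not described in the video, so everything below is an assumption: the equal weights, the `1 / distance` proximity measure, the reputation values and the sample texts.

```python
def min_distance(words, first, second):
    """Smallest gap, in words, between any occurrence of the two terms."""
    positions_first = [i for i, w in enumerate(words) if w == first]
    positions_second = [i for i, w in enumerate(words) if w == second]
    if not positions_first or not positions_second:
        return None
    return min(abs(a - b) for a in positions_first for b in positions_second)


def score(text, reputation, first, second, w_reputation=0.5, w_proximity=0.5):
    """Blend a reputation value (0 to 1) with a proximity value (0 to 1)."""
    words = text.lower().split()
    distance = min_distance(words, first, second)
    # Adjacent words give proximity 1.0; words far apart approach 0.
    proximity = 0.0 if distance is None else 1.0 / max(distance, 1)
    return w_reputation * reputation + w_proximity * proximity


# Hypothetical documents: one more reputable, one with the words adjacent.
scattered = score("perry family history page mentions a girl named katy",
                  reputation=0.9, first="katy", second="perry")
adjacent = score("katy perry tour dates",
                 reputation=0.7, first="katy", second="perry")
```

With these made-up weights the page with the two words side by side outscores the more reputable page where they are far apart; different weights would shift that balance.
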
6:25 So at any given time, hundreds of millions of times a day,
6:30 someone comes to Google.
6:32 We try to find the closest data center to them.
6:34 They type in something like Katy Perry.
6:36 We send that query out to hundreds of different machines
6:38 all at once, which look through their little tiny
6:41 fraction of the web that we’ve indexed.
6:43 And we find, OK, these are the documents that
6:45 we think best match.
6:47 All those machines return their matches.
6:49 And we say, OK, what’s the creme de la creme?
6:52 What’s the needle in the haystack?
6:53 What’s the best page that matches this query across our
6:56 entire index?
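
Sending one query to many machines and merging their answers is often called scatter-gather. The sketch below imitates it with threads on one computer; the shard contents, the occurrence-count scoring and the function names are assumptions made for illustration.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor


def search_shard(shard, query_words):
    """Return (score, doc_id) pairs for documents containing every query word."""
    hits = []
    for doc_id, text in shard.items():
        words = text.lower().split()
        if all(q in words for q in query_words):
            # Toy score: total number of query-word occurrences.
            hits.append((sum(words.count(q) for q in query_words), doc_id))
    return hits


def search(shards, query, top_k=3):
    """Query every shard in parallel, then keep the best results overall."""
    query_words = query.lower().split()
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        partial_results = pool.map(lambda s: search_shard(s, query_words), shards)
        all_hits = [hit for hits in partial_results for hit in hits]
    return [doc_id for _, doc_id in heapq.nlargest(top_k, all_hits)]


# Each dict stands in for one machine holding a small fraction of the index.
shards = [
    {2: "katy perry tour", 8: "perry mason"},
    {555: "katy perry katy perry lyrics", 789: "katy smith"},
]
best = search(shards, "Katy Perry")
```
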
6:57 And then we take that page and we try to show it with a
7:00 useful snippet.
7:01 So you show the keywords in the context of the document.
7:03 And you get it all back in under half a second.
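
Showing the keywords in context can be sketched as taking a small window of text around the first match. The window size, the asterisk markers and the `make_snippet` helper are assumptions for the example only.

```python
def make_snippet(text, query_words, context=3):
    """Return a short excerpt around the first query word found in the text."""
    words = text.split()
    wanted = {q.lower() for q in query_words}

    def is_match(word):
        return word.lower().strip(".,!?") in wanted

    for i, word in enumerate(words):
        if is_match(word):
            start = max(0, i - context)
            end = min(len(words), i + context + 1)
            # Mark every query word that falls inside the window.
            marked = [f"*{w}*" if is_match(w) else w for w in words[start:end]]
            prefix = "... " if start > 0 else ""
            suffix = " ..." if end < len(words) else ""
            return prefix + " ".join(marked) + suffix
    # No match: fall back to the opening words of the document.
    return " ".join(words[: 2 * context + 1])


snippet = make_snippet(
    "The pop singer Katy Perry announced new tour dates today.",
    ["katy", "perry"],
    context=2,
)
```
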
7:06 So that’s probably about as long as we can go on without
7:10 straining YouTube.
7:11 But that just gives you a little bit of a feel about how
7:13 the crawling system works, how we index documents, how things
7:16 get returned in under half a second through that massive
7:19 parallelization.
7:20 I hope that helps.
7:21 And if you want to know more, there’s a whole bunch of
7:23 articles and academic papers about Google, and PageRank,
7:26 and how Google works.
7:28 But you can also apply to–
7:30 there’s jobs@google.com, I think, or google.com/jobs, if
7:34 you’re interested in learning a lot more about how search
7:36 engines work.
7:37 OK.
7:37 Thanks very much.

About The Author: Barry is Search Engine Land's News Editor and owns RustyBrick, a NY-based web consulting firm. He also runs Search Engine Roundtable, a popular search blog on very advanced SEM topics. Barry's personal blog is named Cartoon Barry.

  • http://www.antonkoekemoer.com/ Anton Koekemoer

    Nice explanation on how Google search works.

  • http://www.wordpresssites.net/wordpress/ BradleyDalton.com

    Has he finally worked it out? 

  • http://www.loudcow.com.au Catie Hughes

    Mmmmmm – I could listen to Matt Cutts for hours. What a great summation – all in 8 minutes – impressive!

  • phale2000

    Cutts definitely has a way of explaining things so it is understandable — unlike most of Google’s explanations, he makes it all sound simple

  • Eye Paq

    It looks simple because his not saying to much :)
    But overall this one is a great beginner guide on how things works.

  • http://twitter.com/aestarrocker aestar rocker

    a great guide , hope to hear some updates from cutts whenever updates are there

  • Мартин Пацеков

    Great content. Actually, these all the major, let’s say basics of Google, but it’s still a very good reminder of how Google works.

  • http://www.cbil360.com/ Website Design Company

    Hey Matt,
      It has been discovered that Google gives priority to content rich websites, but what about for sites of organization website or website of any company, especially medium and small scale industry. As a search engine do not you think they must be given good exposure? Can you give some tips for them.

  • http://www.silkstream.net/ Hayley (Silkstream)

    “Chicken Little, Britney Spears, Matt Cutts, Katy Perry” Almost genius. Although it’s a fine line separating intelligence and insanity.

  • Rajesh Magar

    Thanks Good one

  • Rajesh Magar

    Well this one is helpful too.
    But here’s the best video to know all about it. (From +Matt Cutts only) http://www.youtube.com/watch?v=BNHR6IQJGZs

  • http://www.netlawman.com.au/partnership-agreements-australia partnership agreement

    Good Read! Finally Some One From Google Defined Google Working Mechanics in Deep Detail! Though There are Lot of Question Still needed to be answered in efficient manner, For Instance How Search Engine Determines Content Quality, What Sort of Back Links got Value from Google? Role and Importance of Social Media While Optimizing a Site??

  • http://www.socialcubix.com/services/facebook/application-development Facebook Applications

    He hasn’t really explained in detail, some statements are still vague and unclear.

  • TaylorMiles

    I’m sorry seriously guys….you re post a video of matt with a transcript that could have been done by a computer?  this is worth of searchenglineland post? 

  • http://twitter.com/OptumusAnalytic Optumus Analytics

    This is great and informative. 
