Google’s New Indexing Infrastructure “Caffeine” Now Live
Google first mentioned their new indexing infrastructure, Caffeine, back in August 2009 in order to solicit feedback, then launched it at one data center in November. Finally, it’s live everywhere. The Google blog calls it a “whole new web indexing system” that’s “more than 50 percent fresher than our last index and it’s the largest collection of web content we’ve offered”.
So what is Caffeine and what does its launch mean for searchers and content owners?
Maile Ohye of Google’s Webmaster Central told me, “the entire web is expanding and evolving and Caffeine means that we can better evolve with it. As the ecosystem improves, we improve too and return more relevant content to searchers.” Google’s Matt Cutts added that “Caffeine benefits both searchers and content owners because it means that all content (and not just content deemed ‘real time’) can be searchable within seconds after it’s crawled.”
Caffeine is a revamp of Google’s indexing infrastructure. It is not a change to Google’s ranking algorithms. It is live across all data centers, regions, and languages.
Content is available to searchers more quickly
Previously, Google’s crawling and indexing systems worked as batch processes. Googlebot would crawl a set of pages, then process those pages (extracting content from them, associating data with them, such as anchor text and external links, and determining what those pages were about), and finally add them to the index. While this system was continuous, every document in a batch had to wait until the whole batch was processed before being pushed live. Now, when Google crawls a page, it processes that page through the entire indexing pipeline and pushes it live nearly instantly. According to Google, this change has already produced an index that is more than 50 percent fresher than before.
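To make the difference concrete, here is a toy sketch of the two approaches. It is only an illustration under simple assumptions, not Google’s actual pipeline: the names `CrawledPage`, `batch_live_times`, and `incremental_live_times` are made up for this example, processing is assumed to take a fixed number of seconds per page, and the incremental version assumes processing always keeps up with crawling.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class CrawledPage:
    url: str
    crawled_at: float  # seconds since the crawl started (hypothetical clock)


def batch_live_times(pages: List[CrawledPage], batch_size: int,
                     process_seconds: float) -> Dict[str, float]:
    """Batch model: a page goes live only when its whole batch is processed."""
    live = {}
    for start in range(0, len(pages), batch_size):
        batch = pages[start:start + batch_size]
        # Assumption: processing begins once the last page in the batch has
        # been crawled, and pages are processed one after another.
        done = max(p.crawled_at for p in batch) + process_seconds * len(batch)
        for p in batch:
            live[p.url] = done
    return live


def incremental_live_times(pages: List[CrawledPage],
                           process_seconds: float) -> Dict[str, float]:
    """Incremental model: each page goes live right after it is processed."""
    return {p.url: p.crawled_at + process_seconds for p in pages}


pages = [CrawledPage("a", 0.0), CrawledPage("b", 10.0),
         CrawledPage("c", 20.0), CrawledPage("d", 30.0)]
batch = batch_live_times(pages, batch_size=2, process_seconds=1.0)
incremental = incremental_live_times(pages, process_seconds=1.0)
for p in pages:
    print(p.url, "batch:", batch[p.url], "incremental:", incremental[p.url])
```

In this toy run, page “a” is crawled at second 0 but waits until second 12 to go live in the batch model, because it has to wait for page “b” to be crawled and for the whole batch to be processed. In the incremental model it is live at second 1. Neither model changes when pages are crawled, only how long they wait afterward.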
Note that the introduction of Caffeine doesn’t necessarily mean that pages will be crawled on a faster schedule than before. It simply means that once those pages are crawled, they are made available to searchers much more quickly. (Remember, you can estimate how often your pages are crawled by taking a look at your server logs or checking the cache dates in Google.)
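If you want to try the server log approach, a minimal sketch might look like the following. It assumes your server writes the common “combined” access log format and that Googlebot identifies itself with “Googlebot” in the user agent string. The function names are made up for this example, and because user agents can be spoofed, treat the numbers as a rough estimate only.

```python
import re
from collections import defaultdict
from datetime import datetime

# Assumes the "combined" access log format; adjust for your server's format.
LOG_LINE = re.compile(
    r'\S+ \S+ \S+ \[(?P<time>[^\]]+)\] "(?:GET|HEAD) (?P<path>\S+) [^"]*" '
    r'\d{3} \S+ "[^"]*" "(?P<agent>[^"]*)"'
)


def googlebot_visits(lines):
    """Map each requested path to the timestamps of requests whose
    user agent mentions Googlebot."""
    visits = defaultdict(list)
    for line in lines:
        match = LOG_LINE.match(line)
        if match and "Googlebot" in match.group("agent"):
            stamp = datetime.strptime(match.group("time"),
                                      "%d/%b/%Y:%H:%M:%S %z")
            visits[match.group("path")].append(stamp)
    return visits


def average_gap_hours(timestamps):
    """Average time between consecutive visits, in hours.
    Returns None when there are fewer than two visits."""
    if len(timestamps) < 2:
        return None
    ordered = sorted(timestamps)
    gaps = [(later - earlier).total_seconds()
            for earlier, later in zip(ordered, ordered[1:])]
    return sum(gaps) / len(gaps) / 3600
```

To use it, pass an open log file (or any list of log lines) to `googlebot_visits`, then call `average_gap_hours` on the timestamps for the page you care about. The log file’s location depends on your server setup.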
Google’s storage capacity has greatly increased
While Google’s index is not significantly larger than before at the moment, the new indexing infrastructure makes a much larger index possible, which only makes sense. If Caffeine is intended to help Google better evolve as the web does, then it needs significant storage capacity. The web grows by leaps and bounds every day, certainly much faster than anyone could have imagined back when Google first launched.
Google’s flexibility in storing information about documents has greatly increased
Google has always associated a variety of details with documents it stores. (In this context, a “document” refers to any piece of web content, such as a web page, image, or video.) For instance, when Google indexes a web page, it also stores information about what external pages link to that page and what anchor text is used in those links. The Caffeine infrastructure provides more flexibility in the type of details that can be stored with a document. As the web changes and new valuable data about web content emerges, Google won’t have to build new code to take advantage of it. This means that while Caffeine itself is not a ranking algorithm change, it could impact ranking in the future (as new signals are associated with pages).
Matt Cutts told me, “It’s important to realize that Caffeine is only a change in our indexing architecture. What’s exciting about Caffeine, though, is that it allows easier annotation of the information stored with documents, and subsequently can unlock the potential of better ranking in the future with those additional signals.”
Update: In Matt’s keynote at SMX Advanced, he gave an example of additional data that Google can now store for documents. He said, “you might imagine that before we could associate a page with only one country, whereas now, we could potentially associate that page with several countries”. (Note that he wasn’t saying this was something that Google does now; just that it was an example of what is possible with the new infrastructure.)
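One way to picture this kind of flexibility is the difference between a record with fixed fields and a record with an open-ended set of annotations. The sketch below is purely hypothetical: the class names, field names, and annotation keys are invented for illustration and say nothing about how Google actually stores documents. It borrows Matt’s example of one country versus several.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class FixedDocument:
    """Rigid record: every new kind of detail needs a new field and new code."""
    url: str
    anchor_texts: List[str]
    country: str  # room for exactly one country per page


@dataclass
class AnnotatedDocument:
    """Flexible record: details live in an open-ended annotation map."""
    url: str
    annotations: Dict[str, Any] = field(default_factory=dict)

    def annotate(self, name: str, value: Any) -> None:
        # New kinds of details can be attached without changing the class.
        self.annotations[name] = value


page = AnnotatedDocument(url="http://example.com/")
page.annotate("anchor_texts", ["example site"])
page.annotate("countries", ["US", "CA"])  # several countries, no schema change
```

With the fixed record, supporting several countries would mean changing the record definition and everything that reads it. With the annotated record, it is just another entry in the map.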
How can content owners best take advantage of the new infrastructure?
Content owners will reap the benefits of Caffeine without doing anything at all. In fact, there’s really not much, if anything, content owners can do. Some may wonder if this change means that existing best practices around crawl efficiency matter more than before. Is page speed, which Google has focused on more lately, more important? Nope. Google told me that this change doesn’t make any of the crawling, indexing, or ranking factors more or less important than before. It simply makes crawled content available in search results more quickly than before and paves the way for added flexibility in taking advantage of whatever may come as the web evolves.
Opinions expressed in this article are those of the guest author and not necessarily those of Search Engine Land.