For years, Google’s discovery of web pages was based solely on links. If a page had no links to it, Googlebot had no way of knowing about it and therefore would never index it. Along the way, Google provided an option for submitting individual pages, but that wasn’t a viable option for owners of large sites. In 2005, Google launched XML Sitemaps, a much more scalable way for site owners to let Google know about pages that Googlebot might not otherwise discover through links. Today, a Google Webmaster Central blog post discusses another way Googlebot may discover pages: feeds. According to the post, using RSS and Atom feeds to discover pages helps Google learn about new content quickly.
New content is key for Google since freshness is a vital component of relevance for some queries. Conventional wisdom is that it’s not all that useful to ensure Google knows about pages of your site if they don’t have links to them, because without links, Google won’t see them as valuable. But current ranking is much more complicated than the original PageRank formula describes, and new content with no links may very well trump content with an abundance of links if it makes sense for the query.
Of course, site owners have always been able to submit RSS and Atom feeds as Sitemaps, but this post describes Google using these feeds even if the site owner hasn’t submitted them via the Sitemap system. Instead, Google finds the feeds by scanning other feed submission systems, such as Google Reader and ping services.
It’s unclear from the post whether the feeds are being used solely for discovery or whether the content from the feeds is being used in place of crawling as well. The title of the post references “discovery,” but the post itself notes that Google is able to “get these new pages into our index more quickly than traditional crawling methods” and to crawl feeds directly. If Google is using the feeds in place of crawling, this would be another argument in favor of full rather than partial feeds: you’d get more of a page’s content indexed more quickly. Google Blogsearch initially crawled feed content rather than the actual pages, which led to partial indexing in Blogsearch, but this changed late last year.
The post notes that in order for Google to use a feed as a discovery method, the feed must not be blocked by robots.txt.
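If you want to confirm that your own feed isn’t blocked, a quick local check is possible. The sketch below uses Python’s standard-library robots.txt parser; the helper name `feed_is_crawlable`, the example.com feed URL, and both sets of rules are made up for illustration. Python’s parser may not match Googlebot’s rule handling in every edge case, so treat the result as a sanity check rather than a guarantee.

```python
from urllib.robotparser import RobotFileParser


def feed_is_crawlable(robots_txt: str, feed_url: str, user_agent: str = "Googlebot") -> bool:
    """Return True if the given robots.txt text allows user_agent to fetch feed_url.

    Hypothetical helper for illustration; it parses robots.txt text that you
    supply, so no network request is made.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, feed_url)


# Illustrative rules: this robots.txt blocks everything under /feed/.
BLOCKING_RULES = """\
User-agent: *
Disallow: /feed/
"""

# Illustrative rules: this robots.txt blocks only /private/, so the feed is reachable.
OPEN_RULES = """\
User-agent: *
Disallow: /private/
"""

# Assumed feed location; substitute your site's real feed URL.
FEED_URL = "https://example.com/feed/atom.xml"

print(feed_is_crawlable(BLOCKING_RULES, FEED_URL))
print(feed_is_crawlable(OPEN_RULES, FEED_URL))
```

With the first set of rules the feed path falls under a Disallow line and the check fails; with the second it passes. To test a live site, paste the contents of its robots.txt into the function in place of the sample rules.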