Today, Google announced that they will no longer be crawling news sites with Googlebot-News and instead will crawl news sites with Googlebot, the same bot that crawls sites for web search. However, you can still block your content from being indexed in Google News by disallowing Googlebot-News in robots.txt or using a meta robots tag.
Blocking Content From Google News
Seem confusing? On the one hand, it’s not at all.
If you want Google to index your content in both web search and News (if you are a Google News publisher), then you don’t need to do anything. Google will keep crawling as it always has, but if you look at your server logs, you’ll only see entries for Googlebot rather than entries for both Googlebot and Googlebot-News.
If you want to keep your content out of Google News, you can keep using the Disallow directive in robots.txt (or a meta robots tag) to block Googlebot-News. Even though Google will now crawl as Googlebot rather than Googlebot-News, they’ll still respect the Googlebot-News robots.txt directive.
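As a rough illustration of what that Disallow rule does, here is a minimal sketch that uses Python’s standard urllib.robotparser module. The robots.txt content, the article path, and the is_allowed helper are hypothetical, and Python’s parser is only a stand-in: it is not Google’s implementation and may match user agents differently in other configurations.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: keep the whole site out of Google News only.
ROBOTS_TXT = """\
User-agent: Googlebot-News
Disallow: /
"""


def is_allowed(user_agent: str, path: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if the given user-agent token may fetch the path."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, path)


if __name__ == "__main__":
    # The article path is made up for illustration.
    for agent in ("Googlebot-News", "Googlebot"):
        print(agent, is_allowed(agent, "/news/some-article.html"))
```

With this file, the Googlebot-News token is blocked while the Googlebot token is not, which is the "web search yes, News no" setup described above.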
However, you can no longer disallow Googlebot while allowing Googlebot-News, as you can for other specialized Googlebots and as you could before this change.
Gathering Data About How Your Site Is Crawled
On the other hand, this change makes things a lot more confusing if you’re using data to understand how your site is crawled and make improvements.
For instance, suppose you notice that your news articles aren’t being indexed in Google News, and the news-specific crawl errors in Google Webmaster Tools don’t show any problems. You can no longer check your server logs to see if those articles are being crawled for the news index. You can see if the pages are being crawled generally, but this less granular insight makes it tougher to troubleshoot problems.
In this example, you may be generating a news-specific Sitemap and that generation process may be missing specific URLs. You used to be able to review your server logs, see that Googlebot-News was crawling particular URLs but not others, and then check to see if the URLs that hadn’t been crawled were in the Sitemap. Now, all the server logs will tell you is whether Google is crawling the URLs at all. If they are being crawled for web search but not News, that detail is now lost.
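The log check that is still possible can be sketched roughly as follows. This assumes a combined-format access log and a list of URL paths already extracted from your Sitemap. The sample log lines, IP addresses, paths, and function names are all made up for illustration. Note that it can only tell you whether Googlebot requested a URL at all, not which index the request was for.

```python
import re

# Pulls the request path and user-agent out of a combined-format log line.
LOG_LINE = re.compile(
    r'"(?:GET|HEAD) (?P<path>\S+) [^"]*" \d{3} \S+ "[^"]*" "(?P<agent>[^"]*)"'
)

# Hypothetical log excerpt: one Googlebot request, one browser request.
SAMPLE_LOG = [
    '66.249.66.1 - - [01/Jan/2011:00:00:01 +0000] "GET /news/a.html HTTP/1.1" '
    '200 512 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; '
    '+http://www.google.com/bot.html)"',
    '203.0.113.7 - - [01/Jan/2011:00:00:02 +0000] "GET /news/b.html HTTP/1.1" '
    '200 734 "-" "Mozilla/5.0 (Windows NT 6.1) Firefox/4.0"',
]


def crawled_paths(log_lines, agent_token="Googlebot"):
    """Collect the paths requested by user agents containing agent_token."""
    paths = set()
    for line in log_lines:
        match = LOG_LINE.search(line)
        if match and agent_token in match.group("agent"):
            paths.add(match.group("path"))
    return paths


def uncrawled(sitemap_paths, log_lines):
    """Return Sitemap paths that never appear in a Googlebot request."""
    return sorted(set(sitemap_paths) - crawled_paths(log_lines))


if __name__ == "__main__":
    print(uncrawled(["/news/a.html", "/news/b.html"], SAMPLE_LOG))
```

Here /news/b.html is reported as uncrawled because only a browser requested it. Before the change, passing "Googlebot-News" as agent_token would have narrowed the check to News crawling; now there is no such token in the logs to filter on.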
You lose granular insight for web search as well. If you are tracking down why particular pages on your site aren’t indexed, you could previously review your server logs to see if they were being crawled, but now it will appear as though they are, even if they are only being crawled for Google News.
You can still get News-specific and web-specific crawl errors from Google Webmaster Tools, so some insight is still available. In terms of granularity, Google tells me that the “URLs restricted by robots.txt” report in Google Webmaster Tools includes only the pages blocked from web search and not URLs blocked from Google News.
However, it doesn’t sound like you can currently see a list of URLs Google tried to crawl but didn’t because Googlebot-News was blocked. Unfortunately, the robots.txt analysis tool in Google Webmaster Tools also doesn’t let you test URLs blocked in Google News separately from web search, so it would be tough to determine if you were accidentally blocking URLs from indexing in Google News.
This change seems like a bit of a step backward to me. When Google News first launched, Googlebot crawled for both web search and News, and news publishers asked for a news-specific bot. Certainly, the most important reason for that request was the ability to block and allow content from Google News separately from web search, and that functionality remains. However, the granular insight the separate bot provided was useful as well, and it’s unfortunate that it will now be lost.