Back in November 2009, Google News announced they were “in the midst of an exciting transition period” that included a change to the News Sitemap Protocol. News publishers have through April 2010 to modify their News Sitemap to accommodate the new format. What’s so exciting and transitional? I asked Google, thinking that they were changing the protocol to prepare for some exciting new things in Google News. I was a bit disappointed in the answer, then, when they told me the exciting transition was simply the change to the protocol itself.
The changes do make things a bit easier for News publishers though in a couple of ways:
- You can now reference your News Sitemap in your robots.txt file or ping Google with its location, rather than submitting via Google Webmaster Tools (I would still recommend submitting via Webmaster Tools the first time for the benefit of the parsing error information)
- You can now combine articles of multiple types into one News Sitemap. Previously, you had to separate them by genre, access level, and language.
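As a rough sketch of the first option, the snippet below builds the robots.txt line and a ping URL for a News Sitemap. The `Sitemap:` directive and the `/ping?sitemap=` pattern come from the general sitemaps.org protocol. The function names, the sitemap URL, and the search engine base URL passed to `ping_url` are assumptions for illustration, so check Google's help content for the exact ping address before relying on it.

```python
from urllib.parse import quote


def robots_txt_sitemap_line(sitemap_url: str) -> str:
    """Return the line to add to robots.txt to advertise a Sitemap."""
    return f"Sitemap: {sitemap_url}"


def ping_url(search_engine_url: str, sitemap_url: str) -> str:
    """Build a sitemaps.org-style ping URL; the Sitemap URL must be URL-encoded."""
    base = search_engine_url.rstrip("/")
    return f"{base}/ping?sitemap={quote(sitemap_url, safe='')}"


# Hypothetical News Sitemap location.
NEWS_SITEMAP = "http://www.example.com/news-sitemap.xml"

print(robots_txt_sitemap_line(NEWS_SITEMAP))
# The base URL below is an assumption, not taken from the article.
print(ping_url("http://www.google.com", NEWS_SITEMAP))
```

A script that regenerates the Sitemap could request the ping URL after each rebuild, while the robots.txt line only needs to be added once.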
Since you’ve only got a few months left to modify the scripts that create your News Sitemaps, I thought now would be a good time to recap just what the changes entail and highlight some of the other changes for Google News publishers with the recent addition of the Googlebot-News user agent.
Remember that you can only submit a News Sitemap to Google if you’re already included in the Google News program. Once you are included in the program, News Sitemaps are a vital way to ensure Google crawls and indexes all of your news articles in a timely manner. If you’ve already submitted a News Sitemap, you’ll need to resubmit it once you make the protocol changes. Make sure everything’s in place before April! Google has an FAQ about the transition that might be helpful.
News Sitemap Protocol Changes
The changes to the protocol are straightforward. Previously, when you submitted a News Sitemap to Google, you selected a publication label from a menu. Now, you simply add this label information directly to the Sitemap (the publication label no longer exists in the Google Webmaster Tools UI). This change paves the way for the two other changes mentioned above.
The new tags (which sit inside each URL entry, and so should be specified for each news article) are:
<publication>: Includes child tags for the name of the publication and the language of the article. The name tag must match the way the publication name appears in articles, and the language tag should use the ISO 639 language code.
<genres>: Comma-separated list of one or more of the following accepted values (if applicable): PressRelease, Satire, Blog, OpEd, Opinion, or UserGenerated. Note that several of these are new. Google told me:
The only change here is to simplify the process for our publishers and improve the accuracy of our labeling. The <genre> tag is meant to differentiate between different content types, many of which we use to label articles. Those include press releases, satire, and subscription or registration content. In the past, separate sitemaps had to be submitted for all press releases, satire articles and blog articles from a particular site. This tag allows publishers to submit only one sitemap, and to use those labels on a per-article basis. We have also added additional tags, including op-ed, opinion, user-generated and blog.
Google uses these labels to identify content to readers. For instance, Google News started labeling “blogs” for news readers back in September 2009.
<access>: Subscription or Registration (if applicable).
<title>: Optional tag that enables you to specify the title of the article. You might want to use this tag if Google has had trouble extracting the correct title from your articles in the past.
You can find a full explanation of the current version of the protocol in Google’s help center. An entry in the new Sitemap format might look a little something like this:
<url>
  <loc>http://www.example.com/tv-news/new-buffy-series.php</loc>
  <n:news>
    <n:publication>
      <n:name>Vanessa's World Of Buffy News</n:name>
      <n:language>en</n:language>
    </n:publication>
    <n:access>subscription</n:access>
    <n:genres>pressrelease, blog</n:genres>
    <n:publication_date>2010-01-31</n:publication_date>
    <n:title>Whedon Confirms New Buffy The Vampire Slayer Series</n:title>
    <n:keywords>untruth, wishful thinking, crazy talk</n:keywords>
  </n:news>
</url>
Of the new tags, only the <publication> tag is required. Google says the <genres> and <access> tags are required only when they apply, which seems to me to be the definition of optional. I asked Google if they enforce use of them, but it sounds like their policies haven’t changed to pay any closer attention than before to how things are labeled.
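If a script generates your News Sitemap, the per-article entry could be assembled along these lines. This is only a sketch: `build_news_url` and its parameters are hypothetical names, the two namespace URIs come from the sitemaps.org and Google News Sitemap documentation rather than from this article, and the accepted values (with their capitalization) are copied from the tag descriptions above. Whether Google treats these values as case-sensitive is not something the article establishes.

```python
import xml.etree.ElementTree as ET

# Namespace URIs are assumptions taken from public documentation.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
NEWS_NS = "http://www.google.com/schemas/sitemap-news/0.9"
ALLOWED_GENRES = {"PressRelease", "Satire", "Blog", "OpEd", "Opinion", "UserGenerated"}
ALLOWED_ACCESS = {"Subscription", "Registration"}

ET.register_namespace("", SITEMAP_NS)
ET.register_namespace("n", NEWS_NS)


def _news(tag: str) -> str:
    """Qualify a tag name with the News namespace."""
    return f"{{{NEWS_NS}}}{tag}"


def build_news_url(loc, name, language, publication_date,
                   genres=(), access=None, title=None):
    """Build one <url> entry; only the publication block is mandatory here."""
    unknown = set(genres) - ALLOWED_GENRES
    if unknown:
        raise ValueError(f"unknown genres: {sorted(unknown)}")
    if access is not None and access not in ALLOWED_ACCESS:
        raise ValueError(f"unknown access value: {access}")

    url = ET.Element(f"{{{SITEMAP_NS}}}url")
    ET.SubElement(url, f"{{{SITEMAP_NS}}}loc").text = loc
    news = ET.SubElement(url, _news("news"))
    publication = ET.SubElement(news, _news("publication"))
    ET.SubElement(publication, _news("name")).text = name
    ET.SubElement(publication, _news("language")).text = language
    # Emit <access>, <genres>, and <title> only when they apply.
    if access:
        ET.SubElement(news, _news("access")).text = access
    if genres:
        ET.SubElement(news, _news("genres")).text = ", ".join(genres)
    ET.SubElement(news, _news("publication_date")).text = publication_date
    if title:
        ET.SubElement(news, _news("title")).text = title
    return url


entry = build_news_url(
    loc="http://www.example.com/tv-news/new-buffy-series.php",
    name="Vanessa's World Of Buffy News",
    language="en",
    publication_date="2010-01-31",
    genres=["PressRelease", "Blog"],
    access="Subscription",
)
print(ET.tostring(entry, encoding="unicode"))
```

Because genre and access are set per article, one script can emit press releases, blog posts, and subscription content into the same Sitemap.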
Google News user agent (Googlebot-News), cloaking, first-click free, and subscription content
In December 2009, Google launched a separate user agent for Google News. Previously, Google used the same user agent to crawl for both the news index and web index. As noted in Google’s blog post, this change enables news publishers to opt out of being in Google News, but still be found in web search. (Previously, a publisher had to opt either in or out of both.)
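To illustrate the opt-out, here is a minimal sketch of a robots.txt that blocks only Googlebot-News, checked with Python's standard `urllib.robotparser`. The robots.txt contents, the `can_crawl` helper, and the article URL are assumptions for illustration. The standard-library parser also matches user agents more loosely than Google's own crawler logic may, so treat this as a sanity check rather than a guarantee of how Google reads the file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: keep Google News out, leave everyone else alone.
ROBOTS_TXT = """\
User-agent: Googlebot-News
Disallow: /

User-agent: *
Disallow:
"""


def can_crawl(robots_txt: str, user_agent: str, url: str) -> bool:
    """Report whether the given user agent may fetch the URL under these rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


ARTICLE = "http://www.example.com/tv-news/new-buffy-series.php"
print("Googlebot-News allowed:", can_crawl(ROBOTS_TXT, "Googlebot-News", ARTICLE))
print("Googlebot allowed:", can_crawl(ROBOTS_TXT, "Googlebot", ARTICLE))
```

The same helper can confirm the opposite case: if you want to stay in Google News, make sure no rule in your robots.txt blocks Googlebot-News.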
Because of the new user agent, and the addition of the <access> tag, I wondered if Google was making any changes to its first-click free program. It always seemed a bit odd to me that while Google was vehemently anti-cloaking, their method for getting your subscription-based content into Google News (with the appropriate label) meant that it also had the potential to be indexed for web search (because the user agent was the same). They told me they had nothing to report at the time, but they have since changed the help center content that describes how to make first-click free and subscription-based content available for indexing in Google News.
The help content now says (about ensuring subscription content can be indexed by Google News):
The easiest way to do this is to configure your webservers to not serve the registration when our crawler visits your pages (when the User-Agent is “Googlebot-News”). It is equally important that your robots.txt file allows access by Googlebot-News.
It previously said:
“The easiest way to do this is to configure your webservers to not serve the registration when our crawlers visit your pages (when the User-Agent is “Googlebot”).”
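The server-side check the help content describes might look something like the following. This is a sketch under assumptions: the function name is hypothetical, the article does not give the full User-Agent string Google sends, so a case-insensitive substring match on "Googlebot-News" is a guess, and User-Agent headers are easy to spoof, so a real deployment would probably want an additional way to verify the crawler.

```python
from typing import Optional


def should_show_registration(user_agent: Optional[str]) -> bool:
    """Return False for the Google News crawler so it sees the full article."""
    # Assumption: the crawler identifies itself with "Googlebot-News"
    # somewhere in its User-Agent header.
    return "googlebot-news" not in (user_agent or "").lower()


for ua in ("Googlebot-News", "Googlebot", "Mozilla/5.0 (Windows NT 6.1)"):
    action = "registration page" if should_show_registration(ua) else "full article"
    print(f"{ua!r}: serve {action}")
```

Note that under the updated wording the regular Googlebot user agent no longer bypasses the registration page, which is the difference between the two versions of the help text quoted above.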
Google also recently modified its First-Click Free program to enable publishers to limit free access.
Google News recrawls
Just last week, Google announced that they’ll now recrawl articles for Google News. This is great for publishers. Before this, once your article was published and Google had crawled it for the News index, that was it: no changes would ever be reflected in Google News. This could make Google News seem a bit more like print than part of the dynamically changing web. Fortunately, this has changed, and Google will now recheck articles it’s already crawled for any changes, just as it’s always done for its web index.
For more information on gaining visibility in Google News, see:
- Googler Maile Ohye’s Video Tips For News Publishers
- Google News Blog: Tips For Helping Google News Crawl Your Site
- Google News Blog: Myths and Truths
- Under the Hood: Google News and Ranking Stories
- Danny Sullivan interviews Google News’s Josh Cohen
- Google News Help Center
- Search Engine Land Google News Library