It’s been noticed that new Associated Press stories — hosted by Google itself — are no longer appearing in Google News. It’s true. Since Dec. 24, Google has not added new AP content, something the company confirmed to me today. I received this statement:
We have a licensing agreement with the Associated Press that permits us to host its content on Google properties such as Google News. Some of that content is still available today. At the moment we’re not adding new hosted content from the AP.
So why not? The statement doesn’t explain. But it’s reasonable to assume it’s related to the ongoing talks between Google and the Associated Press.
Google has an agreement to host AP articles on its own web site, plus to make use of AP material in other ways. That expires near the end of this month. Since the agreement only allows stories to be hosted for 30 days, it might be that Google’s covering its legal bases in case a new agreement isn’t reached. You don’t want a story going up on, say, January 23, only to have to pull it down the next day.
Google News Now Hosting Wire Stories & Promises Better Variety In Results from us in 2007 has background about how Google began hosting stories from several wire services, including the AP, on its own site.
The AP, in particular, wanted its stories hosted. Josh Cohen Of Google News On Paywalls, Partnerships & Working With Publishers from us last November explains more about this, how Google makes use of AP content under the current agreement and some of the issues that have come up in talks to strike a new deal.
It’s important to note that AP stories may still appear within Google. Many newspapers carry AP content, and those papers continue to be listed. So you can find AP stories hosted on newspaper sites. You just won’t find them hosted within Google itself.
On a somewhat related note, Rupert Murdoch’s News Corporation has threatened to block major search engines, including Google, from crawling its news content. Some see the first shot in that threat being fired as UK-based The Times is now blocking the UK-based NewsNow search engine.
As it turns out, this seems largely unrelated to Murdoch’s complaints with Google. Instead, it focuses on NewsNow providing a commercial service that allows companies to monitor the news. The UK’s Newspaper Licensing Agency wants to charge companies that provide this type of service. While Murdoch’s Times hasn’t joined that push, which is under review by UK authorities, it has restricted NewsNow for the same reasons. PaidContent provides an excellent rundown on the situation in these articles:
- Murdoch Paper Blocks UK Aggregator Before Paywall Goes Up
- UK Newspapers Suspend ‘Link Tax’ Bills To End Users
Technically, NewsNow doesn’t have to obey the restrictions blocking it in the robots.txt file at The Times. It’s not a legally binding protocol. But respected crawlers do obey it, which is one reason why there have been so few lawsuits over crawling.
That robots.txt file is farcical in one respect. At the top, it says this:
```
#Robots.txt File
#Version: 0.8
#Last updated: 04/01/2010
#Site contents Copyright Times Newspapers Ltd
#Please note our terms and conditions http://www.timesonline.co.uk/section/0,,497,00.html
#Spidering is not allowed by our terms and conditions
#Authorised spidering is subject to permission
#For authorisation please contact us - see http://www.nisyndication.com/about_us.html
```
A robots.txt file is designed to be read automatically by a machine. Any line that begins with a # symbol is a comment that machines ignore. There’s no way for Google or any search engine to automatically learn from this file that “Authorised spidering is subject to permission.”
The way they actually know this is if the robots.txt file uses an accepted command to block them — which it doesn’t, except for NewsNow and some smaller crawlers.
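To see why those comments are meaningless to a crawler, here’s a small sketch using Python’s standard-library `urllib.robotparser`. The file contents are hypothetical (modeled loosely on The Times’s file, not copied from it): the comment lines declaring that spidering is forbidden have no effect, while an actual `User-agent` / `Disallow` directive does.

```python
# Sketch: how a crawler interprets a robots.txt file.
# The file below is a hypothetical example, not The Times's actual file.
from urllib.robotparser import RobotFileParser

robots_txt = """\
#Spidering is not allowed by our terms and conditions
#Authorised spidering is subject to permission

User-agent: NewsNow
Disallow: /

User-agent: *
Disallow:
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# The comment lines asserting "spidering is not allowed" are ignored.
# Only the directives matter: NewsNow is blocked, everyone else is not.
print(parser.can_fetch("Googlebot", "http://example.com/news"))  # True
print(parser.can_fetch("NewsNow", "http://example.com/news"))    # False
```

In other words, a crawler that dutifully parses this file would still conclude it is welcome — unless it happens to be NewsNow.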
I highly doubt Google, Yahoo or Bing have actually asked The Times for permission to crawl them. But I have an email out to The Times and News Corporation to find out. Of course, if any human had tried to follow the link to seek authorization, they would have gotten an error. That link doesn’t work.