Head-To-Head: ACAP Versus Robots.txt For Controlling Search Engines
In the battle between search engines and some mainstream news publishers, ACAP has been lurking for several years. ACAP — the Automated Content Access Protocol — has constantly been positioned by some news executives as a cornerstone to reestablishing the control they feel has been lost over their content. However, the reality is that publishers have more control even without ACAP than is commonly believed by some. In addition, ACAP currently provides no “DRM” or licensing mechanisms over news content. But the system does offer some ideas well worth considering. Below, a look at how it measures up against the current systems for controlling search engines.
ACAP started development in 2006 and formally launched a year later with version 1.0 (see ACAP Launches, Robots.txt 2.0 For Blocking Search Engines?). This year, in October, ACAP 1.1 was released and has been installed by over 1,250 publishers worldwide, says the organization, which is backed by the European Publishers Council, the World Association of Newspapers and the International Publishers Association.
If that sounds pretty impressive, hang on. I’ll provide a reality check in a moment. But first, let’s pump ACAP up a bit more. Remember back in July, when the Hamburg Declaration was signed by about 150 European publishers? The short declaration basically said that intellectual property protection needs to be increased on the internet, in order to protect high-quality journalism.
ACAP: Save Our Content!
Enter ACAP, as a lynchpin to achieving the Hamburg Declaration’s dream. From the official release put out by the European Publishers Council, which organized the declaration:
We need search engines to recognize ACAP as a step towards acknowledging that content providers have the right to decide what happens to their content and on what terms. The European Commission and other legislators call on our industry constantly to come up with solutions – here we have one and we call upon the regulators to back it up.
That quote is from Gavin O’Reilly, president of the World Association of Newspapers and News Publishers, Group CEO of Independent News & Media and chairman of ACAP.
To me, it reads like it’s the Wild West out on the internet. That search engines are doing whatever they want with content, with publishers having no control over what happens. ACAP would bring rules to the search engines, and those rules would have the force of law if some governmental bodies would force them on the search engines.
The Wild West Is Actually Tame
The reality is that the search engines do follow rules, ones they’ve created and enhanced over the past 15 years based on feedback from the entire web community (rather than from a select group of largely disgruntled news publishers). Moreover, in all this time there have been relatively few lawsuits over how search engines interact with news content. Only one really stands out in my mind, the case won by Belgian newspapers over being included in Google News.
It was an unnecessary lawsuit. The papers could have stayed out of Google News using existing controls. In fact, despite “winning” the lawsuit, the papers eventually sought reinclusion in Google News using existing standards (see Belgian Papers Back In Google; Begin Using Standards For Blocking).
Meet REP: On The Beat For 15 Years
What are those existing standards? Collectively, they’re called the “Robots Exclusion Protocol” or REP for short. REP is made up of:
- Robots.txt: created in 1994 as a way to block content on a server-wide basis using a single file (the robots.txt file)
- Meta Robots Tag: created in 1996 as a system to block on a page-by-page basis (see Meta Robots Tag 101: Blocking Spiders, Cached Pages & More for more about it)
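For the technically inclined, here is a rough Python sketch of how a crawler might read the page-by-page half of REP. The MetaRobotsParser class, the page_directives helper and the sample page are all hypothetical names I made up for illustration; this is not how any particular search engine does it.

```python
from html.parser import HTMLParser


class MetaRobotsParser(HTMLParser):
    """Collects directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        # Attribute values can be None (e.g. <meta name>), so default to "".
        attr_map = {name.lower(): (value or "") for name, value in attrs}
        if attr_map.get("name", "").lower() == "robots":
            for item in attr_map.get("content", "").split(","):
                item = item.strip().lower()
                if item:
                    self.directives.add(item)


def page_directives(html):
    """Return the set of meta robots directives found in an HTML string."""
    parser = MetaRobotsParser()
    parser.feed(html)
    return parser.directives


# Hypothetical page that asks not to be indexed or cached.
sample_page = """
<html><head>
<meta name="robots" content="noindex, noarchive">
</head><body>Example</body></html>
"""

print(sorted(page_directives(sample_page)))
```

A crawler following this approach would check the returned set before deciding whether to index the page or keep a cached copy.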
The two standards both live at robotstxt.org, but they’ve never been updated there, nor is there any type of official group or organization behind REP. Instead, search engines have either unilaterally or collectively expanded what REP can do over the years. They serve as the de facto bosses of REP, Google in particular. If Google makes a change, other search engines often mimic it.
I used “robots.txt” in the headline of this article mainly because that’s often used by those who live and breathe this stuff as a common name for both parts of REP. But I’ll be sticking with REP for the rest of this article.
Some ACAP In Action
Enough of the preamble and background. Let’s roll our sleeves up and see how the two systems compare, starting with something easy. How can you block ALL your pages from ALL search engines using REP? You’d make a two-line robots.txt file like this:
User-agent: *
Disallow: /
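If you want to see how a well-behaved crawler would read that file, Python's standard library ships a basic robots.txt parser. A minimal sketch, where ExampleBot and example.com are placeholders of mine rather than anything real:

```python
from urllib.robotparser import RobotFileParser

# The two-line "block everything" file, one directive per line.
robots_txt = """\
User-agent: *
Disallow: /
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# ExampleBot and example.com are placeholders, not a real crawler or site.
for url in ("http://example.com/", "http://example.com/news/story.html"):
    print(url, "allowed" if parser.can_fetch("ExampleBot", url) else "blocked")
```

Every URL on the site comes back as blocked for every crawler name, which is all the two lines are meant to say.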
How would you do it in ACAP? Again, just two lines:
ACAP-crawler: *
ACAP-disallow-crawl: /
Sounds easy enough to use ACAP, right? Well, no. ACAP, in its quest to provide as much granularity to publishers as possible, offers what I found to be a dizzying array of choices. REP explains its parts on two pages. ACAP’s implementation guide alone (I’ll get to links on this later on) is 37 pages long.
But all that granularity is what publishers need to reassert control, right? Time for that reality check. Remember those 1,250 publishers? Google News has something like over 20,000 news publishers that it lists, so relatively few are using ACAP. ACAP also positions itself as (I’ve bolded some key parts):
an open industry standard to enable the providers of all types of content (including, but not limited to, publishers) to communicate permissions information (relating to access to and use of that content) in a form that can be readily recognized and interpreted by a search engine (or any other intermediary or aggregation service), so that the operator of the service is enabled systematically to comply with the individual publisher’s policies.
Well, anyone with a web site is a publisher, and there are millions of web sites out there. Hundreds of millions, probably. Virtually no publishers use ACAP.
Even ACAP Backers Don’t Use ACAP Options
Of course, there’s no incentive to use ACAP. After all, none of the major search engines support it, so why would most of these people do so? OK, then let’s look at some people with a real incentive to show the control that ACAP offers. Even if they don’t yet have that control, they can still use ACAP now to outline what they want to do.
Let’s start with the ACAP file for the Irish Independent. Don’t worry if you don’t understand it, just skim, and I’ll explain:
##ACAP version=1.0
# Allow all
User-agent: *
Disallow: /search/
Disallow: /*.ece$
Disallow: /*startindex=
Disallow: /*from=*
Disallow: /*service=Print
Disallow: /*action=Email
Disallow: /*comment_form
Disallow: /*r=RSS
Sitemap: http://www.independent.ie/sitemap.xml.gz
# Changes in Trunk
ACAP-crawler: *
ACAP-disallow-crawl: /search/
ACAP-disallow-crawl: /*.ece$
ACAP-disallow-crawl: /*startindex=
ACAP-disallow-crawl: /*from=*
ACAP-disallow-crawl: /*service=Print
ACAP-disallow-crawl: /*action=Email
ACAP-disallow-crawl: /*comment_form
ACAP-disallow-crawl: /*r=RSS
OK, see that top part? Those are actually commands using the robots.txt syntax. They exist because if a search engine doesn’t understand ACAP, the robots.txt commands serve as backup. Basically those lines tell all search engines not to index various things on the site, such as print-only pages.
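Those Disallow lines lean on the wildcard extension to robots.txt. As a rough sketch of how such patterns might be matched, assume "*" stands for any run of characters and a trailing "$" pins the match to the end of the URL. The pattern_to_regex and is_blocked functions and the test paths are hypothetical, written only to illustrate the idea:

```python
import re


def pattern_to_regex(pattern):
    """Translate a robots.txt path pattern into a compiled regex.

    Assumes the wildcard extension: '*' matches any run of characters
    and a trailing '$' anchors the match at the end of the URL.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape the literal pieces and join them with "match anything".
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile("^" + body + ("$" if anchored else ""))


def is_blocked(path, disallow_patterns):
    """Return True if any Disallow pattern matches the start of the path."""
    return any(pattern_to_regex(p).match(path) for p in disallow_patterns)


# Three of the patterns from the file above; the paths are made up.
patterns = ["/search/", "/*.ece$", "/*service=Print"]

for path in (
    "/search/?q=acap",
    "/news/story.ece",
    "/news/story.html?service=Print",
    "/news/story.html",
):
    print(path, "blocked" if is_blocked(path, patterns) else "allowed")
```

Under those assumptions, the first three made-up paths are blocked and the plain story page is not.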
Now the second part? This is where ACAP gets to shine. It’s where the Irish Independent (part of the media group run by ACAP chairman Gavin O’Reilly) gets to express what they wish search engines would do, if they’d only recognize all the new powers that ACAP provides. And what do they do? EXACTLY the same blocking that they do using robots.txt.
So much for demonstrating the potential power of ACAP.
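You can check this kind of duplication mechanically. Here is a small sketch that pulls the blocked paths out of each half of a combined file and compares them. The split_rules function is hypothetical, it ignores which crawler each rule applies to (a simplification), and the sample is an abbreviated excerpt of the file above:

```python
def split_rules(file_text):
    """Return (robots_paths, acap_paths) found in a combined file.

    Simplification: rules are pooled regardless of which crawler
    section they appear under.
    """
    robots_paths, acap_paths = set(), set()
    for raw_line in file_text.splitlines():
        line = raw_line.split("#", 1)[0].strip()  # drop comments
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "disallow":
            robots_paths.add(value)
        elif field == "acap-disallow-crawl":
            acap_paths.add(value)
    return robots_paths, acap_paths


# Abbreviated excerpt of the Irish Independent file shown above.
sample_file = """\
##ACAP version=1.0
# Allow all
User-agent: *
Disallow: /search/
Disallow: /*.ece$
Disallow: /*service=Print
Sitemap: http://www.independent.ie/sitemap.xml.gz
ACAP-crawler: *
ACAP-disallow-crawl: /search/
ACAP-disallow-crawl: /*.ece$
ACAP-disallow-crawl: /*service=Print
"""

robots_paths, acap_paths = split_rules(sample_file)
print("ACAP adds nothing new:", acap_paths <= robots_paths)
print("Identical rule sets:", acap_paths == robots_paths)
```

For this excerpt, both checks come back true: the ACAP half blocks exactly what the robots.txt half already blocks.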
Well, how about the Wall Street Journal, backed by Rupert Murdoch, who’s been on an anti-Google bent of late? Same situation: the WSJ’s ACAP file is doing nothing more than what the robots.txt commands show. Actually, it does less. At least the robots.txt system allows for discovery of a sitemap file (more on this below).
How about the Denver Post? It doesn’t have an ACAP file, just a plain old regular robots.txt file. Why’s that significant? Dean Singleton, the CEO of the media company that owns the Denver Post, recently suggested he’d pull some of his content out of Google (see Hold On: Are More Papers Really Joining Murdoch’s Google Block Party?).
Singleton is also chairman of the Associated Press, which has been very anti-Google of late and which also is a backer of ACAP. So if ACAP allows the expression of control that publishers somehow don’t currently have, I’d expect the Denver Post to be among the poster children alongside the Irish Independent and the Wall Street Journal.
Well, how about the Troy Daily News, which is one of the organizations that ACAP proudly lists as using its system? What’s happening with a rank-and-file publisher? From its ACAP file:
User-agent: *
Disallow: /private.asp
Disallow: /SiteImages/
Crawl-delay: 10
Request-rate: 1/10 # maximum rate is one page every 10 seconds
Visit-time: 0500-0845 # (GMT) only visit between 1:00 AM and 3:45 AM EST
#------------------------------------
##ACAP version=1.0
ACAP-crawler: *
ACAP-disallow-crawl: /private.asp
ACAP-disallow-crawl: /SiteImages/
#-----------------------------------------
Again, ACAP isn’t being used to express anything more than what’s already indicated in the robots.txt commands (the first section). Again, robots.txt actually goes beyond, as there’s support for a “crawl-delay” directive that ACAP doesn’t have.
That “request-rate” and “visit-time” telling search engines only to come by in the early morning hours? Have a chuckle at that. None of the major search engines recognize those commands. Similarly, visit the Hilton.com robots.txt file where you’ll see a similar but totally unrecognized command: “Do not visit Hilton.com during the day!”
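To show what a tool that does choose to honor these lines would pull out of them, here is a sketch using the parser in Python's standard library. It is only a stand-in: it says nothing about what the major search engines actually read, and ExampleBot and example.com are placeholders of mine. This parser has no accessor for visit-time at all, so that line is simply skipped.

```python
from urllib.robotparser import RobotFileParser

# The robots.txt half of the Troy Daily News file, comments removed.
robots_txt = """\
User-agent: *
Disallow: /private.asp
Disallow: /SiteImages/
Crawl-delay: 10
Request-rate: 1/10
Visit-time: 0500-0845
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# ExampleBot is a placeholder crawler name.
delay = parser.crawl_delay("ExampleBot")
rate = parser.request_rate("ExampleBot")

print("crawl delay (seconds):", delay)
print("request rate:", rate.requests, "request(s) per", rate.seconds, "seconds")
print("private.asp allowed:", parser.can_fetch("ExampleBot", "http://example.com/private.asp"))
```

Whether a crawler waits the requested seconds between fetches is entirely up to the crawler; the file can only ask.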
Side-By-Side, REP & ACAP
OK, so even though no one’s using the special ACAP controls, let’s at least look at some of the key features and see how special they supposedly are. The table below lays out what REP offers against ACAP.
In parentheses, I’ve noted the key commands used in both systems, for the technically inclined. Links lead to more information, as appropriate. Further below the chart, I’ve added more explanations as necessary.
Because REP has been extended by the major search engines, I’ve counted some areas as “Yes” for support if at least Google provides an option (given it has the largest marketshare of all). I’ve also noted the situation with Bing. As Yahoo search technology is slated to be acquired by Bing, I didn’t itemize its control offerings, since these will
For specific technical details on ACAP, see technical documents here. The easiest to comprehend is the implementation guide of Oct. 13, 2009. Also see the two crawler communication parts, if you want to dive in further.
|Block all search engines||Yes|
|Block specific search engines|
(for example, block Google but not Bing)
|Block crawling of all pages||Yes|
|Block crawling of specific pages||Yes|
|Block crawling of specific sections of web site||Yes|
(directory matching & named “resource sets”)
|Block crawling via pattern matching or “wildcards”||Yes: Google & Bing|
(see note, below)
(use block crawling commands)
(Indexing content different than what human visitors see)
(Google views cloaking as spam;
Bing frowns upon it but doesn’t ban just for cloaking)
|Block following links|
(doesn’t prevent finding links in other ways)
(nofollow, meta tag only)
|Block making cached pages||Yes|
(noarchive, meta tag only)
|Block showing cached pages||Yes|
(use block making cached pages command)
|Block snippets / descriptions / quotes||Yes: Google|
(nosnippet, meta tag only)
No: REP & Bing
|Set maximum length for snippets||No|
(At Bing, nopreview meta tag blocks hover preview)
|Set exact snippet to be used||Partial|
(meta description tag)
|Block thumbnail images||Partial|
(just block images)
|Block link to site||Yes: Google|
(noindex, meta tag only)
No: REP & maybe Bing
|Prevent format conversion|
(say HTML to PDF)
|Prevent translation||Yes: Google|
(notranslate, meta tag only)
No: REP & probably Bing
|Prevent annotations such as ratings||No||Yes|
|Urgent page removal||No||Yes|
|Block specific parts of a page||No|
(though Yahoo has robots-nocontent attribute)
(indicate places like specific IP addresses or countries where content can be listed)
(permittedcountrylist & others)
(such as remove after set number of days)
(unavailable_after, meta tag only)
No: REP & Bing
| Canonical Tag|
(Indicate “main URL” to be used in cases of same content on multiple URLs)
(Provide list of all URLs to be crawled)
|http x-robots tags|
(attach blocking to file headers, not within files)
(Bing might inherit Yahoo Search Monkey)
(slows crawling speed for slow servers)
(Google: crawl-rate option in Webmaster Central;
Bing: crawl-delay meta tag
(remove tracking that can cause duplicate content issues)
(Yahoo also offers)
OK, that’s the big chart. As you can see, there are some things that both systems provide and some things that are unique to each one. Here’s my personal take on the differences:
Jeers To ACAP!
Block Indexing: ACAP makes a weird distinction between blocking crawling (a search engine literally going from page to page automatically) and indexing (a search engine making a copy of the page, so that it can be added to a searchable index). For the major search engines, crawling and indexing are one and the same. I struggle to see an advantage to separating these out.
Cloaking: Those savvy to search engines know that Google hates cloaking, which is the act of showing a search engine something different than a human being would see. It’s often associated with spam. There are plenty of cases where people have shown misleading content to a search engine, in hopes of getting a good ranking. One example is from 1999, when the FTC took action against a site that was cloaking content that ranked for “innocent” searches like Oklahoma tornadoes and instead directed them to porn sites. The idea of a publisher forcing a search engine to allow cloaking would be somewhat similar to a newspaper being forced to write whatever a subject demanded be written about them.
Exact Snippet To Be Used: Similar to cloaking, allowing site owners to say whatever they want about a page sounds great if you’re an honest site owner. When you’re a search engine that knows how people will mislead, it’s not so appealing. In addition, sometimes it’s helpful to create a description that shows what someone searches for in context — and that doesn’t always happen with a publisher-defined description.
Annotation Blocking: It’s hard to interpret how this would work. Is Google’s SideWiki an annotation system, where comments are left alongside a publisher’s content but in a separate window? Or does this mean annotations on Google itself, such as SearchWiki allows? And should publishers be allowed to block people commenting about their pages on other sites? Does that block places like Yelp from reviewing businesses, if they link to them? Is Digg a ratings service? This option is a minefield.
Kudos To ACAP!
Maximum Snippet Length: Search engines are quoting more and more material from pages these days, it seems. The ability to limit how much they can use seems like a good idea that should be considered.
Meta Tag Only Commands: A number of controls such as blocking caching or snippets can’t be done in a single file. Now, for those using CMS systems, including free ones like WordPress, it’s relatively easy to add these codes to each and every page. But it would be nice to see the search engines add file-wide support for some of these options in the way that ACAP does.
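As a rough idea of what adding these codes to every page might look like inside a CMS template, here is a tiny Python sketch. The meta_robots_tag helper is hypothetical and not part of WordPress or any other CMS; it just builds the tag from whatever directives you pass in.

```python
from html import escape


def meta_robots_tag(*directives):
    """Build a meta robots tag from directive names, or "" if none given."""
    cleaned = [d.strip().lower() for d in directives if d.strip()]
    if not cleaned:
        return ""
    content = escape(", ".join(cleaned), quote=True)
    return '<meta name="robots" content="{}">'.format(content)


# Ask search engines not to cache the page or show a snippet for it.
print(meta_robots_tag("noarchive", "nosnippet"))
```

A template would drop the returned string into the head section of each page it renders.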
Prevent Framing: I hate framing. I’d love to see a way to tell automated tools like URL shorteners that they can’t frame. But with the search engines, framing is pretty limited. Google does it with images, and you can block images from being indexed period, which eliminates framing. It does the same with cached pages, and you can block caching. Plus, it’s fairly easy for a site owner to block any type of framing.
Urgent Removal: If you’re a site owner, a system to get pages out of an index in a guaranteed period of time would be very convenient. However, this is probably better handled through webmaster tools that the search engines offer, as they allow a site owner to proactively trigger a removal, rather than waiting for a visit from a crawler, which could take days. Ironically, at Google, they had a system to remove pages quickly. I wrote about it two years ago (see Google Releases Improved Content Removal Tools). But the documentation today is terrible. Little is explained if you’re not logged in. If you are logged in, the link for the webmaster version doesn’t work. The entire feature Google described in 2007 is gone.
Block Specific Parts Of Page: Who wants all their navigation being indexed, along with all the other crud pages often have on them? ACAP allows for only parts of a page to be indexed. Yahoo already offers this. Why not the others?
Permitted Places: The idea here is that you could allow your story to be listed in Google UK but not Google France, if you wanted. It might not be a bad idea, though that’s not usually the demand I tend to hear. Instead, site owners often are trying to figure out how to associate their sites with a particular country (Google has a tool for this).
Time Limit: You can time restrict when a page should be removed, a cached copy should be removed and more with ACAP. Google has some support here, though few use it, the search engine tells me. It also seems unnecessary. It seems far more efficient for a site owner to simply remove their own content from the web or block spidering, when ready. In either case, that causes it to drop from a search engine.
Jeers To Search Engines!
I think the biggest frustration in compiling this article was knowing that search engines do offer much control to publishers but finding the right documentation is hard. At Google, you can block translation, but it was difficult to find this page in the help pages offered to site owners. Bing has a way to block previews, but I couldn’t locate this within its help center. Google has a blog post saying that Bing supports the nosnippet tag. Over at Bing, I couldn’t find this documented. FYI, Jane & Robot has a good guide that can help with those trying to understand all that’s allowed.
Cheers To Search Engines!
ACAP has focused on a publisher wishlist of options that often can be done in other ways. Don’t want thumbnail images in a search engine? OK, we’ll make a command, even though just blocking images would solve that problem.
In contrast, the search engines have added features that have come from the outcries of many diverse site owners. Sitemaps, to provide a list of URLs for indexing. Crawl delay support. Richer snippets. Duplicate content tools, such as the canonical tag or parameter consolidation. They deserve far more credit than some news publishers give to them.
The Missing Part: Licensing
Did you catch the biggest option that ACAP does NOT provide? There’s no licensing support.
Remember how ACAP’s O’Reilly talked about how ACAP was needed to ensure “that content providers have the right to decide what happens to their content and on what terms.” ACAP really doesn’t provide that much more control than what’s out there now. It doesn’t give publishers significantly more “rights.” I mean, how many more “rights” can you have when you’ve got the nuclear option of fully withdrawing from a search engine at any time?
It’s the part I bolded that’s key, the “what terms” portion. ACAP is supposed to somehow support new business models. Part of the idea is that you might want to license your headlines to one search engine, your thumbnails to another, and this would all be bundled up in some partnership deal. To quote from the ACAP FAQ:
Business models are changing, and publishers need a protocol to express permissions of access and use that is flexible and extensible as new business models arise. ACAP will be entirely agnostic with respect to business models, but will ensure that revenues can be distributed appropriately. ACAP presents a win win for the whole online publishing community with the promise of more high quality content and more innovation and investment in the online publishing sector. ACAP is for the large as well as the small and even the individuals. It will benefit all content providers whether they are working alone or through publishers. A future without publishers willing and able to invest in high quality content and get a return on that investment is a future without high-quality content on the net.
Nothing in the ACAP specs I’ve gone through provides any type of revenue distribution mechanism, much less some type of automated handshake between a publisher and a search engine to verify permissions.
If REP & ACAP Files Could Talk
To illustrate this better, here’s a “real world” conversation of how ACAP supposedly works. I shared this recently with others on the Read 2.0 mailing list that I’m part of during a discussion, playing off some other conversation scenes that John Mark Ockerbloom had started. Several people said they found it helpful. Perhaps you will, too.
Hi! I’m Google. Can you tell me if I can crawl your site?
Sure, but I might have some restrictions over what you can do.
That’s cool. Just use a meta robots tag on particular pages to give me specific commands.
Well, on this page, I don’t want you to show a cached copy.
Awesome, use the noarchive command. Done. What’s next?
On this page, you must always show the description I want shown for it.
Use the meta description tag. We’ll consider that, but we can’t promise.
Dammit. You just want to rule the world.
Look, we build descriptions that are related to what someone searched for, dynamically. So if we find a page on your site, in response to a particular keyword, sometimes it makes sense to “snip” a description that contains that term from your page, so they immediately understand why your page is relevant to their search. And click to view it. That’s why we call them snippets.
Dammit. Do what I want. You’re not the boss of me.
Well, we also get people who would say they have children’s games when instead, they have adult games — like porn. Seriously, true story. Plus, we’re the boss of us. I mean, is it OK if we declare that you must review us in the way we want in your publication?
Let’s move on. On this page, I don’t want any images to be used.
Block them with robots.txt. Done.
This article, I only want you to list it for 30 days.
Pull it down after 30 days. Or move the full article to a new location, and leave a summary page up, if you want remnant traffic. Or block it. Or use the unavailable_after meta tag.
I only want you to list this content if I have a paid partnership with you. My ACAP file will declare that to you.
You have a paid partnership with us?
Well, not yet. But Murdoch’s promising us that will come.
If you have a paid partnership with us, to give us permission to index your content, we know that internally. I mean, we don’t have many of those, and we’re not scanning the web and ACAP files to keep track of them. ACAP doesn’t even have a place for you to tell us this, anyway.
I don’t have a partnership. But I’m saying you should only index my content if you DO have a partnership. But you keep indexing it.
Well, then block us. Surely you know if we don’t have a partnership or not. And you can use robots.txt to authorize indexing all you want.
But I want you to license our content!
Yeah, we get that. Hey, check it out, have you seen our free wifi at airports?
ACAP Not A Business Solution; Search Engines, Get Organized!
Overall, there are some ideas in ACAP that would be useful for the search engines to consider. However, there are many ideas outside of ACAP that would also be useful for them to consider. There’s nothing I see within ACAP that provides some type of crucial control that if only news publishers had, all their online woes would be over. What the news publishers really want are licensing agreements, and given that Google already has several of these without using ACAP (see Josh Cohen Of Google News On Paywalls, Partnerships & Working With Publishers), I can’t see that having it somehow advances any business model changes.
Certainly the search engines need to get their act together more, however. It’s time to stop referring people to the REP site which is run by no one. It’s time to stop having a myriad of help pages scattered about within their respective sites. Yes, they should continue to have their own help pages (see Google’s webmaster help from here; Bing’s from here). But I’d like to see Google and Microsoft take the lead to also consolidate material into a common site, perhaps building off Sitemaps.org.