ACAP Launches, Robots.txt 2.0 For Blocking Search Engines?

After a year of discussions, ACAP — Automated Content Access Protocol — was released today as a sort of robots.txt 2.0 system for telling search engines what they can or can’t include in their listings. However, none of the major search engines support ACAP, and its future remains firmly one of "watch and see." Below, more about the how and why of ACAP.

Let’s start with some history. ACAP got going in September 2006, backed by major European newspaper and publishing groups that felt Google in particular was using content without proper permission, and that wanted a more flexible means of granting or denying it than the long-standing robots.txt and meta robots standards allow.

These two standards are documented at robotstxt.org, and ACAP has often referred to them as the "Robots Exclusion Protocol" or REP, though within the SEO world, they’re generally known by their actual names.

Robots.txt was born in 1994 as a way to block content on a server-wide basis; meta robots emerged in 1996 as a system to block on a page-by-page basis (see Meta Robots Tag 101: Blocking Spiders, Cached Pages & More for more about it). Neither has been formally updated since, at least not in the sense of the search engines coming together to agree on new universal standards. In short, REP has no "guardians" or group to take it forward.
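For reference, both mechanisms are simple. Robots.txt is a plain text file at the root of a server, while meta robots is a tag in the head of an individual page. A minimal example of each (the path is made up for illustration):

    # robots.txt -- blocks crawling on a server-wide basis
    User-agent: *
    Disallow: /private/

    <!-- meta robots -- blocks indexing on a page-by-page basis -->
    <meta name="robots" content="noindex, nofollow">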

Enter ACAP. If the search engines weren’t going to improve robots.txt, the aforementioned publishers decided they’d take on the challenge. Of course, creating a standard for search engine indexing is kind of a waste of time if you don’t have the search engines themselves actually supporting it. But ACAP didn’t let that be a deterrent. Over the past year, it has had a working group setting up a new system, with search engines Google and Ask.com, along with Exalead, taking part in the discussions. FYI, I’ve not been an active working member, but I’ve been included on the working group’s emails and chimed in from time to time with advice and thoughts.

The ACAP System

The new system has now arrived, unveiled at the ACAP conference in New York today. Before getting into support, let’s cover what’s in it. You’ll find an overview page for the specifications here, which leads to:

  • A robots.txt-to-ACAP conversion tool (don’t worry; the converted file should still work as a regular robots.txt file while doubling as an ACAP file)
     
  • ACAP extensions to use with robots.txt (here, PDF file)
     
  • ACAP extensions to use with meta robots (here, PDF file)
     
  • ACAP logo for those who want to show they’re using ACAP (not required to make ACAP work, but expect publishers pushing ACAP to make use of it)

What does ACAP provide that robots.txt and meta robots do not? After going through the technical specs, which are pretty dense reading, I’d summarize it this way:

  • Emphasis on both granting permissions and blocking
     
  • Support for time-based inclusion or exclusion

That’s it. Discussions have covered concepts such as how password-protected content could be indexed, or whether you could issue permissions on a country-by-country basis, but some of these ideas haven’t made it into the first cut.
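To give a feel for how those two additions look in practice, here’s a rough sketch of ACAP-style robots.txt extensions combining an explicit grant, a block, and a time limit. The directive names and the time qualifier are my approximation, not copied from the spec, and the paths are invented; check the specification documents linked above for the real syntax.

    # Illustrative ACAP-style robots.txt extensions (approximate syntax)
    ACAP-crawler: *
    # Explicit grant: yes, you may index this section
    ACAP-allow-index: /news/
    # Explicit block: no, you may not index this section
    ACAP-disallow-index: /archive/
    # Time-based exclusion: the indexed copy should expire after a set period
    ACAP-disallow-index: /breaking/ time-limit=7-days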

AP has a good overview article about the ACAP launch, and I found the companion piece a handy summary if you’re looking for faster specifics. A key part:

Some search engines have interpreted "disallow" to mean that the site cannot be added to the index but could be fetched for use in various algorithms employed to determine how high a site appears in search results….ACAP proposes to clarify that "disallow" refers to indexing.

A separate "crawl" command would be added to bar the indexing software or crawler entirely.

In addition, Web sites would be able to add qualifiers stipulating that the information expires from the search index on a specific date, in a given number of days or whenever the crawler returns to the site.

A "follow" command would permit or block the crawler from following links within a page.

"Preserve," with similar time limits available for "index," would stipulate whether a copy may be stored in a search engine’s cache.

"Present" would govern a search engine’s ability to display the copy, and a site may limit that further — for example, to a snippet or to a miniaturized version, or thumbnail.

As I said, there’s an emphasis on granting permission. By default, search engines assume everything is open to indexing. ACAP changes this assumption, asking those who create the files to explicitly indicate yes or no.
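For comparison, the blocking half of several of those usages already exists as individual search engine extensions to the meta robots tag; what ACAP adds is the explicit "yes" side and a common vocabulary. Roughly, noindex parallels "index," nofollow parallels "follow," noarchive parallels "preserve," nosnippet parallels "present," and Google’s unavailable_after parallels the time-based expiry. For example (the date is just a placeholder):

    <!-- Existing meta robots directives that roughly parallel ACAP’s usages -->
    <meta name="robots" content="noindex, nofollow, noarchive, nosnippet">
    <!-- Google-specific extension for time-based expiry from the index -->
    <meta name="googlebot" content="unavailable_after: 31-Dec-2008 23:59:59 GMT">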

Should You Use It?

So now we have a new standard for expressing search engine permissions. Do site owners need to run out and immediately use it?

No. Not immediately. Not even long term.

Right now, none of the major search engines are supporting ACAP. If you were to use ACAP without ensuring that standard robots.txt or meta robots commands were also included, you’d fail to properly block search engines. Only Exalead, which is not a major multi-country service, would currently act upon your ACAP-only commands.

Even if ACAP were to magically get endorsed and supported by all the major search engines, robots.txt and meta robots support wouldn’t go away for many years. There are simply too many sites that use those systems, have used them for over a decade, and would fail to upgrade. Those two systems will continue to be supported in the same way Microsoft has had to support DOS programs despite the growth of Windows.
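In practice, that means any ACAP experiment should sit alongside, not replace, the standard commands. A file along these lines keeps working everywhere, because crawlers simply skip lines they don’t recognize (the ACAP directive names are, again, illustrative):

    # Standard rules every search engine understands
    User-agent: *
    Disallow: /subscribers/

    # ACAP-style equivalents (illustrative syntax); engines without
    # ACAP support ignore these lines, so nothing breaks
    ACAP-crawler: *
    ACAP-disallow-index: /subscribers/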

So why bother at all? Probably two reasons:

  • You want to personally test out how ACAP works, playing with the permissions and seeing what happens in Exalead
     
  • You want to support the ACAP system and hope that if enough people use it, perhaps the search engines will adopt it. FYI, ACAP is urging (PDF file) "universal adoption" by publishers by the end of next year.

Search Engine Support

What’s up with the major services? I emailed the big three, Google, Microsoft, and Yahoo, all of whom either took part in the working group or are at today’s conference. Google’s canned answer:

We are interested in all initiatives that allow web publishers and search engines to work more closely together. We have undertaken many efforts in this direction over the years including supporting file-extension and wildcard specifications in robots.txt, SiteMaps, our Webmaster Console, extending per-item indexing specification to non-html documents and specifying how long a url would be available. We will examine ACAP proposals when they become available. As a broad-based search engine, we need to keep in mind the needs of millions of web publishers worldwide.

As it happens, I was at Microsoft yesterday, and while I haven’t gotten a formal statement to post, the sentiment was the same as Google’s. Microsoft is interested in supporting publishers generally, is continuing to grow its own tools, and will also be watching ACAP.

Yahoo hasn’t sent a statement back yet, but when it arrives, you can expect it will be much the same as Google’s and Microsoft’s.

Why not just jump into ACAP? Reading between the lines here: no one really wants to hand over control of the standard to the ACAP group, especially, in my view, when it was born out of some anti-search engine hype.

So why not jump behind improving robots.txt and meta robots? Another issue here is that no one is officially in charge of those standards. The search engines are sort of the gatekeepers, because it’s what they decide to support that effectively becomes "law." If they don’t support a particular exclusion command, it might as well not exist.

The various search engines tell me they have been talking more about making some collective improvements. Individually, they’ve already added extensions to both robots.txt and meta robots over the years, though those may work only with their particular engines. Perhaps they will become more unified.

In particular, they’ve united around the sitemaps standard. That picks up part of what ACAP does, in being a system for expressly granting permission to index, and it’s where I’d expect any search engine-driven, collective agreement on improved blocking tools to emerge.
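Sitemaps is the clearest precedent: a single format the major engines all agreed to honor for saying "yes, please index these URLs." A minimal sitemap file looks like this (the URL and dates are placeholders):

    <?xml version="1.0" encoding="UTF-8"?>
    <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
      <url>
        <loc>http://www.example.com/news/story.html</loc>
        <lastmod>2007-11-29</lastmod>
        <changefreq>daily</changefreq>
      </url>
    </urlset>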

Also be sure to read Up Close & Personal With Robots.txt, which summarizes the second robots.txt summit that I organized earlier this year. The article covers a lot of things that general site owners and SEOs have wished for, along with some search engine responses.

Conclusion

So has the entire ACAP project been a waste of time, or as Andy Beal’s great headline put it when ACAP was announced last year, Publishers to Spend Half Million Dollars on a Robots.txt File? That still makes me laugh.

No, I’d say not. I think it’s been very useful that some group has diligently and carefully tried to explore the issues, and having ACAP lurking at the very least gives the search engines themselves a kick in the butt to work on better standards. Plus, ACAP provides some groundwork they may want to use. Personally, I doubt ACAP will become Robots.txt 2.0 — but I suspect elements of ACAP will flow into that new version or a successor.
