
ACAP Launches, Robots.txt 2.0 For Blocking Search Engines?

Danny Sullivan on November 29, 2007 at 12:02 pm | Reading time: 8 minutes

After a year of discussions, ACAP —
Automated Content Access Protocol — was released today as a sort of
robots.txt 2.0 system for telling search engines what they can or can’t include
in their listings. However, none of the major search engines support ACAP, and
its future remains firmly one of "watch and see." Below, more about the how and
why of ACAP.

Let’s start with some history. ACAP got going in September 2006, backed by
major European newspaper and publishing groups. In particular, they felt Google
was using content without proper permission, and they wanted a more flexible
means of granting it than the long-standing robots.txt and meta robots
standards allow.

These two standards are documented at robotstxt.org, and ACAP often refers to
them as the "Robots Exclusion Protocol" or REP, though within the SEO world,
they’re generally known by their actual names.

Robots.txt was born in 1994 as a way to block content on a server-wide basis;
meta robots emerged in 1996 as a system to block on a page-by-page basis (see
Meta Robots Tag 101: Blocking Spiders, Cached Pages & More for more about it).
Neither has been updated since, in the sense of search engines coming together
to agree on new universal standards. In short, REP has no "guardians" or group
to take it forward.
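To make the server-wide versus page-by-page distinction concrete, here is a minimal Python sketch. It uses the standard library's urllib.robotparser to evaluate a robots.txt rule and html.parser to read a meta robots tag. The file contents, the page markup, the "ExampleBot" user agent, and the helper names are all made up for illustration; real crawlers may interpret edge cases differently.

```python
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: one file at the site root, rules keyed by URL path.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

# Hypothetical page: the meta robots tag applies only to the page carrying it.
PAGE_HTML = (
    '<html><head><meta name="robots" content="noindex, nofollow">'
    "</head><body>Hello</body></html>"
)


class MetaRobotsParser(HTMLParser):
    """Collects the directives found in <meta name="robots"> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        if tag == "meta" and (attr_map.get("name") or "").lower() == "robots":
            content = attr_map.get("content") or ""
            self.directives.update(
                part.strip().lower() for part in content.split(",") if part.strip()
            )


def robots_txt_allows(robots_txt, user_agent, url):
    """Return True if the robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


def meta_robots_directives(html):
    """Return the set of meta robots directives declared in an HTML page."""
    parser = MetaRobotsParser()
    parser.feed(html)
    return parser.directives


if __name__ == "__main__":
    print(robots_txt_allows(ROBOTS_TXT, "ExampleBot", "http://example.com/private/a.html"))
    print(robots_txt_allows(ROBOTS_TXT, "ExampleBot", "http://example.com/public/a.html"))
    print(sorted(meta_robots_directives(PAGE_HTML)))
```

The robots.txt check only needs the URL, while the meta robots check needs the page itself to be fetched first, which is why the two systems complement rather than replace each other.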

Enter ACAP. If the search engines weren’t going to improve robots.txt, the
aforementioned publishers decided they’d take on the challenge. Of course,
creating a standard for search engine indexing is kind of a waste of time, if
you don’t have the search engines themselves to actually support it. But ACAP
didn’t let that be a deterrent. Over the past year, it has had a working group
setting up a new system, with search engines Google and Ask.com, along with
Exalead, taking part in the discussions.
FYI, I’ve not been an active working member, but I’ve been included on the
working group’s emails and chimed in from time to time with advice and thoughts.

The ACAP System

Now the new system has arrived, being unveiled at the
ACAP conference in New York
today. Before getting into support, let’s cover what’s in it. You’ll find an
overview page for the specifications
here, which leads to:

A robots.txt-to-ACAP conversion tool (don’t worry; this should make your robots.txt file still work as a regular one and double as an ACAP file)
ACAP extensions to use with robots.txt (here, PDF file)
ACAP extensions to use with meta robots (here, PDF file)
ACAP logo for those that want to show they’re using ACAP (not required to make ACAP work, but expect publishers pushing ACAP to make use of it)

What does ACAP provide that robots.txt and meta robots do not? After going
through the technical specs, which are pretty dense reading, I’d summarize it
this way:

Emphasis on both granting permissions and blocking
Support for time-based inclusion or exclusion

That’s it. Discussions have covered concepts such as how password-protected
content could be indexed, or whether you could issue permissions on a
country-by-country basis, but some of these ideas haven’t made it into the first
cut.

AP has a nice overview article about the ACAP launch, and I found the companion
piece a nice summary if you’re looking for some faster specifics. A key part:

Some search engines have interpreted "disallow" to mean that the site
cannot be added to the index but could be fetched for use in various
algorithms employed to determine how high a site appears in search
results….ACAP proposes to clarify that "disallow" refers to indexing.

A separate "crawl" command would be added to bar the indexing software or
crawler entirely.

In addition, Web sites would be able to add qualifiers stipulating that the
information expires from the search index on a specific date, in a given
number of days or whenever the crawler returns to the site.

A "follow" command would permit or block the crawler from following links
within a page.

"Preserve," with similar time limits available for "index," would stipulate whether a copy may be stored in a search engine’s cache.

"Present" would govern a search engine’s ability to display the copy, and a
site may limit that further — for example, to a snippet or to a miniaturized
version, or thumbnail.

As I said, there’s an emphasis on granting permission. By default, search
engines assume everything is open to indexing. ACAP changes this assumption,
asking those that create the files to explicitly indicate yes or no.
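The time-limit qualifiers described in the AP summary can be pictured with a small Python sketch. This is not ACAP syntax and not how any search engine implements it: the IndexPermission class, its field names, and the choice to treat the expiry day itself as still valid are all assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional


@dataclass
class IndexPermission:
    """Hypothetical model of an explicit index permission with a time limit."""

    allowed: bool                        # explicit yes or no from the publisher
    expires_on: Optional[date] = None    # "on a specific date"
    max_age_days: Optional[int] = None   # "in a given number of days"


def may_keep_in_index(perm, crawled_on, today):
    """Return True if a page crawled on crawled_on may still be listed today.

    The "whenever the crawler returns" case needs no field here: a recrawl
    would simply produce a fresh IndexPermission and a new crawled_on date.
    """
    if not perm.allowed:
        return False
    if perm.expires_on is not None and today > perm.expires_on:
        return False
    if perm.max_age_days is not None:
        if today > crawled_on + timedelta(days=perm.max_age_days):
            return False
    return True


if __name__ == "__main__":
    crawled = date(2007, 11, 29)
    thirty_days = IndexPermission(allowed=True, max_age_days=30)
    print(may_keep_in_index(thirty_days, crawled, date(2007, 12, 15)))
    print(may_keep_in_index(thirty_days, crawled, date(2008, 1, 15)))
```

The point of the sketch is only that the permission carries an explicit yes or no plus an optional limit, so a search engine honoring it would have to recheck listings over time rather than decide once at crawl time.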

Should You Use It?

So now we have a new standard for expressing search engine permissions. Do
site owners need to run out and immediately use it?

No. Not immediately. Not even long term.

Right now, none of the major search engines are supporting ACAP. If you were
to use ACAP without ensuring that standard robots.txt or meta robots commands
were also included, you’d fail to properly block search engines. Only Exalead,
which is not a major multi-country service, would currently act upon your
ACAP-only commands.
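Here is a rough Python illustration of why ACAP-only commands block nothing today. Python's urllib.robotparser stands in for a crawler that understands only the classic robots.txt fields. The ACAP-style field names below are placeholders written in the spirit of the extension, not copied from the ACAP specification, and "ExampleBot" and the URL are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Placeholder ACAP-style lines; the exact field names are assumptions.
ACAP_ONLY = """\
ACAP-crawler: *
ACAP-disallow-crawl: /private/
"""

# The same file with the classic robots.txt rules kept alongside.
COMBINED = ACAP_ONLY + """\
User-agent: *
Disallow: /private/
"""


def classic_parser_allows(robots_txt, url, user_agent="ExampleBot"):
    """Evaluate a robots.txt body with a parser that knows only classic fields."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    url = "http://example.com/private/report.html"
    # Unknown fields are skipped, so the ACAP-only file blocks nothing.
    print(classic_parser_allows(ACAP_ONLY, url))
    # With the classic rules present, the same URL is blocked.
    print(classic_parser_allows(COMBINED, url))
```

A parser that does not recognize a field simply skips it, which is why keeping the standard commands next to any ACAP ones matters.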

Even if ACAP were to magically get endorsed and supported by all the major
search engines, robots.txt and meta robots support wouldn’t go away for many
years. There are simply too many sites that use those systems, have used them
for over a decade, and would fail to upgrade. Those two systems will continue to
be supported in the same way Microsoft has had to support DOS programs despite
the growth of Windows.

So why bother at all? Probably two reasons:

You want to personally test out how ACAP works, playing with the permissions and seeing what happens in Exalead
You want to support the ACAP system and hope that if enough people use it, perhaps the search engines will adopt it. FYI, ACAP is urging (PDF file) "universal adoption" by publishers by the end of next year.

Search Engine Support

What’s up with the major services? I emailed the big three, Google,
Microsoft, and Yahoo, all of whom either took part in the working group or are at
today’s conference. Google’s canned answer:

We are interested in all initiatives that allow web publishers and search
engines to work more closely together. We have undertaken many efforts in this
direction over the years including supporting file-extension and wildcard
specifications in robots.txt, SiteMaps, our Webmaster Console, extending
per-item indexing specification to non-html documents and specifying how long
a url would be available. We will examine ACAP proposals when they become
available. As a broad-based search engine, we need to keep in mind the needs
of millions of web publishers worldwide.

As it happens, I was at Microsoft yesterday, and while I haven’t gotten a
formal statement to post, the sentiment was the same as Google’s. Microsoft is
interested in supporting publishers, is continuing to grow its own tools and
will also watch ACAP, wanting to support publishers in general.

Yahoo’s not sent a statement back yet, but when it arrives, you can expect it
will be pretty much the same as Google and Microsoft.

Why not just jump into ACAP? Between the lines time here — no one really
wants to hand over control of the standard to the ACAP group, especially in my
view when it has been born out of some anti-search engine hype.

So why not jump behind improving robots.txt and meta robots? Another issue
here is that no one is officially in charge of those standards. The search
engines are sort of the gatekeepers, because it’s what they decide to support
that effectively becomes "law." If they don’t support a particular exclusion
command, it might as well not exist.

The various search engines tell me they have been talking more about making
some collective improvements. Individually, they’ve already added to both
robots.txt and meta robots over the years, extensions that may work with their
particular search engines. Perhaps they will become more unified.

In particular, they’ve united around the sitemaps standard. That sort of picks
up what ACAP does in terms of being a system to provide express permission of
indexing, and it’s where I’d expect any search engine-driven, collective
agreement about improved blocking tools to emerge.

Also be sure to read
Up Close & Personal With Robots.txt, which summarizes the second robots.txt
summit that I organized earlier this year. The article covers a lot of things
that general site owners and SEOs have wished for, along with some search engine
responses.

Conclusion

So has the entire ACAP project been a waste of time, or as Andy Beal’s great
headline put it when ACAP was announced last year, "Publishers to Spend Half
Million Dollars on a Robots.txt File?" That still makes me laugh.

No, I’d say not. I think it’s been very useful that some group has diligently
and carefully tried to explore the issues, and having ACAP lurking at the very
least gives the search engines themselves a kick in the butt to work on better
standards. Plus, ACAP provides some groundwork they may want to use. Personally,
I doubt ACAP will become Robots.txt 2.0 — but I suspect elements of ACAP will
flow into that new version or a successor.

