<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>searchengineland.com &#187; SEO: Blocking Spiders</title>
	<atom:link href="http://searchengineland.com/library/seo/seo-blocking-spiders/feed" rel="self" type="application/rss+xml" />
	<link>http://searchengineland.com</link>
	<description>Search Engine Land: Must Read News About Search Marketing &#38; Search Engines</description>
	<lastBuildDate>Sat, 21 Nov 2009 03:30:01 +0000</lastBuildDate>
	<generator>http://wordpress.org/?v=2.8.4</generator>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
			<item>
		<title>A Deeper Look At Robots.txt</title>
		<link>http://searchengineland.com/a-deeper-look-at-robotstxt-17573</link>
		<comments>http://searchengineland.com/a-deeper-look-at-robotstxt-17573#comments</comments>
		<pubDate>Thu, 16 Apr 2009 12:00:26 +0000</pubDate>
		<dc:creator>Stephan Spencer</dc:creator>
				<category><![CDATA[100% Organic]]></category>
		<category><![CDATA[How To: SEO]]></category>
		<category><![CDATA[SEO: Blocking Spiders]]></category>

		<guid isPermaLink="false">http://searchengineland.com/?p=17573</guid>
		<description><![CDATA[The Robots Exclusion Protocol (REP) is not exactly a complicated protocol and its uses are fairly limited, and thus it’s usually given short shrift by SEOs. Yet there’s a lot more to it than you might think. Robots.txt has been with us for over 14 years, but how many of us knew that in addition [...]]]></description>
			<content:encoded><![CDATA[<p>The Robots Exclusion Protocol (REP) is not exactly a complicated protocol and its uses are fairly limited, and thus it’s usually given short shrift by SEOs. Yet there’s a lot more to it than you might think. Robots.txt has been with us for over 14 years, but how many of us knew that in addition to the disallow directive there’s a noindex directive that Googlebot obeys? That noindexed pages don’t end up in the index but disallowed pages do, and the latter can show up in the search results (albeit with less information since the spiders can’t see the page content)? That disallowed pages still accumulate PageRank? That robots.txt can accept a limited form of pattern matching? That, because of that last feature, you can selectively disallow not just directories but also particular filetypes (well, file extensions to be more exact)? That a robots.txt disallowed page can’t be accessed by the spiders, so they can’t read and obey a meta robots tag contained within the page?</p>
<p>A robots.txt file provides critical information for search engine spiders that crawl the web. Before these bots (does anyone say the full word “robots” anymore?) access pages of a site, they check to see if a robots.txt file exists. Doing so makes crawling the web more efficient, because the robots.txt file keeps the bots from accessing certain pages that should not be indexed by the search engines.</p>
<p>Having a robots.txt file is a best practice, if only because some metrics programs interpret the 404 response to a request for a missing robots.txt file as an error, which could result in erroneous performance reporting. But what goes in that robots.txt file? That’s the crux of it.</p>
<p>Both robots.txt and robots meta tags rely on cooperation from the robots, and are by no means guaranteed to work for every bot. If you need stronger protection from unscrupulous robots and other agents, you should use alternative methods such as password protection. Too many times I’ve seen webmasters naively place sensitive URLs such as administrative areas in robots.txt. You’d better believe robots.txt is one of the hacker’s first ports of call, to see where to break in.</p>
<p>Robots.txt works well for:</p>
<ul>
<li>Barring crawlers from non-public parts of your website</li>
<li>Barring search engines from trying to index scripts, utilities, or other types of code</li>
<li>Avoiding the indexation of duplicate content on a website, such as “print” versions of html pages</li>
<li>Auto-discovery of XML Sitemaps</li>
</ul>
<p>At the risk of being Captain Obvious, the robots.txt file must reside in the root of the domain and must be named &#8220;robots.txt&#8221; (all lowercase). A robots.txt file located in a subdirectory isn&#8217;t valid, as bots only check for this file in the root of the domain.</p>
<p>Creating a robots.txt file is easy. You can create a robots.txt file in any text editor. It should be an ASCII-encoded text file, not an HTML file.</p>
<p><strong>Robots.txt syntax</strong></p>
<ul>
<li>User-Agent: the robot the following rule applies to (e.g. &#8220;Googlebot&#8221;)</li>
<li>Disallow: the pages you want to block the bots from accessing (as many disallow lines as needed)</li>
<li>Noindex: the pages you want a search engine to block AND not index (or de-index if previously indexed). Unofficially supported by Google; unsupported by Yahoo and Live Search.</li>
<li>Each User-Agent/Disallow group should be separated by a blank line; however no blank lines should exist within a group (between the User-agent line and the last Disallow).</li>
<li>The hash symbol (#) may be used for comments within a robots.txt file, where everything after # on that line will be ignored. May be used either for whole lines or end of lines.</li>
<li>Directories and filenames are case-sensitive: “private”, “Private”, and “PRIVATE” are all uniquely different to search engines.</li>
</ul>
<p>Let’s look at an example robots.txt file. In the example below:</p>
<ul>
<li>The robot called “Googlebot” has nothing disallowed and may go anywhere.</li>
<li>The entire site is closed off to the robot called “msnbot”.</li>
<li>All other robots (those without a group of their own) should not visit the /tmp/ directory, or any directory or file whose path begins with /logs (e.g., /logs/ or /logs.php), as explained in the comments. Note that a file such as /tmp.htm is not covered by the /tmp/ rule.</li>
</ul>
<p><code>User-agent: Googlebot<br />
Disallow:
</code></p>
<p><code>User-agent: msnbot<br />
Disallow: /
</code></p>
<p><code># Block all robots from tmp and logs directories<br />
User-agent: *<br />
Disallow: /tmp/<br />
Disallow: /logs # for directories and files called logs
</code></p>
<p><strong>What should be listed on the User-Agent line?</strong> A user-agent is the name of a specific search engine robot. You can set an entry to apply to a specific bot (by listing the name) or you can set it to apply to all bots (by listing an asterisk, which acts as a wildcard). An entry that applies to all bots looks like this:</p>
<p><code>User-Agent: *</code></p>
<p>Major robots include: Googlebot (Google), Slurp (Yahoo!), msnbot (MSN), and TEOMA (Ask).</p>
<p>Bear in mind that a block of directives specified for the user-agent of Googlebot will be obeyed by Googlebot; but Googlebot will NOT ALSO obey the directives for the user-agent of * (all bots).</p>
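<p>If you want to check this group selection for yourself, here is a minimal sketch using Python&#8217;s standard-library <code>urllib.robotparser</code>, fed the example file from above. The bot name &#8220;SomeOtherBot&#8221;, the helper <code>allowed</code>, and the example.com URLs are made up for illustration. This parser only does simple prefix matching, so treat it as an approximation rather than as how any particular search engine behaves:</p>

```python
from urllib.robotparser import RobotFileParser

# The example robots.txt file discussed in the article.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow:

User-agent: msnbot
Disallow: /

# Block all robots from tmp and logs directories
User-agent: *
Disallow: /tmp/
Disallow: /logs # for directories and files called logs
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

def allowed(bot, path):
    # Hypothetical helper: example.com is a placeholder host.
    return parser.can_fetch(bot, "http://www.example.com" + path)

print(allowed("Googlebot", "/tmp/cache.html"))     # True: Googlebot uses only its own group
print(allowed("msnbot", "/index.html"))            # False: the whole site is closed to msnbot
print(allowed("SomeOtherBot", "/tmp/cache.html"))  # False: falls under the * group
print(allowed("SomeOtherBot", "/logs.php"))        # False: path begins with /logs
print(allowed("SomeOtherBot", "/tmp.htm"))         # True: /tmp.htm does not begin with /tmp/
```

<p>The first result is the interesting one: Googlebot has its own group, so the /tmp/ rule in the catch-all group does not apply to it.</p>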
<p><strong>What should be listed on the Disallow line?</strong> The Disallow line lists the pages you want to block. You can list a specific URL or a pattern. The entry should begin with a forward slash (/).</p>
<p>Examples:</p>
<ul>
<li>To block the entire site: <code>Disallow: /</code></li>
<li>To block a directory and everything in it: <code>Disallow: /private_directory/</code></li>
<li>To block a page: <code>Disallow: /private_file.html</code></li>
<li>To block a page and/or a directory named private: <code>Disallow: /private</code></li>
</ul>
<p>If you serve content via both http and https, you’ll need a separate robots.txt file for each of these protocols. For example, to allow robots to index all http pages but no https pages, you’d use the following robots.txt file for the http protocol:</p>
<p><code>User-agent: *<br />
Disallow: </code></p>
<p>And for the https protocol:</p>
<p><code>User-agent: *<br />
Disallow: /</code></p>
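<p>Since the file always lives at the root and each protocol gets its own copy, the robots.txt location that governs a given page follows directly from the page URL. A minimal Python sketch, assuming a hypothetical helper named <code>robots_txt_url</code> and placeholder example.com URLs:</p>

```python
from urllib.parse import urlsplit, urlunsplit

def robots_txt_url(page_url):
    """Return the robots.txt URL that governs page_url (same scheme and host)."""
    parts = urlsplit(page_url)
    # Keep the protocol and host, replace the path, drop query and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))

print(robots_txt_url("http://www.example.com/shop/item.html"))
# http://www.example.com/robots.txt
print(robots_txt_url("https://www.example.com/shop/cart?id=1"))
# https://www.example.com/robots.txt
```

<p>The two results differ only in protocol, which is why the http and https versions of a site can carry different rules.</p>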
<p>Bots check for the robots.txt file each time they come to a website. The rules in the robots.txt file take effect as soon as the file is uploaded to the site’s root and a bot next comes to the site. How often the file is accessed depends on how frequently the bots spider the site, which in turn depends on popularity, authority, and how frequently content is updated. Some sites may be crawled several times a day while others may only be crawled a few times a week. Google Webmaster Central provides a way to see when Googlebot last accessed the robots.txt file.</p>
<p>I’d recommend using the robots.txt analysis tool in <a href="http://www.google.com/webmasters/">Google Webmaster Central</a> to check specific URLs to see if your robots.txt file allows or blocks them, see if Googlebot had trouble parsing any lines in your robots.txt file, and test changes to your robots.txt file.</p>
<p><strong>Some advanced techniques</strong></p>
<p>The major search engines have begun working together to advance the functionality of the robots.txt file. As alluded to above, some functions that provide finer control over crawling have been adopted by some, but not necessarily all, of the major engines. Because support may be limited, do exercise caution in their use.</p>
<p><strong>Crawl delay:</strong> Some websites may experience high amounts of traffic and would like to slow search engine spiders down to allow for more server resources to meet the demands of regular traffic. Crawl delay is a special directive recognized by Yahoo, Live Search, and Ask that instructs a crawler on the number of seconds to wait between crawling pages:</p>
<p><code>User-agent: msnbot<br />
Crawl-delay: 5</code></p>
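<p>To see how a crawler might read this directive, here is a small sketch using the <code>crawl_delay()</code> method of Python&#8217;s standard-library <code>urllib.robotparser</code> (available in Python 3.6 and later). The bot name &#8220;SomeOtherBot&#8221; is made up, and this illustrates the file format rather than how msnbot itself works:</p>

```python
from urllib.robotparser import RobotFileParser

parser = RobotFileParser()
parser.parse([
    "User-agent: msnbot",
    "Crawl-delay: 5",
])

msnbot_delay = parser.crawl_delay("msnbot")
other_delay = parser.crawl_delay("SomeOtherBot")  # hypothetical bot with no group

print(msnbot_delay)  # 5
print(other_delay)   # None: no delay was specified for this bot
```

<p>A polite crawler would then sleep for that many seconds between requests to the site.</p>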
<p><strong>Pattern matching:</strong> At this time, pattern matching appears to be usable by the three majors: Google, Yahoo, and Live Search. The value of pattern matching is considerable. Let’s look first at the most basic of pattern matching, using the asterisk wildcard character. To block access to all subdirectories that begin with &#8220;private&#8221;:</p>
<p><code>User-agent: Googlebot<br />
Disallow: /private*/</code></p>
<p>You can match the end of the string using the dollar sign ($). For example, to block URLs that end with .asp:</p>
<p><code>User-agent: Googlebot<br />
Disallow: /*.asp$</code></p>
<p>Unlike the more advanced pattern matching found in regular expressions in Perl and elsewhere, the question mark does not have special powers. So, to block access to all URLs that include a question mark (?), simply use the question mark (no need to &#8220;escape&#8221; it or precede it with a backslash):</p>
<p><code>User-agent: *<br />
Disallow: /*?*</code></p>
<p>To block robots from crawling all files of a specific file type (for example, .gif):</p>
<p><code>User-agent: *<br />
Disallow: /*.gif$</code></p>
<p>Here&#8217;s a more complicated example. Let’s say your site uses the query string part of the URLs (what follows the “?”) solely for session IDs, and you want to exclude all URLs that contain the dynamic parameter to ensure the bots don’t crawl duplicate pages. But you may want to include any URLs that end with a &#8220;?&#8221;. Here’s how you’d accomplish that:</p>
<p><code>User-agent: Slurp<br />
Disallow: /*? 		# block any URL that includes a ?<br />
Allow: /*?$ 		# allow any URL that ends in a ?</code></p>
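<p>The two special characters described above are simple enough to emulate. The sketch below translates a robots.txt pattern into a regular expression, treating * as &#8220;any run of characters&#8221; and a trailing $ as &#8220;end of URL&#8221;. The function names are hypothetical and this is only an approximation of the matching rules, not any engine&#8217;s actual implementation:</p>

```python
import re

def robots_pattern_to_regex(pattern):
    """Translate a robots.txt path pattern into a compiled regular expression."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape every literal chunk, then rejoin the chunks with ".*" for each "*".
    body = ".*".join(re.escape(chunk) for chunk in pattern.split("*"))
    return re.compile(body + (r"\Z" if anchored else ""))

def path_matches(pattern, path):
    """Return True if the URL path (including any query string) matches."""
    # re.match anchors at the start of the path, so a pattern without "$"
    # behaves like an ordinary prefix match.
    return robots_pattern_to_regex(pattern).match(path) is not None

print(path_matches("/private*/", "/private_files/report.html"))  # True
print(path_matches("/*.asp$", "/products/list.asp"))             # True
print(path_matches("/*.asp$", "/products/list.asp?id=7"))        # False
print(path_matches("/*?", "/page?sessionid=abc123"))             # True
print(path_matches("/*?$", "/page?"))                            # True
```

<p>Note that the question mark is escaped like any other literal character, which mirrors the point above that it has no special powers in robots.txt.</p>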
<p><strong>Allow directive:</strong> At this time, the Allow directive appears to only be supported by Google, Yahoo, and Ask. Just as it sounds, it works the opposite of the Disallow directive and provides the ability to specifically call out directories or pages that may be crawled. This may be beneficial after large sections or the entire site has been disallowed.</p>
<p>To allow Googlebot into only the &#8220;google&#8221; directory:</p>
<p><code>User-agent: Googlebot<br />
Disallow: /<br />
Allow: /google/</code></p>
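<p>Notice that Disallow comes first in that example, so a crawler that stopped at the first matching rule would never reach the Allow line. Google&#8217;s current documentation describes a &#8220;most specific rule wins&#8221; approach instead: the rule with the longest matching path applies, and Allow wins a tie. The sketch below illustrates that idea for plain path prefixes only (no wildcards). The function <code>is_allowed</code> and the rule tuples are hypothetical, and other engines may resolve conflicts differently:</p>

```python
def is_allowed(rules, path):
    """rules is a list of (directive, prefix) pairs from one user-agent group."""
    best_length = -1
    allowed = True  # a path that matches no rule may be crawled
    for directive, prefix in rules:
        if not prefix or not path.startswith(prefix):
            continue  # an empty value or a non-matching prefix restricts nothing
        is_allow = directive.lower() == "allow"
        # Prefer the longest matching prefix; on a tie, prefer Allow.
        if len(prefix) > best_length or (len(prefix) == best_length and is_allow):
            best_length = len(prefix)
            allowed = is_allow
    return allowed

googlebot_rules = [("Disallow", "/"), ("Allow", "/google/")]

print(is_allowed(googlebot_rules, "/google/index.html"))  # True
print(is_allowed(googlebot_rules, "/other/index.html"))   # False
```

<p>Under this rule the order of the two lines doesn&#8217;t matter, which is a good reason to test any Allow/Disallow combination in each engine&#8217;s own tools before relying on it.</p>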
<p><strong>Noindex directive:</strong> As mentioned above, this directive offers benefits in eliminating snippetless title-less listings from the search results, but it’s limited to Google. Its syntax exactly mirrors Disallow. In the words of <a href="http://www.mattcutts.com/blog/google-noindex-behavior/">Matt Cutts</a>:</p>
<blockquote><p>&#8220;Google allows a NOINDEX directive in robots.txt and it will completely remove all matching site URLs from Google. (That behavior could change based on this policy discussion, of course, which is why we haven’t talked about it much.)&#8221;</p></blockquote>
<p><strong>Sitemap:</strong> An XML sitemap file can tell search engines about all the pages on your site and, optionally, provide information about those pages, such as which are most important and how often they change. The Sitemap directive acts as an auto-discovery mechanism for the spider to find the XML sitemap file. You can tell Google and other search engines about your Sitemap by adding the following line to your robots.txt file:</p>
<p><code>Sitemap: sitemap_location</code></p>
<p>The sitemap_location should be the complete URL to the Sitemap, such as: http://www.example.com/sitemap.xml. This directive is independent of the user-agent line, so it doesn’t matter where you place it in your file. All major search engines support the Auto-Discovery Sitemap protocol, including Google, Yahoo, Live Search, and Ask.</p>
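<p>Here is a short sketch of auto-discovery from the crawler&#8217;s side, using the <code>site_maps()</code> method of Python&#8217;s standard-library <code>urllib.robotparser</code> (added in Python 3.8). The file contents and the example.com URL are placeholders:</p>

```python
from urllib.robotparser import RobotFileParser

parser = RobotFileParser()
parser.parse([
    "User-agent: *",
    "Disallow: /tmp/",
    "",
    "Sitemap: http://www.example.com/sitemap.xml",
])

sitemaps = parser.site_maps()
print(sitemaps)  # ['http://www.example.com/sitemap.xml']
```

<p>The Sitemap line is picked up even though it sits outside the user-agent group, which matches the point above that its position in the file doesn&#8217;t matter.</p>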
<p>While auto-discovery provides a way to inform search engines about the sitemap.xml file, it’s also worthwhile verifying and submitting sitemaps directly to the search engines through each of their webmaster consoles (Google Webmaster Central, Yahoo Site Explorer, Live Search Webmaster Center).</p>
<p><strong>More about Google’s bots</strong></p>
<p>Google uses several different bots (user-agents). The bot for web search is Googlebot. Google&#8217;s other bots follow rules you set up for Googlebot, but you can set up additional rules for these specific bots as well. Blocking Googlebot blocks all bots that begin with &#8220;Googlebot&#8221;.</p>
<p>Here’s a list of Google robots:</p>
<ul>
<li>Googlebot: crawls pages from web index and news index</li>
<li>Googlebot-Mobile: crawls pages for mobile index</li>
<li>Googlebot-Image: crawls pages for image index</li>
<li>Mediapartners-Google: crawls pages to determine AdSense content, only crawls sites that show AdSense ads</li>
<li>Adsbot-Google: crawls to measure AdWords landing page quality, only crawls sites that use Google AdWords to advertise</li>
</ul>
<p>You can block Googlebot entirely by using:</p>
<p><code>User-agent: Googlebot<br />
Disallow: /</code></p>
<p>You can allow Googlebot, but block access to all other bots:</p>
<p><code>User-agent: *<br />
Disallow: /</code></p>
<p><code>User-agent: Googlebot<br />
Disallow:</code></p>
<p><strong>Issues with robots.txt</strong></p>
<p>Pages you block by using robots.txt disallows may still be in Google&#8217;s index and appear in the search results &#8212; especially if other sites link to them. Granted, a high ranking is pretty unlikely since Google can’t “see” the page content; it has very little to go on other than the anchor text of inbound and internal links, and the URL (and the ODP title and description if in ODP/DMOZ.) As a result, the URL of the page and, potentially, other publicly available information can appear in search results. However, no content from your pages will be crawled, indexed or displayed.</p>
<p>To entirely prevent a page from being added to a search engine’s index even if other sites link to it, use a &#8220;noindex&#8221; robots meta tag and ensure that the page is not disallowed in robots.txt. When spiders crawl the page, they will recognize the &#8220;noindex&#8221; meta tag and drop the URL from the index.</p>
<p><strong>Robots.txt and robots meta tag conflicts</strong></p>
<p>If the robots.txt file and robots meta tag instructions for a page conflict, bots follow the most restrictive. More specifically:</p>
<ul>
<li>If you block a page with robots.txt, bots will never crawl the page and will never read any robots meta tags on the page.</li>
<li>If you allow a page with robots.txt but block it from being indexed using a robots meta tag, Googlebot will access the page, read the meta tag, and subsequently not index it.</li>
</ul>
<p>While the purpose of a robots.txt file is to protect content on a site from being indexed, including one regardless is recommended, as many robotic processes look for it and offering one can only expedite their procedures. Together, robots.txt and robots meta tags give you the flexibility to express complex access policies relatively easily:</p>
<ul>
<li>Removing an entire website or part of a website.</li>
<li>Avoiding indexation of images in Google Image Search and other image engines.</li>
<li>Avoiding indexation of duplicate content on a site.</li>
<li>Removing individual pages on a site using a robots Meta tag.</li>
<li>Removing cached copies and snippets using a robots Meta tag.</li>
</ul>
<p>Both robots.txt and robots meta tags rely on cooperation from the robots, and are by no means guaranteed to work for every robot. If you need stronger protection from robots and other agents, you should use alternative methods such as password protection.</p>
]]></content:encoded>
			<wfw:commentRss>http://searchengineland.com/a-deeper-look-at-robotstxt-17573/feed</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Google&#8217;s Advice On Using The New Canonical Tag</title>
		<link>http://searchengineland.com/googles-advice-on-using-the-new-canonical-tag-16931</link>
		<comments>http://searchengineland.com/googles-advice-on-using-the-new-canonical-tag-16931#comments</comments>
		<pubDate>Fri, 13 Mar 2009 13:19:19 +0000</pubDate>
		<dc:creator>Barry Schwartz</dc:creator>
				<category><![CDATA[Google: SEO]]></category>
		<category><![CDATA[Google: Webmaster Central]]></category>
		<category><![CDATA[SEO: Blocking Spiders]]></category>
		<category><![CDATA[SEO: Duplicate Content]]></category>
		<category><![CDATA[SEO: Redirects & Moving Sites]]></category>
		<category><![CDATA[SEO: Submitting & Sitemaps]]></category>
		<category><![CDATA[SEO: Tagging]]></category>
		<category><![CDATA[SEO: Titles & Descriptions]]></category>

		<guid isPermaLink="false">http://searchengineland.com/?p=16931</guid>
		<description><![CDATA[A month ago, Google, Yahoo and Microsoft announced they will be supporting a new canonical tag that allows you to tell search engines that page X is a duplicate page to page Z.  In a way, it is a 301 redirect, without the physical redirect.
The tag is incredibly powerful, as are 301 redirects and [...]]]></description>
			<content:encoded><![CDATA[<p>A month ago, Google, Yahoo and Microsoft announced they will be supporting a new <a href="http://searchengineland.com/canonical-tag-16537">canonical tag</a> that allows you to tell search engines that page X is a duplicate of page Z.  In a way, it is a 301 redirect, without the physical redirect.</p>
<p>The tag is incredibly powerful, as are 301 redirects, so it should be rolled out slowly and with caution.  Matt Cutts posted a new video explaining how to go about using this tag, given that it is so new.  Here is the video:</p>
<p><object width="560" height="340"><param name="movie" value="http://www.youtube.com/v/LnXponbEHjw&#038;hl=en&#038;fs=1"></param><param name="allowFullScreen" value="true"></param><param name="allowscriptaccess" value="always"></param><embed src="http://www.youtube.com/v/LnXponbEHjw&#038;hl=en&#038;fs=1" type="application/x-shockwave-flash" allowscriptaccess="always" allowfullscreen="true" width="560" height="340"></embed></object></p>
]]></content:encoded>
			<wfw:commentRss>http://searchengineland.com/googles-advice-on-using-the-new-canonical-tag-16931/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Live Search Testing New Crawler; MSNBot/2.0b</title>
		<link>http://searchengineland.com/live-search-testing-new-crawler-msnbot20b-15816</link>
		<comments>http://searchengineland.com/live-search-testing-new-crawler-msnbot20b-15816#comments</comments>
		<pubDate>Fri, 12 Dec 2008 14:05:00 +0000</pubDate>
		<dc:creator>Barry Schwartz</dc:creator>
				<category><![CDATA[Microsoft: Bing]]></category>
		<category><![CDATA[SEO: Blocking Spiders]]></category>

		<guid isPermaLink="false">http://searchengineland.com/?p=15816</guid>
		<description><![CDATA[The Live Search Blog announced they are letting a new robot loose.  The new search engine crawler is named msnbot/2.0b and will be added to the army of current MSN spiders, currently named msnbot/1.1.  
The new spider is currently being tested but will ultimately replace the old spider.  The new spider will [...]]]></description>
			<content:encoded><![CDATA[<p>The Live Search Blog <a href="http://blogs.msdn.com/webmaster/archive/2008/12/11/another-crawler-in-your-logs.aspx">announced</a> they are letting a new robot loose.  The new search engine crawler is named msnbot/2.0b and will join the army of existing MSN spiders, currently named msnbot/1.1.  </p>
<p>The new spider is currently being tested but will ultimately replace the old spider.  The new spider will respect the current robots.txt protocol set up for MSNBot, so no need to set up anything new in your robots.txt file.  In addition, Microsoft promised to crawl slowly in their msnbot/2.0b tests.</p>
<p><span id="more-15816"></span>MSNBot/1.1 is not that old. It was added back in <a href="http://searchengineland.com/msnbot-11-live-search-implements-a-more-efficient-crawl-13351.php">February</a> of this year and introduced HTTP compression and conditional GETs.</p>
]]></content:encoded>
			<wfw:commentRss>http://searchengineland.com/live-search-testing-new-crawler-msnbot20b-15816/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Irony: If Google Can&#8217;t Reach Your Robots.txt File, It Might Not List Your Site</title>
		<link>http://searchengineland.com/irony-if-google-cant-reach-your-robotstxt-file-it-might-not-list-your-site-14223</link>
		<comments>http://searchengineland.com/irony-if-google-cant-reach-your-robotstxt-file-it-might-not-list-your-site-14223#comments</comments>
		<pubDate>Wed, 18 Jun 2008 12:34:17 +0000</pubDate>
		<dc:creator>Barry Schwartz</dc:creator>
				<category><![CDATA[Google: SEO]]></category>
		<category><![CDATA[SEO: Blocking Spiders]]></category>

		<guid isPermaLink="false">http://searchengineland.com/beta/irony-if-google-cant-reach-your-robotstxt-file-it-might-not-list-your-site-14223.php</guid>
		<description><![CDATA[
]]></description>
			<content:encoded><![CDATA[<p>I <a href="http://www.seroundtable.com/archives/017444.html">reported</a> at the Search Engine Roundtable this morning that Google said if your robots.txt is unreachable, your site might not make it into the Google index. By unreachable, Google means that if your server simply times out and does not return any server response when Googlebot attempts to access your robots.txt file, then it might not include any of your pages in their index.</p>
<p>Googler John Mueller explained that Google tends to err on the &#8220;safe&#8221; side when this situation pops up.  When I showed this to Danny, he felt it was ironic that if Google can&#8217;t read what you want to block, it might block everything.  But if you think about it, with all the <a href="http://searchengineland.com/070703-090525.php">legal woes</a> Google has to deal with about indexing content, should they risk indexing a site that might have a disallow directive in its robots.txt file?</p>
<p><span id="more-14223"></span>
It is important to clarify that a robots.txt file is not required in order to be listed with Google. If you don&#8217;t have one and Google sees a normal server status response such as a 404 not found, all&#8217;s good. It&#8217;s only when Google asks for a robots.txt file and gets no response at all that this might be an issue. Rare case, but good to know.</p>
]]></content:encoded>
			<wfw:commentRss>http://searchengineland.com/irony-if-google-cant-reach-your-robotstxt-file-it-might-not-list-your-site-14223/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Everything You Wanted To Know About Blocking Search Engines</title>
		<link>http://searchengineland.com/everything-you-wanted-to-know-about-blocking-search-engines-14193</link>
		<comments>http://searchengineland.com/everything-you-wanted-to-know-about-blocking-search-engines-14193#comments</comments>
		<pubDate>Thu, 12 Jun 2008 14:14:02 +0000</pubDate>
		<dc:creator>Danny Sullivan</dc:creator>
				<category><![CDATA[SEO: Blocking Spiders]]></category>

		<guid isPermaLink="false">http://searchengineland.com/beta/everything-you-wanted-to-know-about-blocking-search-engines-14193.php</guid>
		<description><![CDATA[
]]></description>
			<content:encoded><![CDATA[<p>Last week, the three major search engines
<a href="http://searchengineland.com/080603-121100.php">came together</a> to
say how they agree &#8212; and disagree &#8212; over the Robots Exclusion Protocol.
It&#8217;s such an important standard, one every webmaster should understand. To
help, Vanessa Fox has compiled an extensive and outstanding overview of it
at Jane &amp; Robot in her
<a href="http://janeandrobot.com/post/Managing-Robots-Access-To-Your-Website.aspx">
Managing Robot&#8217;s Access To Your Website</a> post.</p>
<p>The tutorial takes you through key areas such as:</p>
<p><span id="more-14193"></span></p>
<ul>
<li>A nice chart showing what you can block using either robots.txt or the
meta robots tag for each major search engine. It also covers other things
like reverse DNS lookup to verify a crawler&#8217;s identity.<br />
&nbsp;</li>
<li>Types of content you want private from search engines versus public.
Rather than private versus public, &quot;not listed&quot; versus &quot;listed&quot; might be
better terms. Anything that really should be private ought to be kept
behind a password barrier. The tutorial does cover this, but it&#8217;s worth
stressing that no one should think robots exclusion is a method to keep
private/personally identifiable information out of search engines. But
there&#8217;s other info that you might want &quot;private&quot; in terms of not being
listed, such as printer-friendly pages, as the tutorial also explains.<br />
&nbsp;</li>
<li>How to block search engines, such as on a site-wide basis using
robots.txt, along with tips like using wildcards and specifying particular
search engines by crawler name. Page level blocking (with meta tags) is
also covered. There are lots of examples.<br />
&nbsp;</li>
<li>Common mistakes and myths are addressed, such as the idea that using
nofollow alone will keep pages from being indexed. Methods of testing
implementation are also covered.</li>
</ul>
<p>Bookmark the guide &#8212; it&#8217;s one you&#8217;ll want to come back to time and
again.</p>
]]></content:encoded>
			<wfw:commentRss>http://searchengineland.com/everything-you-wanted-to-know-about-blocking-search-engines-14193/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Yahoo!, Google, Microsoft Clarify Robots.txt Support</title>
		<link>http://searchengineland.com/yahoo-google-microsoft-clarify-robotstxt-support-14125</link>
		<comments>http://searchengineland.com/yahoo-google-microsoft-clarify-robotstxt-support-14125#comments</comments>
		<pubDate>Tue, 03 Jun 2008 16:11:00 +0000</pubDate>
		<dc:creator>Vanessa Fox</dc:creator>
				<category><![CDATA[SEO: Blocking Spiders]]></category>

		<guid isPermaLink="false">http://searchengineland.com/beta/yahoo-google-microsoft-clarify-robotstxt-support-14125.php</guid>
		<description><![CDATA[
]]></description>
			<content:encoded><![CDATA[<p>Today, Google, Yahoo!, and Microsoft have come together to post details of how each of them supports <a title="robots.txt" href="http://www.robotstxt.org" id="k_jr">robots.txt</a> and the <a title="robots meta tag" href="http://searchengineland.com/070305-204850.php" id="srt1">robots meta tag</a>. While their posts use terms like &#8220;collaboration&#8221; and &#8220;working together,&#8221; they haven&#8217;t joined together to implement a new standard (as they did with sitemaps.org). Rather, they are simply making a joint stand in messaging that robots.txt is the standard way of blocking search engine robot access to web sites. They have identified a core set of robots.txt and robots meta tag directives that all three engines support.</p>
<p>Google and Yahoo! already supported and documented each of the core directives, and Microsoft supported most of them before this announcement. In their posts, they also list the directives they support that may not be supported by the other engines.</p>
<p><span id="more-14125"></span>
For robots.txt, they all support:</p>
<ul>
<li>Disallow</li>
<li>Allow</li>
<li>Use of wildcards</li>
<li>Sitemap location</li>
</ul>
<p>For robots meta tags, they all support:</p>
<ul>
<li>noindex</li>
<li>nofollow</li>
<li>noarchive</li>
<li>nosnippet</li>
<li>noodp</li>
</ul>
<p>With this announcement, Microsoft appears to be adding support for the use of * wildcards (which will go live later this month) and the Allow directive. The biggest discrepancy is with the crawl-delay directive. Yahoo! and Microsoft support it, while Google does not (although Google does support control of crawl speed via <a title="Webmaster Tools" href="http://www.google.com/webmasters" id="kvo4">Webmaster Tools</a>). <br id="evmp0">
<br id="e0y40">
This isn&#8217;t the first time the major search engines have come together for an announcement regarding how they support publishers. In late 2006, all three joined together to support XML Sitemaps and launched <a title="sitemaps.org" href="http://sitemaps.org/" id="k972">sitemaps.org</a>, followed in April 2007 with support for <a title="Sitemaps autodiscovery" href="http://searchengineland.com/070411-080716.php" id="ba9j">Sitemaps autodiscovery</a> in robots.txt, and in February 2008 with support for more <a title="flexible storage locations of Sitemap files" href="http://searchengineland.com/080227-211358.php" id="wirt">flexible storage locations of Sitemap files</a>. In early 2005, the engines declared <a title="support for the nofollow attribute" href="http://googleblog.blogspot.com/2005/01/preventing-comment-spam.html" id="n_gv">support for the nofollow attribute</a> on links (in an effort to combat comment spam).<br id="kz5l0">
<br id="kz5l1">
Why are the search engines coming together to talk about their varied support for traditional methods for blocking access to web content? A Microsoft spokesperson told me that while robots.txt has been the de facto standard for some time, the search engines had never come together to detail how they support it and said the aim is to &#8220;make REP more intuitive and friendly to even more publishers on the web.&#8221; Google similarly said that &#8220;doing a joint post allows webmasters to see how we all honor REP directives, the majority of which are identical, but we also call out those that are not used by all of us.&#8221;</p>
<p>Yahoo! told me:</p>
<blockquote><p>Our goal is to come out with clear information about the actual support around REP for all engines. We have all separately at different times reported our support and this creates a long trail hard for anyone to put together. Posting the same spec at the same time provides a sync point for everyone as to the actual similarities or differences between our implementations for all engines. We are trying to address the latent concerns around differences across the engines.</p></blockquote>
<p>Of course, each engine has provided documentation in their respective help centers for some time, and <a title="Google" href="http://www.google.com/webmasters" id="m1zi">Google</a> and <a title="Microsoft" href="http://webmaster.live.com" id="d.jf">Microsoft</a> provide robots.txt analysis tools that detail how they interpret a file in their webmaster tools, so while they haven&#8217;t documented their support jointly, the documentation itself isn&#8217;t new. <br id="knz_0">
<br id="knz_1">
This move may be an effort to show a united front in light of the ongoing publisher attempts to create new search engine access standards with <a title="ACAP" href="http://searchengineland.com/071129-120258.php" id="ys:e">ACAP</a>. It is consistent with what the search engines have been saying about ACAP all along. For instance, Rob Jonas, Google&#8217;s head of media and publishing partnerships in Europe, said in March that &#8220;the general view is that the robots.txt protocol provides everything that most publishers need to do.&#8221; <br id="jzy-0">
<br id="jzy-1">
For more information, see each engine&#8217;s blog posts (updated as their posts go live):<br id="f3x.0"></p>
<ul id="jzy-3">
<li id="jzy-4"><a title="Microsoft Live Search Webmaster Blog" href="http://blogs.msdn.com/webmaster/archive/2008/06/03/robots-exclusion-protocol-joining-together-to-provide-better-documentation.aspx" id="b1..">Microsoft Live Search Webmaster Blog</a> <br id="qr.u0">
</li>
<li><a href="http://googlewebmastercentral.blogspot.com/2008/06/improving-on-robots-exclusion-protocol.html">Google Webmaster Central Blog</a></li>
<li><a href="http://www.ysearchblog.com/archives/000587.html">Yahoo! Search Blog</a></li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://searchengineland.com/yahoo-google-microsoft-clarify-robotstxt-support-14125/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Google Offers Robots.txt Generator</title>
		<link>http://searchengineland.com/google-offers-robotstxt-generator-13653</link>
		<comments>http://searchengineland.com/google-offers-robotstxt-generator-13653#comments</comments>
		<pubDate>Thu, 27 Mar 2008 21:39:46 +0000</pubDate>
		<dc:creator>Danny Sullivan</dc:creator>
				<category><![CDATA[Google: SEO]]></category>
		<category><![CDATA[Google: Webmaster Central]]></category>
		<category><![CDATA[SEO: Blocking Spiders]]></category>

		<guid isPermaLink="false">http://searchengineland.com/beta/google-offers-robotstxt-generator-13653.php</guid>
		<description><![CDATA[
]]></description>
			<content:encoded><![CDATA[<p>Google&#8217;s
<a href="http://googlewebmastercentral.blogspot.com/2008/03/speaking-language-of-robots.html">
rolled out</a> a new tool at <a href="http://www.google.com/webmasters/">Google
Webmaster Central</a>, a robots.txt generator. It&#8217;s designed to allow site
owners to easily create a robots.txt file, one of the two main ways (along with
the <a href="http://searchengineland.com/070305-204850.php">meta robots tag</a>)
to prevent search engines from indexing content. Robots.txt generators aren&#8217;t
new. You can find many of them out there by searching. But this is the first
time a major search engine has provided a generator tool of its own.</p>
<p>It&#8217;s nice to see the addition. Robots.txt files aren&#8217;t complicated to create.
You can write them using a text editor such as Notepad with just a few simple
commands. But they can still be scary or hard for some site owners to
contemplate.</p>
<p><span id="more-13653"></span></p>
<p>To access the tool, log in to your
<a href="https://www.google.com/webmasters/tools/">Google Webmaster Tools</a>
account, then click on the Tools menu option on the left-hand side of the screen
after you select one of your verified sites. You&#8217;ll see a &quot;Generate robots.txt&quot;
link among the tool options. That&#8217;s what you want.</p>
<p>By default, the tool is designed to let you create a robots.txt file to allow
all robots into your site. That&#8217;s kind of odd. By default, all robots will come
into your site. If you want them, then there&#8217;s no need to have a robots.txt file
at all. It&#8217;s like pinning a note to your chest reminding yourself to breathe.
Promise, you&#8217;ll keep breathing even if you forget to look at the note.</p>
<p>Instead, you generally want to put up a robots.txt file to block crawling of
some type. I may dig into this in a future article to examine when you might want to mix
allow and disallow statements, but off the top of my head, there aren&#8217;t a lot of reasons
to do so.</p>
<p>You can change the default option to &quot;Block all robots&quot; easily enough. Do
that, and you get the standard and familiar two-line keep-out code:</p>
<blockquote>
<p>User-Agent: *<br />
Disallow: /</p>
</blockquote>
<p>The first line &#8212; User-Agent &#8212; is how you tell particular spiders or robots
to pay attention to the following instructions. Using the wildcard &#8212; * &#8212; says
&quot;hey ALL spiders, listen up.&quot;</p>
<p>The second line says what they can&#8217;t access. In this case, the / means to not
spider anything within the web site. You know how pages within a web site all
begin domain/something, like this:</p>
<blockquote>
<p>http://website.com/page.html</p>
</blockquote>
<p>See that / between website.com and page.html? Technically, that slash is the
start of the URL&#8217;s path, which is the part that robots.txt rules are matched against.
So if you disallow all paths beginning with a slash, you&#8217;re
blocking all pages within the entire site.</p>
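<p>If it helps to see that as code, basic Disallow matching is just a prefix test. Here&#8217;s a minimal Python sketch; the function name is made up for illustration, and it ignores Allow lines and wildcards entirely.</p>

```python
def is_disallowed(path, disallow_rules):
    """Return True if the URL path starts with any non-empty Disallow value.

    An empty Disallow value blocks nothing, so it is skipped.
    """
    return any(rule != "" and path.startswith(rule) for rule in disallow_rules)


# Every path begins with "/", so "Disallow: /" catches the whole site.
print(is_disallowed("/page.html", ["/"]))
# A narrower rule only catches paths that begin with it.
print(is_disallowed("/page.html", ["/print"]))
print(is_disallowed("/print/page.html", ["/print"]))
```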
<p>Let&#8217;s move on from our mini-robots.txt 101 course. Maybe you only want to
block Google. Well, the tool is supposed to make this type of thing easy, but I
was perplexed. Step one is to either allow or block ALL robots. Then in Step 2,
you decide if you want to block specific robots. So which do you go with in step
1, block all or none?</p>
<p>I figured you&#8217;d want to allow all robots, then believe the reassuring text
next to that option that said &quot;you can fine-tune this rule in the next step.&quot;
The problem is, I couldn&#8217;t. If I tried to block Googlebot, the instructions
didn&#8217;t change. If I tried to choose, say, Googlebot-Mobile, same thing.</p>
<p>Eventually, I figured it out. If you decide to block specific spiders, you
have to choose the spider, then specify also what you want to block in the
&quot;Files or directories&quot; box, such as a particular file or directory. So say I
kept all print-only versions of stories in a directory called /print. I&#8217;d enter
that directory to get this:</p>
<blockquote>
<p>User-Agent: *<br />
Allow: /</p>
<p>User-Agent: Googlebot<br />
Disallow: /print<br />
Allow: /</p>
</blockquote>
<p>The first part tells spiders they can access the entire site. As I said, this
is entirely unnecessary, but you get it anyway. The second part says that
Googlebot cannot access the /print area.</p>
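<p>You can sanity-check a file like that with the robots.txt parser in Python&#8217;s standard library. This is only a sketch: the website.com URLs and the SomeOtherBot name are placeholders, and Python&#8217;s parser is not guaranteed to interpret every file exactly the way Googlebot does.</p>

```python
from urllib.robotparser import RobotFileParser

# The file the generator produced in the example above.
GENERATED = """\
User-Agent: *
Allow: /

User-Agent: Googlebot
Disallow: /print
Allow: /
"""

parser = RobotFileParser()
parser.parse(GENERATED.splitlines())


def can_crawl(agent, path):
    """Return True if the named crawler may fetch the path on a placeholder site."""
    return parser.can_fetch(agent, "http://website.com" + path)


print(can_crawl("Googlebot", "/print/story.html"))     # blocked by Disallow: /print
print(can_crawl("Googlebot", "/page.html"))            # falls through to Allow: /
print(can_crawl("SomeOtherBot", "/print/story.html"))  # uses the * section
```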
<p>The tool lets you craft specific rules for these particular Google crawlers:</p>
<ul>
<li>Googlebot</li>
<li>Googlebot-Mobile</li>
<li>Googlebot-Image</li>
<li>Mediapartners-Google</li>
<li>Adsbot-Google</li>
</ul>
<p>I wish the names were accompanied by parentheses briefly explaining what each
crawler does, and what blocking them will do, say, something like this:</p>
<ul>
<li>Googlebot-Mobile (allows or blocks content from Google mobile search)</li>
</ul>
<p>Instead, you have to look through the various
<a href="http://www.google.com/support/webmasters/bin/answer.py?answer=40360">
help files</a> to understand what each does. Ironically, the
<a href="http://searchengineland.com/070816-085858.php">older Analyze Robots.txt
tool</a> within Google Webmaster Tools DOES have these helpful explanations, so I
expect they&#8217;ll migrate over.</p>
<p>You can also use the tool to enter a name for another crawler. The problem
is, someone using this tool probably doesn&#8217;t know the crawler names out there
that they want to block. I&#8217;d have given Google serious kudos points if they added
some of the other major crawlers. But then again, if they had, no doubt someone
would have accused them of trying to get people to block other search engines :)</p>
<p>Another thing that would have been nice was if people could have pasted full
URLs into the box to have them converted. A site owner using this tool might not
realize they need to drop the domain portion of a URL to block a particular
page. It would help if you could paste something like this:</p>
<blockquote>
<p>http://website.com/page-i-want-to-block.html</p>
</blockquote>
<p>And have the tool automatically turn it into this:</p>
<blockquote>
<p>User-Agent: *<br />
Disallow: /page-i-want-to-block.html</p>
</blockquote>
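<p>That kind of conversion isn&#8217;t hard to do yourself. Here&#8217;s a minimal Python sketch of it; the function name is hypothetical and this is not something the Google tool offers.</p>

```python
from urllib.parse import urlsplit


def disallow_rule_for(url):
    """Turn a full URL into a two-line robots.txt block for all robots."""
    parts = urlsplit(url)
    # Keep only what follows the domain: the path plus any query string.
    path = parts.path or "/"
    if parts.query:
        path += "?" + parts.query
    return "User-Agent: *\nDisallow: " + path


print(disallow_rule_for("http://website.com/page-i-want-to-block.html"))
```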
<p>After you make your file, upload it to the root directory of your web site.
If you don&#8217;t know what that is, find someone who does! This is important. Google
allows for subdirectories of web sites to be registered within Google Webmaster
Tools. However, robots.txt files do NOT work on a subdirectory basis. They have
to go at the root level of a web site. If you don&#8217;t put them there, then you
won&#8217;t be preventing access to any part of the site. Remember, after you upload
to the root level, you can go back into Google Webmaster Tools and use that
aforementioned analysis tool to see if it is really blocking the pages you want
to keep out.</p>
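<p>To make the root-level point concrete, here&#8217;s a small Python sketch that works out where the robots.txt file for any given page has to live. The function name and sample URL are made up for illustration.</p>

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url):
    """Return the root-level robots.txt URL for the site that hosts page_url."""
    parts = urlsplit(page_url)
    # Whatever directory the page is in, the file sits directly under the host.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_txt_url("http://website.com/blog/2008/post.html"))
```

<p>However deep the page is, the answer is always the same file at the top of the site.</p>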
<p>Overall, I&#8217;m glad to see the new tool, and I imagine it will improve more
over time to make it even more user friendly.</p>
<p>In related news, Google says that the Web Crawl diagnostics area now has a new
filter letting you see only web crawl errors related to sitemaps you&#8217;ve
submitted. Also, there have been some UI tweaks to the iGoogle gadgets from
Webmaster Central that were
<a href="http://searchengineland.com/080228-134802.php">rolled out</a> last
month.</p>
<p>For more about Google&#8217;s webmaster tools, be sure to check out the
<a href="http://www.google.com/webmasters/edu/quickstartguide/">quick start
guide</a> they offer and see our
<a href="http://searchengineland.com/lands/google-webmaster-central.php">Google
Webmaster Central archives</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://searchengineland.com/google-offers-robotstxt-generator-13653/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>SEOs Want The NOINDEX Tag To Not Show A Page In The Index</title>
		<link>http://searchengineland.com/seos-want-the-noindex-tag-to-not-show-a-page-in-the-index-13448</link>
		<comments>http://searchengineland.com/seos-want-the-noindex-tag-to-not-show-a-page-in-the-index-13448#comments</comments>
		<pubDate>Mon, 25 Feb 2008 12:53:07 +0000</pubDate>
		<dc:creator>Barry Schwartz</dc:creator>
				<category><![CDATA[Google: SEO]]></category>
		<category><![CDATA[SEO: Blocking Spiders]]></category>

		<guid isPermaLink="false">http://searchengineland.com/beta/seos-want-the-noindex-tag-to-not-show-a-page-in-the-index-13448.php</guid>
		<description><![CDATA[
]]></description>
			<content:encoded><![CDATA[<p>Matt Cutts of Google posted a <a href="http://www.mattcutts.com/blog/google-noindex-behavior/">blog entry</a> asking SEOs how they want Google to handle the NOINDEX meta tag.  If you use the NOINDEX meta tag now, Google won&#8217;t show the page in any way in the Google index &#8212; not even a &#8220;link only&#8221; listing.</p>
<p>Matt asks SEOs if this is what they want and the poll currently shows us that yes, SEOs want it this way.  Here are the current results, but the results may change over the course of the week:</p>
<p><span id="more-13448"></span>
How should Google treat the NOINDEX meta tag?</p>
<ul>
<li>240 say &#8220;Don&#8217;t show a page at all.&#8221;</li>
<li>24 say &#8220;Find some middle ground.&#8221;</li>
<li>23 say &#8220;Show a link to the page.&#8221;</li>
</ul>
<p><a href="http://searchengineland.com/070223-092620.php">Google Explains The NOINDEX, NOFOLLOW, NOARCHIVE &#038; NOSNIPPET META Tags</a> from last year has more about the various meta commands Google allows, and <a href="http://searchengineland.com/070305-204850.php">Meta Robots Tag 101: Blocking Spiders, Cached Pages &#038; More</a> goes into depth about how each feature is used for particular search engines.</p>
]]></content:encoded>
			<wfw:commentRss>http://searchengineland.com/seos-want-the-noindex-tag-to-not-show-a-page-in-the-index-13448/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Yahoo Search Weather Update &amp; Support For X-Robots Tag</title>
		<link>http://searchengineland.com/yahoo-search-weather-update-support-for-x-robots-tag-12855</link>
		<comments>http://searchengineland.com/yahoo-search-weather-update-support-for-x-robots-tag-12855#comments</comments>
		<pubDate>Wed, 05 Dec 2007 18:08:05 +0000</pubDate>
		<dc:creator>Barry Schwartz</dc:creator>
				<category><![CDATA[SEO: Blocking Spiders]]></category>
		<category><![CDATA[Yahoo: Search]]></category>

		<guid isPermaLink="false">http://searchengineland.com/beta/yahoo-search-weather-update-support-for-x-robots-tag-12855.php</guid>
		<description><![CDATA[
]]></description>
			<content:encoded><![CDATA[<p>The Yahoo Blog <a href="http://www.ysearchblog.com/archives/000508.html">issued</a> a weather report for changes to rankings in Yahoo Search, along with news that they are now supporting the X-Robots-Tag directive &#8212; a way to control indexing of content that cannot accept <a href="http://searchengineland.com/070305-204850.php">meta robots tags</a>.</p>
<p><span id="more-12855"></span>
Google also <a href="http://searchengineland.com/070717-111517.php">supports</a> the X-Robots-Tag, which gives webmasters the ability to define meta robots-style rules within HTTP headers, rather than only in META tags within HTML pages.</p>
<p>Yahoo provided a few examples of how it can work:</p>
<ul>
<li>X-Robots-Tag: NOINDEX &#8212; If you don&#8217;t want to show the URL in the Yahoo! Search results. Note: We&#8217;ll still need to crawl the page to see and apply the tag, so if you don&#8217;t wish to have the page crawled, use robots disallow on robots.txt.</li>
<li>X-Robots-Tag: NOARCHIVE &#8212; If you don&#8217;t want to display the cache link in the search results page.</li>
<li>X-Robots-Tag: NOSNIPPET &#8212; If you don&#8217;t want to display a summary in the search results page.</li>
<li>X-Robots-Tag: NOFOLLOW &#8212; If you don&#8217;t want Yahoo! to crawl links in the page.</li>
</ul>
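<p>As a rough sketch of how a site might send the header for content that cannot carry a meta tag, here is a minimal Python WSGI application. The choice of PDF files and the &quot;noindex, noarchive&quot; value are assumptions for illustration, not a recommendation from Yahoo or Google.</p>

```python
def app(environ, start_response):
    """Minimal WSGI app that adds an X-Robots-Tag header to PDF responses."""
    path = environ.get("PATH_INFO", "/")
    headers = [("Content-Type", "text/plain")]
    if path.lower().endswith(".pdf"):
        # A PDF has no HTML head for a meta robots tag, so use the header.
        headers.append(("X-Robots-Tag", "noindex, noarchive"))
    start_response("200 OK", headers)
    return [b"placeholder body\n"]
```

<p>To try it locally, the app can be served with <code>wsgiref.simple_server.make_server</code> from Python&#8217;s standard library. Most sites would set the header in their web server configuration instead.</p>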
]]></content:encoded>
			<wfw:commentRss>http://searchengineland.com/yahoo-search-weather-update-support-for-x-robots-tag-12855/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>ACAP Launches, Robots.txt 2.0 For Blocking Search Engines?</title>
		<link>http://searchengineland.com/acap-launches-robotstxt-20-for-blocking-search-engines-12802</link>
		<comments>http://searchengineland.com/acap-launches-robotstxt-20-for-blocking-search-engines-12802#comments</comments>
		<pubDate>Thu, 29 Nov 2007 16:02:58 +0000</pubDate>
		<dc:creator>Danny Sullivan</dc:creator>
				<category><![CDATA[SEO: Blocking Spiders]]></category>

		<guid isPermaLink="false">http://searchengineland.com/beta/acap-launches-robotstxt-20-for-blocking-search-engines-12802.php</guid>
		<description><![CDATA[
]]></description>
			<content:encoded><![CDATA[<p>After a year of discussions, ACAP &#8212; <a href="http://www.the-acap.org/">
Automated Content Access Protocol</a> &#8212; was released today as a sort of
robots.txt 2.0 system for telling search engines what they can or can&#8217;t include
in their listings. However, none of the major search engines support ACAP, and
its future remains firmly one of &quot;watch and see.&quot; Below, more about the how and
why of ACAP.</p>
<p><span id="more-12802"></span></p>
<p>Let&#8217;s start with some history. ACAP
<a href="http://www.epceurope.org/presscentre/archive/New_rights_management_pilot_imminent.shtml">
got going</a> in September 2006, backed by major European newspaper and
publishing groups that in particular felt Google was using content without
proper permissions and wanted a more flexible means to grant them than
allowed by the long-standing robots.txt and meta robots standards.</p>
<p>These two standards are found at <a href="http://www.robotstxt.org/">
robotstxt.org</a>, and ACAP has often been referring to them as the &quot;Robots
Exclusion Protocol&quot; or REP, though within the SEO world, they&#8217;re generally known
by their actual names.</p>
<p>Robots.txt was born in 1994 as a way to block content on a server-wide basis;
meta robots emerged in 1996 as a system to block on a page-by-page basis (see
<a href="http://searchengineland.com/070305-204850.php">Meta Robots Tag 101:
Blocking Spiders, Cached Pages &amp; More</a> for more about it). Neither has been
updated since then, in terms of search engines coming together to
agree on new universal standards. In short, REP has no &quot;guardians&quot; or group to
take it forward.</p>
<p>Enter ACAP. If the search engines weren&#8217;t going to improve robots.txt, the
aforementioned publishers decided they&#8217;d take on the challenge. Of course,
creating a standard for search engine indexing is kind of a waste of time, if
you don&#8217;t have the search engines themselves to actually support it. But ACAP
didn&#8217;t let that be a deterrent. Over the past year, it has had a working group
setting up a new system, with search engines <a href="http://google.com">Google</a>
and <a href="http://ask.com">Ask.com</a>, along with
<a href="http://www.exalead.com/">Exalead</a>, taking part in the discussions.
FYI, I&#8217;ve not been an active working member, but I&#8217;ve been included on the
working group&#8217;s emails and chimed in from time to time with advice and thoughts.</p>
<p><b>The ACAP System</b></p>
<p>Now the new system has arrived, being unveiled at the
<a href="http://www.the-acap.org/conference.php">ACAP conference</a> in New York
today. Before getting into support, let&#8217;s cover what&#8217;s in it. You&#8217;ll find an
overview page for the specifications
<a href="http://www.the-acap.org/implement-acap.php">here</a>, which leads to:</p>
<ul>
<li><b>A robots.txt-to-ACAP conversion
<a href="http://www.the-acap.org/convert-robots-txt-to-acap.php">tool</a> </b>
(don&#8217;t worry; this should make your robots.txt file still work as a regular
one and double as an ACAP file)<br />
&nbsp;</li>
<li><b>ACAP extensions to use with robots.txt </b>(<a href="http://www.the-acap.org/project_documents/ACAP-TF-CrawlerCommunications-Part1-V1.0.pdf">here</a>,
PDF file)<br />
&nbsp;</li>
<li><b>ACAP extensions to use with meta robots</b> (<a href="http://www.the-acap.org/project_documents/ACAP-TF-CrawlerCommunications-Part2-V1.0.pdf">here</a>,
PDF file)<br />
&nbsp;</li>
<li><b><a href="http://www.the-acap.org/add-acap-enabled.php">ACAP logo</a>
</b>for those that want to show they&#8217;re using ACAP (not required to make ACAP
work, but expect publishers pushing ACAP to make use of it)</li>
</ul>
<p>What does ACAP provide that robots.txt and meta robots do not? After going
through the technical specs, which are pretty dense reading, I&#8217;d summarize it
this way:</p>
<ul>
<li>Emphasis on both granting permissions and blocking<br />
&nbsp;</li>
<li>Support for time-based inclusion or exclusion</li>
</ul>
<p>That&#8217;s it. Discussions have covered concepts such as how password-protected
content could be indexed, or whether you could issue permissions on a
country-by-country basis, but some of these ideas haven&#8217;t made it into the first
cut.</p>
<p>AP has a nice overview
<a href="http://ap.google.com/article/ALeqM5iMXdcInM2ce3lCUBo7a8MNE_QGgQD8T7E6R00">
article</a> about the ACAP launch, and I found the companion
<a href="http://ap.google.com/article/ALeqM5jTZkoUsEdzTvUcGl1XqPL4msh8RgD8T74P400">
piece</a> a nice summary if you&#8217;re looking for some faster specifics. A key
part:</p>
<blockquote>
<p>Some search engines have interpreted &quot;disallow&quot; to mean that the site
cannot be added to the index but could be fetched for use in various
algorithms employed to determine how high a site appears in search
results&#8230;.ACAP proposes to clarify that &quot;disallow&quot; refers to indexing. </p>
<p>A separate &quot;crawl&quot; command would be added to bar the indexing software or
crawler entirely. </p>
<p>In addition, Web sites would be able to add qualifiers stipulating that the
information expires from the search index on a specific date, in a given
number of days or whenever the crawler returns to the site. </p>
<p>A &quot;follow&quot; command would permit or block the crawler from following links
within a page. </p>
<p>&quot;Preserve,&quot; with similar time limits available for &quot;index,&quot; would stipulate whether a copy may be stored in a search engine&#8217;s cache. </p>
<p>&quot;Present&quot; would govern a search engine&#8217;s ability to display the copy, and a
site may limit that further — for example, to a snippet or to a miniaturized
version, or thumbnail. </p>
</blockquote>
<p>As I said, there&#8217;s an emphasis on granting permission. By default, search
engines assume everything is open to indexing. ACAP changes this assumption,
asking those that create the files to explicitly indicate yes or no.</p>
<p><b>Should You Use It?</b></p>
<p>So now we have a new standard for expressing search engine permissions. Do
site owners need to run out and immediately use it?</p>
<p>No. Not immediately. Not even long term.</p>
<p>Right now, none of the major search engines are supporting ACAP. If you were
to use ACAP without ensuring that standard robots.txt or meta robots commands
were also included, you&#8217;d fail to properly block search engines. Only Exalead,
which is not a major multi-country service, would currently act upon your
ACAP-only commands.</p>
<p>Even if ACAP were to magically get endorsed and supported by all the major
search engines, robots.txt and meta robots support wouldn&#8217;t go away for many
years. There are simply too many sites that use those systems, have used them
for over a decade, and would fail to upgrade. Those two systems will continue to
be supported in the same way Microsoft has had to support DOS programs despite
the growth of Windows.</p>
<p>So why bother at all? Probably two reasons:</p>
<ul>
<li>You want to personally test out how ACAP works, playing with the
permissions and seeing what happens in Exalead<br />
&nbsp;</li>
<li>You want to support the ACAP system and hope that if enough people use it,
perhaps the search engines will adopt it. FYI, ACAP is
<a href="http://www.the-acap.org/press_releases/ACAP_News_Release_NOV07.pdf">
urging</a> (PDF file) &quot;universal adoption&quot; by publishers by the end of next
year.</li>
</ul>
<p><b>Search Engine Support</b></p>
<p>What&#8217;s up with the major services? I emailed the big three, Google,
Microsoft, and Yahoo, all of whom either took part in the working group or are at
today&#8217;s conference. Google&#8217;s canned answer:</p>
<blockquote>
<p>We are interested in all initiatives that allow web publishers and search
engines to work more closely together. We have undertaken many efforts in this
direction over the years including supporting file-extension and wildcard
specifications in robots.txt, SiteMaps, our Webmaster Console, extending
per-item indexing specification to non-html documents and specifying how long
a url would be available. We will examine ACAP proposals when they become
available. As a broad-based search engine, we need to keep in mind the needs
of millions of web publishers worldwide.</p>
</blockquote>
<p>As it happens, I was at Microsoft yesterday, and while I haven&#8217;t gotten a
formal statement to post, the sentiment was the same as Google&#8217;s. Microsoft is
interested in supporting publishers, is continuing to grow its own tools and
will also watch ACAP, wanting to support publishers in general.</p>
<p>Yahoo&#8217;s not sent a statement back yet, but when it arrives, you can expect it
will be pretty much the same as Google and Microsoft.</p>
<p>Why not just jump into ACAP? Between the lines time here &#8212; no one really
wants to hand over control of the standard to the ACAP group, especially in my
view when it has been born out of some anti-search engine hype.</p>
<p>So why not jump behind improving robots.txt and meta robots? Another issue
here is that no one is officially in charge of those standards. The search
engines are sort of the gatekeepers, because it&#8217;s what they decide to support
that effectively becomes &quot;law.&quot; If they don&#8217;t support a particular exclusion
command, it might as well not exist.</p>
<p>The various search engines tell me they have been talking more about making
some collective improvements. Individually, they&#8217;ve already added to both
robots.txt and meta robots over the years, extensions that may work with their
particular search engines. Perhaps they will become more unified.</p>
<p>In particular, they&#8217;ve united around the sitemaps
<a href="http://www.sitemaps.org/">standard</a>. That sort of picks up what ACAP
does in terms of being a system to provide express permission of indexing, and
it&#8217;s where I&#8217;d expect any search engine-driven, collective agreement about
improved blocking tools to emerge.</p>
<p>Also be sure to read <a href="http://searchengineland.com/070416-131549.php">
Up Close &amp; Personal With Robots.txt</a>, which summarizes the second robots.txt
summit that I organized earlier this year. The article covers a lot of things
that general site owners and SEOs have wished for, along with some search engine
responses.</p>
<p><b>Conclusion</b></p>
<p>So has the entire ACAP project been a waste of time, or as Andy Beal&#8217;s great
headline put it when ACAP was announced last year,
<a href="http://www.marketingpilgrim.com/2006/09/publishers-to-spend-half-million.html" rel="bookmark" title="Permanent Link: Publishers to Spend Half Million Dollars on a Robots.txt File">
Publishers to Spend Half Million Dollars on a Robots.txt File</a>? That still
makes me laugh.</p>
<p>No, I&#8217;d say not. I think it&#8217;s been very useful that some group has diligently
and carefully tried to explore the issues, and having ACAP lurking at the very
least gives the search engines themselves a kick in the butt to work on better
standards. Plus, ACAP provides some groundwork they may want to use. Personally,
I doubt ACAP will become Robots.txt 2.0 &#8212; but I suspect elements of ACAP will
flow into that new version or a successor.</p>
]]></content:encoded>
			<wfw:commentRss>http://searchengineland.com/acap-launches-robotstxt-20-for-blocking-search-engines-12802/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>
