Head-To-Head: ACAP Versus Robots.txt For Controlling Search Engines

In the battle between search engines and some mainstream news publishers, ACAP has been lurking for several years. ACAP -- the Automated Content Access Protocol -- has consistently been positioned by some news executives as a cornerstone of reestablishing the control they feel they have lost over their content. The reality, however, is that publishers have more control even without ACAP than is commonly believed. In addition, ACAP currently provides no "DRM" or licensing mechanisms for news content. But the system does offer some ideas well worth considering. Below, a look at how it [...]


A Deeper Look At Robots.txt

The Robots Exclusion Protocol (REP) is not a complicated protocol, and its uses are fairly limited, so SEOs usually give it short shrift. Yet there's a lot more to it than you might think. Robots.txt has been with us for over 14 years, but how many of us knew that in addition to the disallow directive, there's a noindex directive that Googlebot obeys? That noindexed pages don't end up in the index, while disallowed pages do, and the latter can show up in the search results (albeit with less information, since the spiders can't see the page content)? That disallowed page [...]
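To illustrate the distinction, here is a minimal sketch of a robots.txt file using both directives (the paths are hypothetical, and the Noindex line was an unofficial directive that only Googlebot honored):

    User-agent: Googlebot
    # Blocks crawling, but the URL can still surface in results
    # as a bare "link only" listing.
    Disallow: /private/
    # Unofficial, Googlebot-only: keeps matching pages out of
    # the index entirely.
    Noindex: /drafts/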


Google’s Advice On Using The New Canonical Tag

A month ago, Google, Yahoo and Microsoft announced that they will be supporting a new canonical tag, which allows you to tell search engines that page X is a duplicate of page Z. In a way, it acts like a 301 redirect without the physical redirect. The tag is incredibly powerful, as are 301 redirects, so it should be used cautiously and gradually. Matt Cutts posted a new video explaining how one should go about using this tag, given that it is so new. Here is the video: [...]
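For reference, the tag goes in the <head> of the duplicate page and points at the preferred URL (the URL below is a placeholder):

    <link rel="canonical" href="http://www.example.com/page-z" />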


Live Search Testing New Crawler; MSNBot/2.0b

The Live Search Blog announced they are letting a new robot loose. The new search engine crawler is named msnbot/2.0b and joins the existing army of MSN spiders, named msnbot/1.1. The new spider is currently being tested but will ultimately replace the old one. It will respect the robots.txt rules already set up for MSNBot, so there is no need to add anything new to your robots.txt file. In addition, Microsoft promised to crawl slowly during the msnbot/2.0b tests. MSNBot/1.1 is not that old; it was added back in February of this year and introduced HTTP [...]
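In other words, an existing record like this one (the path is hypothetical) covers both crawler versions:

    User-agent: msnbot
    # Both msnbot/1.1 and msnbot/2.0b match this user-agent record.
    Disallow: /private/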


Irony: If Google Can’t Reach Your Robots.txt File, It Might Not List Your Site

I reported at the Search Engine Roundtable this morning that Google said if your robots.txt file is unreachable, your site might not make it into the Google index. By unreachable, Google means that if your server simply times out and does not return any server response when Googlebot attempts to access your robots.txt file, then Google might not include any of your pages in its index. Googler John Mueller explained that Google tends to err on the "safe" side when this situation pops up. When I showed this to Danny, he felt it was ironic that if Google can't read what you want to block, it might bl [...]


Everything You Wanted To Know About Blocking Search Engines

Last week, the three major search engines came together to say how they agree -- and disagree -- over the Robots Exclusion Protocol. It's an important standard, one every webmaster should understand. To help, Vanessa Fox has compiled an extensive and outstanding overview of it at Jane & Robot in her Managing Robot's Access To Your Website post. The tutorial takes you through key areas, such as a handy chart showing what you can block using either robots.txt or the meta robots tag for each major search engine. It also covers other topics, like doing a reverse DNS lookup to verify a crawler's [...]
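A rough sketch of that reverse-DNS verification in Python (the IP address is a hypothetical value pulled from your access logs; genuine Googlebot hosts resolve under googlebot.com):

    import socket

    claimed_ip = "66.249.66.1"  # hypothetical Googlebot IP from your logs

    # Step 1: reverse lookup. A real Googlebot IP has a PTR record
    # pointing at a googlebot.com (or google.com) hostname.
    hostname, _, _ = socket.gethostbyaddr(claimed_ip)

    # Step 2: forward-confirm. The hostname must resolve back to the
    # same IP, otherwise the PTR record could be spoofed.
    _, _, forward_ips = socket.gethostbyname_ex(hostname)

    is_genuine = (hostname.endswith((".googlebot.com", ".google.com"))
                  and claimed_ip in forward_ips)
    print(hostname, is_genuine)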


Yahoo!, Google, Microsoft Clarify Robots.txt Support

Today, Google, Yahoo!, and Microsoft came together to post details of how each of them supports robots.txt and the robots meta tag. While their posts use terms like "collaboration" and "working together," they haven't joined together to implement a new standard (as they did with sitemaps.org). Rather, they are simply making a joint stand in messaging that robots.txt is the standard way of blocking search engine robot access to web sites. They have identified a core set of robots.txt and robots meta tag directives that all three engines support: Google and Yahoo! already supported and doc [...]
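As a sketch of that common core (my example, not theirs; paths and the sitemap URL are placeholders), a robots.txt file relying only on jointly supported features might look like this:

    User-agent: *
    Allow: /public/
    Disallow: /private/
    # '*' and '$' wildcard matching is part of the shared feature set.
    Disallow: /*.pdf$
    Sitemap: http://www.example.com/sitemap.xml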


Google Offers Robots.txt Generator

Google's rolled out a new tool at Google Webmaster Central, a robots.txt generator. It's designed to allow site owners to easily create a robots.txt file, one of the two main ways (along with the meta robots tag) to prevent search engines from indexing content. Robots.txt generators aren't new; you can find many of them by searching. But this is the first time a major search engine has provided a generator tool of its own, and it's nice to see the addition. Robots.txt files aren't complicated to create. You can write them using a text editor such as Notepad with just a few simple commands [...]
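For instance, a minimal hand-written file that keeps all crawlers out of two directories (the directory names are placeholders) needs only a few lines:

    User-agent: *
    Disallow: /cgi-bin/
    Disallow: /tmp/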


SEOs Want The NOINDEX Tag To Not Show A Page In The Index

Matt Cutts of Google posted a blog entry asking SEOs how they want Google to handle the NOINDEX meta tag. If you use the NOINDEX meta tag now, Google won't show the page in any way in the Google index -- not even a "link only" listing. Matt asks SEOs if this is what they want, and the poll currently shows that yes, SEOs want it this way. Here are the current results, though they may change over the course of the week:

How should Google treat the NOINDEX meta tag?
- 240 say "Don't show a page at all."
- 24 say "Find some middle ground."
- 23 say "Show a link to the page."

Google Explains [...]
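For reference, the tag in question sits in the page's <head>:

    <meta name="robots" content="noindex">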


Yahoo Search Weather Update & Support For X-Robots Tag

The Yahoo Blog issued a weather report for changes to rankings in Yahoo Search, along with news that they are now supporting the X-Robots-Tag directive -- a way to control indexing of content that cannot accept meta robots tags. Google also supports X-Robots-Tag, which gives webmasters the ability to define robots.txt-like rules within HTTP headers, as opposed to just the META data within HTML pages. Yahoo provided a few examples of how it can work: X-Robots-Tag: NOINDEX -- if you don't want to show the URL in the Yahoo! Search results. Note: We'll still need to crawl the page to see and apply [...]
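Because the directive travels in the HTTP response headers, it works for non-HTML content such as PDFs, where a meta tag has nowhere to live. A sketch of a response carrying it (the status line and content type are illustrative):

    HTTP/1.1 200 OK
    Content-Type: application/pdf
    X-Robots-Tag: noindex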


ACAP Launches, Robots.txt 2.0 For Blocking Search Engines?

After a year of discussions, ACAP -- the Automated Content Access Protocol -- was released today as a sort of robots.txt 2.0 system for telling search engines what they can or can't include in their listings. However, none of the major search engines support ACAP, and its future remains firmly one of "wait and see." Below, more about the how and why of ACAP. Let's start with some history. ACAP got going in September 2006, backed by major European newspaper and publishing groups that felt, in particular, that Google was using content without proper permissions and wanted a more flexible me [...]


Robots.txt Study Shows Webmasters Favor Google; BotSeer Robots.txt Search Engine Released

The Pennsylvania State University conducted a study that showed webmasters favored Google over other search engines in terms of allowing access to their web sites. An associated BotSeer search engine that allows searching across a collection of robots.txt files was also released. The study looked at which robots or crawlers were listed in a web site's robots.txt file, and Google was listed more often than any other search engine. The paper is named Determining Bias to Search Engines from Robots.txt (PDF) (it may be slow, so here is a local copy) and showed some interesting details. The mos [...]


How Proxy Hacking Can Hurt Your Rankings & What To Do About It

Google Proxy Hacking: How A Third Party Can Remove Your Site From Google SERPs by Dan Thies gives us a detailed look at the serious dangers of proxy hacking. Dan recounts how he discovered the issue, then explains why the attack currently works against Google. He is eager to see the search engines do something about the problem, but in the meantime he provides details on how to help protect yourself, with the help of some friends. [...]


Google Enhances Webmaster Central’s Robots.txt Analysis Tool

The Google Webmaster Central Blog announced improvements to the robots.txt analysis tool. The tool now recognizes sitemap declarations and relative URLs, so it will report on the validity of all sitemap URLs and show data for relative URLs. In addition, Google has expanded the reporting beyond just the first problem encountered, as it did in the past; the tool now shows all problems encountered, on multiple lines and itemized by line number. [...]
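A sitemap declaration is simply a line in robots.txt pointing at your sitemap file, which the tool now validates (the URL is a placeholder):

    Sitemap: http://www.example.com/sitemap.xml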


Google’s “Unavailable After” META Tag Now Live

Google's Dan Crow announced today that the unavailable_after META tag is now live and operational. Our post Google To Add "Unavailable After" META Tag, from about two weeks ago, explains in more detail what this tag is and how it can be used. [...]


More Info On Google’s Unavailable After Meta Tag & New X-Robots-Tag In Header Support

Last week we reported that Google was to add an "Unavailable After" META Tag. Since then, we've spoken to Dan Crow of Google, who provided more information on how to use it, as well as information on a new way to send robots blocking instructions within HTTP headers. The "unavailable_after" Meta tag will allow you to tell Google that a page should expire from the search results at a specific time. For example, if you have a page that you would like removed from the search results at 6pm EST on July 23, 2007, you would add the following Meta tag: <META NAME="GOOGLEBOT" CONTENT="unavailable_aft [...]
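Based on the date format Google described, the completed tag would look something like this (the exact syntax is reconstructed here, so treat it as illustrative):

    <META NAME="GOOGLEBOT" CONTENT="unavailable_after: 23-Jul-2007 18:00:00 EST">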


Google To Add “Unavailable After” META Tag

Getting Into Google by Jill Whalen reports that Dan Crow, director of crawl systems at Google, says Google is releasing a new META tag named "unavailable_after." The tag will allow you to tell Google when Googlebot should no longer crawl a page. Jill explains that this comes in handy when you have a limited-time promotional page where the promotion expires on a specific date. By using the "unavailable_after" tag, you can tell Google not to crawl the page after the promotion expires. There are several practical scenarios fo [...]


Search Illustrated: Blocking Search Engines With Robots.txt

While most of the time we want search engine crawlers to grab and index as much content from our web sites as possible, there are situations where we want to prevent crawlers from accessing certain pages or parts of a web site. For example, you don't want crawlers poking around on non-public parts of your web site. Nor do you want them trying to index scripts, utilities or other types of code. And finally, you may have duplicate content on your web site, and want to ensure that a crawler only gets one copy (the "canonical" version, in search engine parlance). Today's Search Illustrated i [...]
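A sketch of the kinds of rules the column describes (all paths are placeholders):

    User-agent: *
    # Keep crawlers out of non-public areas and script directories.
    Disallow: /admin/
    Disallow: /cgi-bin/
    # Steer crawlers away from a duplicate of the canonical page.
    Disallow: /print-version/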


Belgian Papers Back In Google; Begin Using Standards For Blocking

Belgian newspapers that sued Google to be removed from its index are now back in, having agreed to use the commonly-accepted blocking standards that they initially rejected as not being legal. Google and the group representing the papers, Copiepresse, have issued a joint statement. That's below, along with a look at how this is a victory for Google, which has had to settle a series of similar lawsuits through agreements. Let's start with the joint statement: Internet users interested in Belgian news and users of Google’s search engine may have noticed today that the websites of the Belg [...]


Yahoo Supports New Robots-Nocontent Tag To Block Indexing Within A Page

For over a decade, search engines have supported standards allowing you to prevent pages from being spidered or included within a search index. Today, Yahoo added a new twist -- a way to flag that part of your page shouldn't be included in an index. It's called the robots-nocontent tag. Many search marketers have long struggled with the problem that the "core" content of a web page -- the main body copy or article -- can often seem drowned out, from a text analytics perspective, by all the clutter around it. That clutter is often ads, navigational links, cross promot [...]
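Despite the "tag" name, Yahoo implemented this as a class value applied to existing HTML elements. Wrapping the clutter like this (the markup is illustrative) tells Yahoo's indexer to discount it:

    <div class="robots-nocontent">
      <!-- navigation, ads and other boilerplate Yahoo should
           ignore when weighing this page's content -->
    </div>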

