Irony: If Google Can’t Reach Your Robots.txt File, It Might Not List Your Site

I reported at the Search Engine Roundtable this morning that Google said if your robots.txt is unreachable, your site might not make it into the Google index. By unreachable, Google means that if your server simply times out and does not return any server response when Googlebot attempts to access your robots.txt file, then Google might not include any of your pages in its index. Googler John Mueller explained that Google tends to lean on the "safe" side when this situation pops up. When I showed this to Danny, he felt it was ironic that if Google can't read what you want to block, it might bl [...]
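A rough sketch of the conservative logic Mueller describes (the function and policy labels are my own, not Google's): a missing robots.txt is treated very differently from an unreachable one.

```python
def crawl_policy(robots_status):
    """Decide what a cautious crawler does after trying to fetch /robots.txt.

    robots_status is the HTTP status code returned for the file,
    or None if the request timed out with no response at all.
    """
    if robots_status is None or robots_status >= 500:
        # No answer (or a server error): the crawler can't tell what the
        # site wants blocked, so it plays it safe and crawls nothing.
        return "defer"
    if robots_status == 404:
        # A clear "there is no robots.txt" means nothing is blocked.
        return "allow-all"
    # The file was fetched successfully; obey whatever rules it contains.
    return "parse-and-obey"

print(crawl_policy(None))  # timeout -> defer
```

The key distinction: a 404 is an affirmative answer ("no restrictions"), while a timeout is no answer at all, which is why Google leans toward indexing nothing.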


Everything You Wanted To Know About Blocking Search Engines

Last week, the three major search engines came together to say how they agree -- and disagree -- over the Robots Exclusion Protocol. It's such an important standard, one every webmaster should understand. To help, Vanessa Fox has compiled an extensive and outstanding overview of it at Jane & Robot in her Managing Robot's Access To Your Website post. The tutorial takes you through key areas such as: A nice chart showing what you can block using either robots.txt or the meta robots tag for each major search engine. It also covers other things like reverse DNS lookup to verify a crawler's [...]
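One technique the tutorial covers, reverse DNS lookup to verify a crawler, can be sketched in Python. The two-step check (reverse lookup on the visiting IP, then a forward lookup to confirm it resolves back) is the standard approach; the hostname suffixes below use Googlebot as the example.

```python
import socket

def is_googlebot_host(hostname):
    # Googlebot crawls from hosts under googlebot.com or google.com.
    return hostname.endswith(".googlebot.com") or hostname.endswith(".google.com")

def verify_googlebot(ip):
    """Verify a visitor claiming to be Googlebot via reverse + forward DNS."""
    try:
        hostname, _, _ = socket.gethostbyaddr(ip)   # step 1: reverse lookup
    except socket.herror:
        return False
    if not is_googlebot_host(hostname):
        return False
    try:
        forward_ip = socket.gethostbyname(hostname)  # step 2: forward lookup
    except socket.gaierror:
        return False
    # The forward lookup must resolve back to the IP that visited you;
    # otherwise anyone could fake a googlebot.com reverse DNS entry.
    return forward_ip == ip
```

The forward lookup matters because a site owner controls the reverse DNS of their own IPs and could name a host anything they like; only the round trip proves the claim.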


Yahoo!, Google, Microsoft Clarify Robots.txt Support

Today, Google, Yahoo!, and Microsoft have come together to post details of how each of them supports robots.txt and the robots meta tag. While their posts use terms like "collaboration" and "working together," they haven't joined together to implement a new standard (as they did with sitemaps.org). Rather, they are simply jointly affirming that robots.txt is the standard way of blocking search engine robot access to web sites. They have identified a core set of robots.txt and robots meta tag directives that all three engines support: Google and Yahoo! already supported and doc [...]


Google Offers Robots.txt Generator

Google's rolled out a new tool at Google Webmaster Central, a robots.txt generator. It's designed to allow site owners to easily create a robots.txt file, one of the two main ways (along with the meta robots tag) to prevent search engines from indexing content. Robots.txt generators aren't new. You can find many of them out there by searching. But this is the first time a major search engine has provided a generator tool of its own. It's nice to see the addition. Robots.txt files aren't complicated to create. You can write them using a text editor such as Notepad with just a few simple comman [...]
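For example, a minimal hand-written robots.txt (the paths here are placeholders) needs only a couple of lines:

```
# Keep all crawlers out of a private directory,
# and keep one specific crawler out entirely.
User-agent: *
Disallow: /private/

User-agent: BadBot
Disallow: /
```

Save it as robots.txt in the root of the site and every well-behaved crawler will fetch and obey it before crawling anything else.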


SEOs Want The NOINDEX Tag To Not Show A Page In The Index

Matt Cutts of Google posted a blog entry asking SEOs how they want Google to handle the NOINDEX meta tag. If you use the NOINDEX meta tag now, Google won't show the page in any way in the Google index -- not even a "link only" listing. Matt asks SEOs if this is what they want and the poll currently shows us that yes, SEOs want it this way. Here are the current results, but the results may change over the course of the week: How should Google treat the NOINDEX meta tag? 240 say "Don't show a page at all." 24 say "Find some middle ground." 23 say "Show a link to the page." Google Explains [...]
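For reference, the tag under discussion is the standard robots meta tag with a NOINDEX value, placed in a page's head:

```
<meta name="robots" content="noindex">
```

With Google's current handling, a page carrying this tag is kept out of the index entirely, not even shown as a URL-only listing.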


Yahoo Search Weather Update & Support For X-Robots Tag

The Yahoo Blog issued a weather report for changes to rankings in Yahoo Search, along with news that they are now supporting the X-Robots-Tag directive -- a way to control indexing of content that cannot accept meta robots tags. Google also supports X-Robots-Tag, which gives webmasters the ability to define robots.txt-like rules within HTTP headers, as opposed to just meta tags within HTML pages. Yahoo provided a few examples of how it can work: X-Robots-Tag: NOINDEX -- If you don't want to show the URL in the Yahoo! Search results. Note: We'll still need to crawl the page to see and apply [...]
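A hypothetical sketch of building the header server-side (the helper name is my own). The point of X-Robots-Tag is that the directive travels in the HTTP response itself, so it works for PDFs, images and other non-HTML files that have no place to put a meta tag:

```python
def robots_header(noindex=False, nofollow=False, noarchive=False):
    """Build an X-Robots-Tag response header from the chosen directives."""
    directives = [name for name, enabled in
                  [("noindex", noindex), ("nofollow", nofollow),
                   ("noarchive", noarchive)] if enabled]
    # Return a header dict to merge into the response; empty if no
    # directives are set, so no header is emitted at all.
    return {"X-Robots-Tag": ", ".join(directives)} if directives else {}

# e.g. attach to a PDF response so it can be crawled but never listed:
print(robots_header(noindex=True))  # {'X-Robots-Tag': 'noindex'}
```

As Yahoo notes, the engine still has to crawl the URL to see the header before it can apply the directive.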


ACAP Launches, Robots.txt 2.0 For Blocking Search Engines?

After a year of discussions, ACAP -- Automated Content Access Protocol -- was released today as a sort of robots.txt 2.0 system for telling search engines what they can or can't include in their listings. However, none of the major search engines support ACAP, and its future remains firmly one of "watch and see." Below, more about the how and why of ACAP. Let's start with some history. ACAP got going in September 2006, backed by major European newspaper and publishing groups that in particular felt Google was using content without proper permissions and wanted a more flexible me [...]


Robots.txt Study Shows Webmasters Favor Google; BotSeer Robots.txt Search Engine Released

The Pennsylvania State University conducted a study that showed webmasters favored Google over other search engines in terms of allowing access to their web sites. An associated BotSeer search engine that allows searching across a collection of robots.txt files was also released. The study looked at which robots or crawlers were listed in a web site's robots.txt file, and Google was listed more often than any other search engine. The paper is named Determining Bias to Search Engines from Robots.txt (PDF) (it may be slow, so here is a local copy) and showed some interesting details. The mos [...]


How Proxy Hacking Can Hurt Your Rankings & What To Do About It

Google Proxy Hacking: How A Third Party Can Remove Your Site From Google SERPs by Dan Thies gives us a detailed look at the serious dangers of proxy hacking. Dan's article walks through the history of how he discovered the issue, then explains why the hack currently works in Google. He is urging the search engines to do something about the issue, but in the meantime he has provided, with the help of some friends, details on how to protect yourself. [...]


Google Enhances Webmaster Central’s Robots.txt Analysis Tool

The Google Webmaster Central Blog announced improvements they have made to the robots.txt analysis tool. The tool now recognizes all sitemap declarations and relative URLs, so it will report the validity of all sitemap URLs plus show data for relative URLs. In addition, Google has expanded the reporting: instead of stopping at the first problem encountered, as in the past, the tool now shows all problems encountered, on multiple lines and itemized by line number. [...]


Google’s “Unavailable After” META Tag Now Live

Google's Dan Crow announced today that the unavailable_after META tag is now live and operational. Google To Add "Unavailable After" META Tag from about two weeks ago, explains in detail more about this tag and how it can be used. [...]


More Info On Google’s Unavailable After Meta Tag & New X-Robots-Tag In Header Support

Last week we reported that Google was to add an "Unavailable After" META Tag. Since then, we've spoken to Dan Crow of Google, who provided more information on how to use it, as well as information on a new way to send robots blocking info within HTTP headers. The "unavailable_after" Meta tag will allow you to tell Google that a page should expire from the search results at a specific time. For example, if you have a page that you would like to be removed from the search results at 6pm EST on July 23, 2007, you would add the following Meta tag: <META NAME="GOOGLEBOT" CONTENT="unavailable_aft [...]
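Based on the article's example (6pm EST on July 23, 2007), the tag's value takes an RFC 850-style date. A small helper (hypothetical, my own naming) makes the formatting concrete:

```python
from datetime import datetime

def unavailable_after_tag(expiry, tz_label):
    """Render Google's unavailable_after meta tag for a given expiry time.

    Formats the date in the RFC 850 style used in Google's examples,
    e.g. "23-Jul-2007 18:00:00 EST". The timezone label is passed
    separately since datetime objects here are naive.
    """
    stamp = expiry.strftime("%d-%b-%Y %H:%M:%S") + " " + tz_label
    return ('<META NAME="GOOGLEBOT" '
            'CONTENT="unavailable_after: %s">' % stamp)

print(unavailable_after_tag(datetime(2007, 7, 23, 18, 0, 0), "EST"))
```

After the stated time passes, Google drops the page from its results, without you having to remember to add a noindex tag later.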


Google To Add “Unavailable After” META Tag

Getting Into Google by Jill Whalen reports Dan Crow, director of crawl systems at Google, saying that Google is releasing a new META tag named "unavailable_after." The "unavailable_after" tag will allow you to tell Google when Googlebot should no longer crawl that page. Jill explains that this tag comes in handy when you have a promotional page for a limited-time offer that expires on a specific date. By using the "unavailable_after" tag, you can tell Google not to crawl the page after the promotion expires. There are several practical scenarios fo [...]


Search Illustrated: Blocking Search Engines With Robots.txt

While most of the time we want search engine crawlers to grab and index as much content from our web sites as possible, there are situations where we want to prevent crawlers from accessing certain pages or parts of a web site. For example, you don't want crawlers poking around on non-public parts of your web site. Nor do you want them trying to index scripts, utilities or other types of code. And finally, you may have duplicate content on your web site, and want to ensure that a crawler only gets one copy (the "canonical" version, in search engine parlance). Today's Search Illustrated i [...]
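The scenarios above map directly onto robots.txt rules like these (the paths are illustrative, not prescriptive):

```
User-agent: *
# Non-public areas of the site
Disallow: /admin/
# Scripts, utilities and other code
Disallow: /cgi-bin/
# Printer-friendly duplicates; the canonical pages stay crawlable
Disallow: /print/
```

Each Disallow line blocks a path prefix for the crawlers matched by the preceding User-agent line.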


Belgian Papers Back In Google; Begin Using Standards For Blocking

Belgian newspapers that sued Google to be removed from its index are now back in, having agreed to use the commonly-accepted blocking standards that they initially rejected as not being legal. Google and the group representing the papers, Copiepresse, have issued a joint statement. That's below, along with a look at how this is a victory for Google, which has had to settle a series of similar lawsuits through agreements. Let's start with the joint statement: Internet users interested in Belgian news and users of Google’s search engine may have noticed today that the websites of the Belg [...]


Yahoo Supports New Robots-Nocontent Tag To Block Indexing Within A Page

For over a decade, search engines have supported standards allowing you to prevent pages from being spidered or included within a search index. Yahoo now supports a new twist -- a way to flag that part of your page shouldn't be included in an index. It's called the robots-nocontent tag. Many search marketers have long struggled with the problem that the "core" content of a web page -- the main body copy or article -- can often seem drowned out from a text analytics perspective by all the clutter around the content. That clutter is often ads, navigational links, cross promot [...]
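Yahoo implemented robots-nocontent as a class value on ordinary HTML elements, so a page might mark its clutter like this (the markup is a sketch of the idea):

```
<div class="robots-nocontent">
  <!-- navigation, ads and other clutter Yahoo should
       disregard when indexing this page -->
</div>
<p>The main article copy stays fully indexable.</p>
```

Unlike the meta robots tag, which applies to the whole page, this lets you carve out just the sections that shouldn't count toward what the page is "about."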


From The Isn’t It Ironic Dept: Google Product Search’s Results Show Up In Google

Remember how Google said recently that it might crack down on listings pages that are simply search results themselves? Reader Michael Nguyen dropped an email today to point out how, ironically, Google is now listing pages from its own Google Product Search service exactly as it has warned others not to do. OK, settle down back there, those of you having a chuckle. Embarrassing? Yes! Intentional? Almost certainly not. Let's take a look. Try a search for snake light, and you'll get this: See down there at the bottom? Two pages from Google Product Search showing up in the top results: I [...]


How Search Engines Handle The Nofollow Attribute

Loren Baker at Search Engine Journal has a nice write-up on how the search engines handle the nofollow attribute, now just over two years since it was introduced. Ask.com still does not honor the tag, so here are the takeaways for Google and Yahoo: Google won't follow the link, while Yahoo will (see note below); neither Google nor Yahoo will pass link popularity for that specific link. Google would hope that Wikipedia would not take such an "absolute approach" on the nofollow link attribute being applied so widely. Note From Danny: Google WILL follow nofollow links in the sense that if someone else links to a page [...]
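For reference, the attribute in question is a rel value on individual links rather than a page-wide directive:

```
<!-- The engines discount this link when computing link popularity -->
<a href="http://example.com/" rel="nofollow">a link I don't vouch for</a>
```

Because it is per-link, a page can freely mix followed editorial links with nofollowed user-submitted ones.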


Google Releases Improved Content Removal Tools

Google has rolled out new tools to help people quickly get content removed from its search engine. Those targeted at site owners allow for speedy removal of pages and cached copies of pages. Other tools allow people to request removal of images, or of links to pages containing personal information about them, in the right circumstances. More on the tools and various options are covered below. Site Owner Removal Options For site owners, the best way to keep content out of Google is by using the robots.txt or meta robots tag options. Either option can prevent pages from getting into Google or ge [...]


Up Close & Personal With Robots.txt

The Robots.txt Summit at Search Engine Strategies New York 2007 was the latest in a series of special sessions intended to open a dialog between search engine representatives and web site publishers. Past summits featured discussion of comment spam on blogs, indexing issues and redirects. This latest summit was devoted to the humble but terribly important robots.txt file. Danny Sullivan moderated, with panelists Keith Hogan, Director of Program Management, Search Technology, Ask.com, Sean Suchter, Director of Yahoo Search Technology, Yahoo Search, Dan Crow, Product [...]

