Mar. 27, 2008 at 5:39pm Eastern by Danny Sullivan

Google Offers Robots.txt Generator

Google's rolled out a new tool at Google Webmaster Central, a robots.txt generator. It's designed to allow site owners to easily create a robots.txt file, one of the two main ways (along with the meta robots tag) to prevent search engines from indexing content. Robots.txt generators aren't new. You can find many of them out there by searching. But this is the first time a major search engine has provided a generator tool of its own.

It's nice to see the addition. Robots.txt files aren't complicated to create. You can write them in a text editor such as Notepad with just a few simple commands. But they can still be scary or hard for some site owners to contemplate.

To access the tool, log in to your Google Webmaster Tools account, select one of your verified sites, then click on the Tools menu option on the left-hand side of the screen. You'll see a "Generate robots.txt" link among the tool options. That's what you want.

By default, the tool is designed to let you create a robots.txt file to allow all robots into your site. That's kind of odd. By default, all robots will come into your site. If you want them, then there's no need to have a robots.txt file at all. It's like pinning a note to your chest reminding yourself to breathe. Promise, you'll keep breathing even if you forget to look at the note.
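If you'd like to check that claim rather than take my word for it, here's a minimal sketch using Python's standard urllib.robotparser module. It's only an illustration, not anything the Google tool produces. The crawler name SomeBot and the website.com URL are placeholders, the allow-all file is assumed to be the simple two-line form, and an empty rule set stands in for having no robots.txt at all.

```python
from urllib.robotparser import RobotFileParser

def can_crawl(robots_txt: str, agent: str, url: str) -> bool:
    """Parse robots.txt text and report whether `agent` may fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)

# An allow-all file (assumed two-line form).
ALLOW_ALL = "User-Agent: *\nAllow: /\n"
# No rules at all, standing in for a site without a robots.txt file.
NO_RULES = ""

url = "http://website.com/page.html"  # placeholder URL
print(can_crawl(ALLOW_ALL, "SomeBot", url))
print(can_crawl(NO_RULES, "SomeBot", url))
```

Both calls should report that crawling is allowed, which is the point: the allow-all file changes nothing.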

Instead, you generally want to put up a robots.txt file to block crawling of some type. I may dig into this in a future article to examine when you might want to mix allow and disallow statements, but off the top of my head, there aren't a lot of reasons to do so.

You can change the default option to "Block all robots" easily enough. Do that, and you get the standard and familiar two-line "keep out" code:

User-Agent: *
Disallow: /

The first line -- User-Agent -- is how you tell particular spiders or robots to pay attention to the following instructions. Using the wildcard -- * -- says "hey ALL spiders, listen up."

The second line says what they can't access. In this case, the / means to not spider anything within the web site. You know how pages within a web site all begin domain/something, like this:

http://website.com/page.html

See that / between website.com and page.html? Technically, that slash is where the path portion of the URL begins, and the path is what robots.txt rules are matched against. So if you disallow all pages beginning with a slash, you're blocking all pages within the entire site.
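To make the "everything begins with a slash" idea concrete, here's a rough sketch in Python. The is_blocked helper is hypothetical and deliberately simplified: it does a plain prefix match on the URL's path and is not how any particular crawler implements its rules.

```python
from urllib.parse import urlsplit

def is_blocked(url: str, disallow_prefix: str) -> bool:
    """Simplified check: blocked if the URL's path starts with the rule."""
    path = urlsplit(url).path or "/"  # a bare domain still has the path "/"
    return path.startswith(disallow_prefix)

page = "http://website.com/page.html"  # placeholder URL

# The path is the part after the domain, starting at the slash.
print(urlsplit(page).path)

# "Disallow: /" is a prefix of every path, so every page matches it.
print(is_blocked(page, "/"))

# A narrower rule such as "Disallow: /print" leaves this page alone.
print(is_blocked(page, "/print"))
```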

Let's move on from our mini-robots.txt 101 course. Maybe you only want to block Google. Well, the tool is supposed to make this type of thing easy, but I was perplexed. Step 1 is to either allow or block ALL robots. Then in Step 2, you decide if you want to block specific robots. So which do you go with in Step 1, block all or none?

I figured you'd want to allow all robots, then believe the reassuring text next to that option that said "you can fine-tune this rule in the next step." The problem is, I couldn't. If I tried to block Googlebot, the instructions didn't change. If I tried to choose, say, Googlebot-Mobile, same thing.

Eventually, I figured it out. If you decide to block specific spiders, you have to choose the spider, then specify also what you want to block in the "Files or directories" box, such as a particular file or directory. So say I kept all print-only versions of stories in a directory called /print. I'd enter that directory to get this:

User-Agent: *
Allow: /

User-Agent: Googlebot
Disallow: /print
Allow: /

The first part tells spiders they can access the entire site. As I said, this is entirely unnecessary, but you get it anyway. The second part says that Googlebot cannot access the /print area.
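You can sanity-check a file like this before uploading it. Here's a minimal sketch using Python's standard urllib.robotparser module. The URLs and the SomeOtherBot name are placeholders, and keep in mind that Python's parser applies the first rule that matches, which isn't necessarily how every crawler resolves overlapping rules. For this file, where the Disallow line comes first, that makes no difference.

```python
from urllib.robotparser import RobotFileParser

# The file as generated in the example above.
GENERATED = """\
User-Agent: *
Allow: /

User-Agent: Googlebot
Disallow: /print
Allow: /
"""

parser = RobotFileParser()
parser.parse(GENERATED.splitlines())

checks = [
    ("Googlebot", "http://website.com/print/story.html"),
    ("Googlebot", "http://website.com/page.html"),
    ("SomeOtherBot", "http://website.com/print/story.html"),
]
for agent, url in checks:
    verdict = "allowed" if parser.can_fetch(agent, url) else "blocked"
    print(f"{agent:13} {url} -> {verdict}")
```

Only the first check should come back blocked: Googlebot is kept out of /print, while everything else stays open to everyone.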

The tool lets you craft specific rules for these particular Google crawlers:

  • Googlebot
  • Googlebot-Mobile
  • Googlebot-Image
  • Mediapartners-Google
  • Adsbot-Google

I wish the names were accompanied by parentheses quickly explaining what each crawler does and what blocking it will do, something like this:

  • Googlebot-Mobile (allows or blocks content from Google mobile search)

Instead, you have to look through the various help files to understand what each does. Ironically, the older Analyze Robots.txt tool within Google Webmaster Tools DOES have these helpful explanations, so I expect they'll migrate over.

You can also use the tool to enter a name for another crawler. The problem is, someone using this tool probably doesn't know the crawler names out there that they want to block. I'd have given Google serious kudos points if they added some of the other major crawlers. But then again, if they had, no doubt someone would have accused them of trying to get people to block other search engines :)

Another thing that would have been nice is if people could paste full URLs into the box and have them converted. A site owner using this tool might not realize they need to drop the domain portion of a URL to block a particular page. It would help if you could paste something like this:

http://website.com/page-i-want-to-block.html

And have the tool automatically turn it into this:

User-Agent: *
Disallow: /page-i-want-to-block.html
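That kind of conversion is easy to sketch. The url_to_disallow function below is a hypothetical helper written in Python for illustration, not a feature of the Google tool. It assumes you want to block exactly one page for the given user agent and simply drops the scheme and domain.

```python
from urllib.parse import urlsplit

def url_to_disallow(url: str, agent: str = "*") -> str:
    """Turn a full URL into a robots.txt rule blocking that one path."""
    parts = urlsplit(url)
    path = parts.path or "/"  # a bare domain means the whole site
    if parts.query:
        path += "?" + parts.query  # keep any query string as part of the rule
    return f"User-Agent: {agent}\nDisallow: {path}"

print(url_to_disallow("http://website.com/page-i-want-to-block.html"))
```

Run as is, this prints the same two lines shown just above.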

After you make your file, upload it to the root directory of your web site. If you don't know what that is, find someone who does! This is important. Google allows for subdirectories of web sites to be registered within Google Webmaster Tools. However, robots.txt files do NOT work on a subdirectory basis. They have to go at the root level of a web site. If you don't put them there, then you won't be preventing access to any part of the site. Remember, after you upload to the root level, you can go back into Google Webmaster Tools and use that aforementioned analysis tool to see if it is really blocking the pages you want to keep out.
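If you're unsure where "the root" is, one way to think about it is that the file's address is always the domain plus /robots.txt, no matter which page or subdirectory you start from. Here's a small Python sketch of that idea. The robots_txt_url helper and the example URL are my own placeholders.

```python
from urllib.parse import urlsplit, urlunsplit

def robots_txt_url(page_url: str) -> str:
    """Return the root-level robots.txt address for the site hosting page_url."""
    parts = urlsplit(page_url)
    # Keep only the scheme and domain; the path is always /robots.txt.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))

print(robots_txt_url("http://website.com/blog/2008/post.html"))
```

For the placeholder URL above, that comes out as http://website.com/robots.txt. A copy of the file sitting at /blog/robots.txt would not be at the address this produces.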

Overall, I'm glad to see the new tool, and I imagine it will improve over time and become more user-friendly.

In related news, Google says that the Web Crawl diagnostics area now has a new filter letting you see only web crawl errors related to sitemaps you've submitted. Also, there have been some UI tweaks to the iGoogle gadgets from Webmaster Central that were rolled out last month.

For more about Google's webmaster tools, be sure to check out the quick start guide they offer and see our Google Webmaster Central archives.
