Google’s rolled out a new tool at Google Webmaster Central, a robots.txt generator. It’s designed to allow site owners to easily create a robots.txt file, one of the two main ways (along with the meta robots tag) to prevent search engines from indexing content. Robots.txt generators aren’t new. You can find many of them out there by searching. But this is the first time a major search engine has provided a generator tool of its own.
It’s nice to see the addition. Robots.txt files aren’t complicated to create. You can write them using a text editor such as Notepad with just a few simple commands. But they can still be scary or hard for some site owners to contemplate.
To access the tool, log in to your Google Webmaster Tools account, select one of your verified sites, then click on the Tools menu option on the left-hand side of the screen. You’ll see a "Generate robots.txt" link among the tool options. That’s what you want.
By default, the tool is designed to let you create a robots.txt file to allow all robots into your site. That’s kind of odd. By default, all robots will come into your site. If you want them, then there’s no need to have a robots.txt file at all. It’s like pinning a note to your chest reminding yourself to breathe. Promise, you’ll keep breathing even if you forget to look at the note.
Instead, you generally want to put up a robots.txt file to block crawling of some type. I may dig into this in a future article and examine when you might want to mix allow and disallow statements, but off the top of my head, there aren’t a lot of reasons to do so.
You can change the default option to "Block all robots" easily enough. Do that, and you get the standard and familiar two-line keep-out code: a "User-Agent: *" line followed by a "Disallow: /" line.
The first line — User-Agent — is how you tell particular spiders or robots to pay attention to the following instructions. Using the wildcard — * — says "hey ALL spiders, listen up."
The second line says what they can’t access. In this case, the / means to not spider anything within the web site. You know how pages within a web site all begin domain/something, as in website.com/page.html.
See that / between website.com and page.html? Technically, that slash is the start of the URL. So if you disallow all pages beginning with a slash, you’re blocking all pages within the entire site.
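If you want to convince yourself that those two lines really do shut the door on everything, here is a minimal sketch using Python's standard urllib.robotparser module. The helper name is_allowed and the crawler name SomeOtherBot are made up for illustration, and Python's parser is only a stand-in: it is not guaranteed to behave exactly like any particular search engine's crawler.

```python
from urllib.robotparser import RobotFileParser

# The two-line "keep out" file described above.
BLOCK_ALL = """\
User-Agent: *
Disallow: /
"""


def is_allowed(robots_txt, user_agent, url):
    """Hypothetical helper: parse robots.txt text and ask if a URL may be fetched."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # "SomeOtherBot" is an invented crawler name, used only as an example.
    for agent in ("Googlebot", "SomeOtherBot"):
        for url in ("http://website.com/", "http://website.com/page.html"):
            print(agent, url, is_allowed(BLOCK_ALL, agent, url))
```

Every URL on the site starts with that slash, so every check comes back as not allowed, whichever crawler asks.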
Let’s move on from our mini-robots.txt 101 course. Maybe you only want to block Google. Well, the tool is supposed to make this type of thing easy, but I was perplexed. Step 1 is to either allow or block ALL robots. Then in Step 2, you decide if you want to block specific robots. So which do you go with in Step 1, block all or none?
I figured you’d want to allow all robots, then believe the reassuring text next to that option that said "you can fine-tune this rule in the next step." The problem is, I couldn’t. If I tried to block Googlebot, the instructions didn’t change. If I tried to choose, say, Googlebot-Mobile, same thing.
Eventually, I figured it out. If you decide to block specific spiders, you have to choose the spider, then also specify what you want to block in the "Files or directories" box, such as a particular file or directory. So say I kept all print-only versions of stories in a directory called /print. I’d enter that directory, and the tool would generate a robots.txt file with two parts.
The first part tells spiders they can access the entire site. As I said, this is entirely unnecessary, but you get it anyway. The second part says that Googlebot cannot access the /print area.
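Here is a rough sketch of what such a two-part file looks like and how it behaves, again checked with Python's standard urllib.robotparser. The exact lines the tool emits are an assumption on my part (an allow-all section followed by a Googlebot section), and is_allowed, SomeOtherBot and the story.html file name are invented for the example.

```python
from urllib.robotparser import RobotFileParser

# Assumed shape of the generated file: an allow-all part, then a Googlebot part.
TWO_PART = """\
User-agent: *
Allow: /

User-agent: Googlebot
Disallow: /print
"""

# The same file with the unnecessary allow-all part removed.
GOOGLEBOT_ONLY = """\
User-agent: Googlebot
Disallow: /print
"""


def is_allowed(robots_txt, user_agent, url):
    """Hypothetical helper: parse robots.txt text and ask if a URL may be fetched."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    checks = [
        ("Googlebot", "http://website.com/print/story.html"),
        ("Googlebot", "http://website.com/page.html"),
        ("SomeOtherBot", "http://website.com/print/story.html"),
    ]
    for agent, url in checks:
        print(agent, url,
              is_allowed(TWO_PART, agent, url),
              is_allowed(GOOGLEBOT_ONLY, agent, url))
```

Both versions give the same answers in this sketch: Googlebot is kept out of /print, and everything else stays open, with or without the first part.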
The tool lets you craft specific rules for a preset list of Google’s own crawlers, such as Googlebot and Googlebot-Mobile.
I wish the names were accompanied by parentheses quickly explaining what each crawler does, and what blocking them will do, say, something like this:
- Googlebot-Mobile (allows or blocks content from Google mobile search)
Instead, you have to look through the various help files to understand what each does. Ironically, the older Analyze Robots.txt tool within Google Webmaster Tools DOES have these helpful explanations, so I expect they’ll migrate over.
You can also use the tool to enter a name for another crawler. The problem is, someone using this tool probably doesn’t know the crawler names out there that they want to block. I’d have given Google serious kudos points if they added some of the other major crawlers. But then again, if they had, no doubt someone would have accused them of trying to get people to block other search engines :)
Another thing that would have been nice was if people could have pasted full URLs into the box to have them converted. A site owner using this tool might not realize they need to drop the domain portion of a URL to block a particular page. It would help if you could paste in a full URL, domain and all, and have the tool automatically turn it into just the path that robots.txt expects.
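The conversion itself is simple. As a minimal sketch of the idea, and not anything the Google tool actually does, the standard urllib.parse module can strip the domain off a pasted URL. The function name to_robots_path and the sample URLs are made up for illustration.

```python
from urllib.parse import urlsplit


def to_robots_path(entry: str) -> str:
    """Hypothetical helper: reduce a pasted full URL to the path robots.txt needs.

    Anything that is not a full URL (no scheme or no domain) is returned as typed.
    """
    entry = entry.strip()
    parts = urlsplit(entry)
    if not parts.scheme or not parts.netloc:
        return entry  # already a path such as /print, so leave it alone
    path = parts.path or "/"  # a bare domain means the whole site
    if parts.query:
        path += "?" + parts.query
    return path


if __name__ == "__main__":
    # Sample inputs are invented for the example.
    for pasted in ("http://website.com/print/story.html", "/print", "http://website.com"):
        print(pasted, "->", "Disallow: " + to_robots_path(pasted))
```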
After you make your file, upload it to the root directory of your web site. If you don’t know what that is, find someone who does! This is important. Google allows for subdirectories of web sites to be registered within Google Webmaster Tools. However, robots.txt files do NOT work on a subdirectory basis. They have to go at the root level of a web site. If you don’t put them there, then you won’t be preventing access to any part of the site. Remember, after you upload to the root level, you can go back into Google Webmaster Tools and use that aforementioned analysis tool to see if it is really blocking the pages you want to keep out.
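To make the root-level point concrete, here is a small sketch that works out where the robots.txt file for any page has to live. The function name robots_txt_url and the sample page address are invented for the example; the only thing it relies on is that the file sits at the root of the site, never in a subdirectory.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url: str) -> str:
    """Hypothetical helper: return the root-level robots.txt address for a page."""
    parts = urlsplit(page_url)
    # Keep the scheme and domain, throw away the page's own path and query.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


if __name__ == "__main__":
    # Sample page address, invented for the example.
    print(robots_txt_url("http://website.com/blog/post.html"))
```

However deep the page sits in the site, the answer is always the same file at the top level.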
Overall, I’m glad to see the new tool, and I imagine it will become even more user-friendly over time.
In related news, Google says that the Web Crawl diagnostics area now has a new filter letting you see only web crawl errors related to sitemaps you’ve submitted. Also, there have been some UI tweaks to the iGoogle gadgets from Webmaster Central that were rolled out last month.