Google Offers Robots.txt Generator
Google has rolled out a new tool at Google
Webmaster Central, a robots.txt generator. It’s designed to allow site
owners to easily create a robots.txt file, one of the two main ways (along with
the meta robots tag)
to prevent search engines from indexing content. Robots.txt generators aren’t
new. You can find many of them out there by searching. But this is the first
time a major search engine has provided a generator tool of its own.
It’s nice to see the addition. Robots.txt files aren’t complicated to create.
You can write them using a text editor such as Notepad with just a few simple
commands. But they can still be scary or hard for some site owners.
To access the tool, log in to your
Google Webmaster Tools
account, then click on the Tools menu option on the left-hand side of the screen
after you select one of your verified sites. You’ll see a "Generate robots.txt"
link among the tool options. That’s what you want.
By default, the tool is designed to let you create a robots.txt file to allow
all robots into your site. That’s kind of odd. By default, all robots will come
into your site. If you want them, then there’s no need to have a robots.txt file
at all. It’s like pinning a note to your chest reminding yourself to breathe.
Promise, you’ll keep breathing even if you forget to look at the note.
Instead, you generally want to put up a robots.txt file to block crawling of
some type. I may use a future article to examine when you might want to mix
allow and disallow statements, but off the top of my head, there aren’t a lot of reasons
to do so.
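To see why an allow-all file is redundant, here is a minimal sketch using Python's standard `urllib.robotparser`. The crawler name `ExampleBot`, the `website.com` URL, and the exact form of the allow-all file are assumptions for illustration, and a real crawler's parser may differ in details.

```python
from urllib.robotparser import RobotFileParser


def can_fetch(robots_txt: str, agent: str, url: str) -> bool:
    """Parse robots.txt text and report whether `agent` may fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


# An explicit allow-all file (assumed form of the tool's default output).
ALLOW_ALL = "User-agent: *\nAllow: /\n"
# An empty rule set, standing in for a site with no rules at all.
NO_RULES = ""

url = "http://website.com/page.html"
print(can_fetch(ALLOW_ALL, "ExampleBot", url))
print(can_fetch(NO_RULES, "ExampleBot", url))
```

Both checks give the same answer, which is the point: the allow-all file adds nothing a crawler would not already assume.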
You can change the default option to "Block all robots" easily enough. Do
that, and you get the standard and familiar two-line keep-out code:
User-Agent: *
Disallow: /
The first line — User-Agent — is how you tell particular spiders or robots
to pay attention to the following instructions. Using the wildcard — * — says
"hey ALL spiders, listen up."
The second line says what they can’t access. In this case, the / means to not
spider anything within the web site. You know how pages within a web site all
begin domain/something, like this:
website.com/page.html
See that / between website.com and page.html? Technically, that slash is the
start of the URL. So if you disallow all pages beginning with a slash, you’re
blocking all pages within the entire site.
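The effect of the two rules described above can be checked with Python's standard `urllib.robotparser`. This is only a sketch: `ExampleBot` and the sample paths are made-up placeholders, and individual crawlers may parse files slightly differently.

```python
from urllib.robotparser import RobotFileParser

# The two-line keep-out file: every robot, every path.
BLOCK_ALL = """\
User-agent: *
Disallow: /
"""

parser = RobotFileParser()
parser.parse(BLOCK_ALL.splitlines())

# Every path on the site starts with "/", so every URL matches the rule.
for path in ("/", "/page.html", "/print/story.html"):
    url = "http://website.com" + path
    print(path, parser.can_fetch("ExampleBot", url))
```

Each path is reported as not fetchable, because the rule is matched as a prefix and "/" is a prefix of everything.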
Let’s move on from our mini-robots.txt 101 course. Maybe you only want to
block Google. Well, the tool is supposed to make this type of thing easy, but I
was perplexed. Step 1 is to either allow or block ALL robots. Then in Step 2,
you decide if you want to block specific robots. So which do you go with in Step
1, block all or none?
I figured you’d want to allow all robots, then believe the reassuring text
next to that option that said "you can fine-tune this rule in the next step."
The problem is, I couldn’t. If I tried to block Googlebot, the instructions
didn’t change. If I tried to choose, say, Googlebot-Mobile, same thing.
Eventually, I figured it out. If you decide to block specific spiders, you
have to choose the spider, then specify also what you want to block in the
"Files or directories" box, such as a particular file or directory. So say I
kept all print-only versions of stories in a directory called /print. I’d enter
that directory to get this:
The first part tells spiders they can access the entire site. As I said, this
is entirely unnecessary, but you get it anyway. The second part says that
Googlebot cannot access the /print area.
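The generated file itself is not reproduced here. Assuming it looked roughly like the text in the sketch below (an allow-all section followed by a Googlebot section), its behavior can be checked with Python's standard `urllib.robotparser`. The file contents, `ExampleBot`, and the sample paths are assumptions for illustration.

```python
from urllib.robotparser import RobotFileParser

# Assumed reconstruction of the generated file: allow everyone everything,
# then keep Googlebot out of /print.
ROBOTS_TXT = """\
User-agent: *
Allow: /

User-agent: Googlebot
Disallow: /print
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(agent: str, path: str) -> bool:
    """Check a site-relative path for the given crawler name."""
    return parser.can_fetch(agent, "http://website.com" + path)


print(allowed("Googlebot", "/print/story.html"))   # Googlebot section applies
print(allowed("Googlebot", "/story.html"))         # not under /print
print(allowed("ExampleBot", "/print/story.html"))  # falls back to the * section
```

One thing to watch: the rule is a prefix match, so `/print` also covers a page such as `/printer.html`. Writing the rule as `/print/` would limit it to the directory.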
The tool lets you craft specific rules for these particular Google crawlers:
I wish the names were accompanied by parentheses briefly explaining what each
crawler does, and what blocking it will do, something like this:
- Googlebot-Mobile (allows or blocks content from Google mobile search)
Instead, you have to look through the various
help files to understand what each does. Ironically, the
older Analyze Robots.txt
tool within Google Webmaster Tools DOES have these helpful explanations, so I
expect they’ll migrate over.
You can also use the tool to enter a name for another crawler. The problem
is, someone using this tool probably doesn’t know the crawler names out there
that they want to block. I’d have given Google serious kudos points if they added
some of the other major crawlers. But then again, if they had, no doubt someone
would have accused them of trying to get people to block other search engines :)
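For a sense of what a generator like this does with a typed-in crawler name, here is a small sketch. It is not Google's implementation; the function name `build_robots_txt` and the crawler name `ExampleBot` are hypothetical.

```python
def build_robots_txt(rules: dict[str, list[str]]) -> str:
    """Render {crawler name: [paths to block]} as robots.txt text."""
    sections = []
    for agent, paths in rules.items():
        lines = [f"User-agent: {agent}"]
        # An empty Disallow value means nothing is blocked for this crawler.
        lines += [f"Disallow: {path}" for path in paths] or ["Disallow:"]
        sections.append("\n".join(lines))
    # Sections are separated by a blank line, one per crawler.
    return "\n\n".join(sections) + "\n"


print(build_robots_txt({"Googlebot": ["/print"], "ExampleBot": ["/"]}))
```

The hard part is not the formatting. It is knowing which crawler names to type in, which is exactly where a list of suggestions would help.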
It would also have been nice if people could paste full
URLs into the box to have them converted. A site owner using this tool might not
realize they need to drop the domain portion of a URL to block a particular
page. But if you could paste something like this:
And have the tool automatically turn it into this:
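The conversion wished for here amounts to dropping the scheme and domain from a URL and keeping the rest. A minimal sketch with Python's standard `urllib.parse` follows. The function name `url_to_robots_path` and the sample URL are hypothetical, and this is not a feature of the tool.

```python
from urllib.parse import urlsplit


def url_to_robots_path(url: str) -> str:
    """Drop the scheme and domain from a full URL, keeping path and query."""
    parts = urlsplit(url)
    # A bare domain has an empty path; robots.txt rules need at least "/".
    path = parts.path or "/"
    return f"{path}?{parts.query}" if parts.query else path


print(url_to_robots_path("http://website.com/print/page.html"))
```

The result is the site-relative path that belongs after `Disallow:` in the file.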
After you make your file, upload it to the root directory of your web site.
If you don’t know what that is, find someone who does! This is important. Google
allows for subdirectories of web sites to be registered within Google Webmaster
Tools. However, robots.txt files do NOT work on a subdirectory basis. They have
to go at the root level of a web site. If you don’t put them there, then you
won’t be preventing access to any part of the site. Remember, after you upload
to the root level, you can go back into Google Webmaster Tools and use that
aforementioned analysis tool to see if it is really blocking the pages you want
to keep out.
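Since crawlers only look for the file at the root of the site, the expected location can be worked out from any page URL. The sketch below uses Python's standard `urllib.parse`; the function name `robots_txt_url` and the sample page URL are hypothetical.

```python
from urllib.parse import urljoin


def robots_txt_url(page_url: str) -> str:
    """Return the root-level robots.txt URL for the site serving `page_url`."""
    # A leading slash makes urljoin discard the page's own path.
    return urljoin(page_url, "/robots.txt")


print(robots_txt_url("http://website.com/blog/2008/post.html"))
```

If your file ends up anywhere other than the URL this produces, such as inside `/blog/`, it will not be read as the site's robots.txt.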
Overall, I’m glad to see the new tool, and I imagine it will improve
over time and become even more user-friendly.
In related news, Google says that the Web Crawl diagnostics area now has a new
filter letting you see only web crawl errors related to sitemaps you’ve
submitted. Also, there have been some UI tweaks to the iGoogle gadgets from
Webmaster Central that were
rolled out last