Apr 16, 2009 at 8:00am ET by Stephan Spencer
The Robots Exclusion Protocol (REP) is not exactly a complicated protocol and its uses are fairly limited, and thus it’s usually given short shrift by SEOs. Yet there’s a lot more to it than you might think. Robots.txt has been with us for over 14 years, but how many of us knew that in addition to the disallow directive there’s a noindex directive that Googlebot obeys? That noindexed pages don’t end up in the index but disallowed pages do, and the latter can show up in the search results (albeit with less information since the spiders can’t see the page content)? That disallowed pages still accumulate PageRank? That robots.txt can accept a limited form of pattern matching? That, because of that last feature, you can selectively disallow not just directories but also particular filetypes (well, file extensions to be more exact)? That a robots.txt disallowed page can’t be accessed by the spiders, so they can’t read and obey a meta robots tag contained within the page?
A robots.txt file provides critical information for search engine spiders that crawl the web. Before these bots (does anyone say the full word “robots” anymore?) access pages of a site, they check to see if a robots.txt file exists. Doing so makes crawling the web more efficient, because the robots.txt file keeps the bots from accessing certain pages that should not be indexed by the search engines.
Having a robots.txt file is a best practice, if only because some metrics programs will interpret the 404 response to the request for a missing robots.txt file as an error, which could result in erroneous performance reporting. But what goes in that robots.txt file? That’s the crux of it.
Both robots.txt and robots meta tags rely on cooperation from the robots, and are by no means guaranteed to work for every bot. If you need stronger protection from unscrupulous robots and other agents, you should use alternative methods such as password protection. Too many times I’ve seen webmasters naively place sensitive URLs such as administrative areas in robots.txt. You’d better believe robots.txt is one of the hacker’s first ports of call, to see where they should break in.
Robots.txt works well for:
At the risk of being Captain Obvious, the robots.txt file must reside in the root of the domain and must be named “robots.txt” (all lowercase). A robots.txt file located in a subdirectory isn’t valid, as bots only check for this file in the root of the domain.
Creating a robots.txt file is easy. You can create a robots.txt file in any text editor. It should be an ASCII-encoded text file, not an HTML file.
Robots.txt syntax
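Let’s look at an example robots.txt file. The example below includes: a block that lets the robot called “Googlebot” go anywhere (nothing is disallowed); a block that closes the entire site to the robot called “msnbot”; and a block that keeps all other robots out of the /tmp/ directory and out of any directories or files whose paths begin with /logs.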
User-agent: Googlebot
Disallow:

User-agent: msnbot
Disallow: /

# Block all robots from tmp and logs directories
User-agent: *
Disallow: /tmp/
Disallow: /logs # for directories and files called logs
What should be listed on the User-Agent line? A user-agent is the name of a specific search engine robot. You can set an entry to apply to a specific bot (by listing the name) or you can set it to apply to all bots (by listing an asterisk, which acts as a wildcard). An entry that applies to all bots looks like this:
User-Agent: *
Major robots include: Googlebot (Google), Slurp (Yahoo!), msnbot (MSN), and TEOMA (Ask).
Bear in mind that a block of directives specified for the user-agent of Googlebot will be obeyed by Googlebot; but Googlebot will NOT ALSO obey the directives for the user-agent of * (all bots).
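If you want to sanity-check how a rule set like the example above is read, one option is Python’s standard-library urllib.robotparser. What follows is only a sketch: the bot name “SomeOtherBot” and the example.com URLs are made up for illustration, and this parser implements just the basic protocol, so a real crawler may differ in the details.

```python
from urllib.robotparser import RobotFileParser

# The example robots.txt from this article, one directive per line.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow:

User-agent: msnbot
Disallow: /

# Block all robots from tmp and logs directories
User-agent: *
Disallow: /tmp/
Disallow: /logs # for directories and files called logs
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(bot, path):
    """Return True if `bot` may fetch `path` on the (hypothetical) example site."""
    return parser.can_fetch(bot, "http://www.example.com" + path)


# "SomeOtherBot" is a made-up name standing in for any bot without its own block.
for bot, path in [
    ("Googlebot", "/tmp/page.html"),     # Googlebot has its own block, so the * block is ignored
    ("msnbot", "/index.html"),           # msnbot is shut out of the whole site
    ("SomeOtherBot", "/tmp/page.html"),  # falls under the * block
    ("SomeOtherBot", "/logs.html"),      # /logs matches by prefix, so files are caught too
    ("SomeOtherBot", "/index.html"),     # nothing in the * block matches
]:
    print(bot, path, "allowed" if allowed(bot, path) else "blocked")
```

Note how Googlebot is allowed into /tmp/ here: once a bot finds a block addressed to it by name, the rules under * no longer apply to it.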
What should be listed on the Disallow line? The Disallow line lists the pages you want to block. You can list a specific URL or a pattern. The entry should begin with a forward slash (/).
Examples:
Disallow: /
Disallow: /private_directory/
Disallow: /private_file.html
Disallow: /private
If you serve content via both http and https, you’ll need a separate robots.txt file for each of these protocols. For example, to allow robots to index all http pages but no https pages, you’d use the robots.txt files as follows, for your http protocol:
User-agent: *
Disallow:
And for the https protocol:
User-agent: *
Disallow: /
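To make the “one file per protocol, always in the root” point concrete, here is a small sketch of how a crawler could work out which robots.txt governs a given page. The function name robots_txt_url and the example.com URLs are my own illustration, not part of any standard tool.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url):
    """Return the robots.txt URL that applies to `page_url`.

    The file always sits at the root of the same scheme and host,
    so the page's path, query string and fragment are discarded.
    """
    parts = urlsplit(page_url)
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


# Hypothetical URLs: the http and https versions map to different files.
print(robots_txt_url("http://www.example.com/shop/cart.html?id=7"))
print(robots_txt_url("https://www.example.com/shop/cart.html?id=7"))
```

A file sitting in /shop/robots.txt never comes into play, which is why a robots.txt in a subdirectory has no effect.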
Bots check for the robots.txt file each time they come to a website. The rules in the robots.txt file will be in effect immediately once it is uploaded to the site’s root and the bot comes to the site. How often it is accessed depends on how frequently the bots spider the site, which in turn is based on popularity, authority, and how frequently content is updated. Some sites may be crawled several times a day while others may only be crawled a few times a week. Google Webmaster Central provides a way to see when Googlebot last accessed the robots.txt file.
I’d recommend using the robots.txt analysis tool in Google Webmaster Central to check specific URLs to see if your robots.txt file allows or blocks them, see if Googlebot had trouble parsing any lines in your robots.txt file, and test changes to your robots.txt file.
Some advanced techniques
The major search engines have begun working together to advance the functionality of the robots.txt file. As alluded to above, there are some functions that have been adopted by some, but not necessarily all, of the major search engines, and that provide finer control over crawling. Because support may be limited, do exercise caution in their use.
Crawl delay: Some websites may experience high amounts of traffic and would like to slow search engine spiders down to allow for more server resources to meet the demands of regular traffic. Crawl delay is a special directive recognized by Yahoo, Live Search, and Ask that instructs a crawler on the number of seconds to wait between crawling pages:
User-agent: msnbot
Crawl-delay: 5
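As a rough illustration of how a polite crawler might pick up that value, Python’s standard-library urllib.robotparser exposes a crawl_delay() method. This is a sketch of reading the directive, not a description of how msnbot itself behaves.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: msnbot
Crawl-delay: 5
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Seconds to wait between requests, or None when no delay is specified
# for that user-agent.
msnbot_delay = parser.crawl_delay("msnbot")
googlebot_delay = parser.crawl_delay("Googlebot")

print("msnbot:", msnbot_delay)
print("Googlebot:", googlebot_delay)
```

Since the file above has no block for Googlebot and no * block, the second lookup comes back empty, which mirrors the fact that the directive only binds the bots it is addressed to.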
Pattern matching: At this time, pattern matching appears to be usable by the three majors: Google, Yahoo, and Live Search. The value of pattern matching is considerable. Let’s look first at the most basic of pattern matching, using the asterisk wildcard character. To block access to all subdirectories that begin with “private”:
User-agent: Googlebot
Disallow: /private*/
You can match the end of the string using the dollar sign ($). For example, to block URLs that end with .asp:
User-agent: Googlebot
Disallow: /*.asp$
Unlike the more advanced pattern matching found in regular expressions in Perl and elsewhere, the question mark does not have special powers. So, to block access to all URLs that include a question mark (?), simply use the question mark (no need to “escape” it or precede it with a backslash):
User-agent: *
Disallow: /*?*
To block robots from crawling all files of a specific file type (for example, .gif):
User-agent: *
Disallow: /*.gif$
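If you’d like to test patterns like these before uploading them, the two special characters can be mimicked with an ordinary regular expression. The sketch below is my own approximation of the matching described above (the helper names and sample paths are hypothetical), so treat it as a way to reason about patterns rather than as any engine’s actual implementation.

```python
import re


def robots_pattern_to_regex(pattern):
    """Translate a robots.txt path pattern into a compiled regex.

    Only two characters are special: * matches any run of characters,
    and a trailing $ anchors the pattern to the end of the URL.
    Everything else, including ?, is matched literally.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape each literal chunk, then rejoin the chunks with ".*" for each *.
    regex = ".*".join(re.escape(chunk) for chunk in pattern.split("*"))
    if anchored:
        regex += r"\Z"
    return re.compile(regex)


def pattern_matches(pattern, path):
    """Return True if `path` matches `pattern`, starting from the first character."""
    return robots_pattern_to_regex(pattern).match(path) is not None


# Hypothetical paths run against the patterns used in this article.
print(pattern_matches("/private*/", "/private_files/report.html"))
print(pattern_matches("/*.asp$", "/products/list.asp"))
print(pattern_matches("/*.asp$", "/products/list.asp?id=3"))
print(pattern_matches("/*?*", "/page.html?sessionid=123"))
print(pattern_matches("/*.gif$", "/images/logo.gif"))
```

The third check is the instructive one: because of the $ anchor, a URL that has anything after .asp, such as a query string, is not caught by /*.asp$.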
Here’s a more complicated example. Let’s say your site uses the query string part of the URLs (what follows the “?”) solely for session IDs, and you want to exclude all URLs that contain the dynamic parameter to ensure the bots don’t crawl duplicate pages. But you may want to include any URLs that end with a “?”. Here’s how you’d accomplish that:
User-agent: Slurp
Disallow: /*? # block any URL that includes a ?
Allow: /*?$ # allow any URL that ends in a ?
Allow directive: At this time, the Allow directive appears to only be supported by Google, Yahoo, and Ask. Just as it sounds, it works the opposite of the Disallow directive and provides the ability to specifically call out directories or pages that may be crawled. This may be beneficial after large sections or the entire site has been disallowed.
To allow Googlebot into only the “google” directory:
User-agent: Googlebot
Disallow: /
Allow: /google/
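How a bot resolves a URL that matches both a Disallow and an Allow line is worth thinking through. The sketch below assumes the rule with the longest matching path wins, which is how Google documents its own handling; other crawlers may instead apply rules in the order they appear, so don’t take this as universal. The function name, the rule list format and the sample paths are all made up for illustration, and wildcards and ties are deliberately left out.

```python
def is_allowed(rules, path):
    """Decide whether `path` may be crawled under `rules`.

    `rules` is a list of (directive, prefix) pairs, where directive is
    "allow" or "disallow". Assumption: the longest matching prefix wins,
    and a path that matches no rule is allowed.
    """
    best_directive, best_prefix = "allow", ""
    for directive, prefix in rules:
        if not prefix:
            continue  # an empty Disallow restricts nothing
        if path.startswith(prefix) and len(prefix) > len(best_prefix):
            best_directive, best_prefix = directive, prefix
    return best_directive == "allow"


# The Googlebot block from the example above, expressed as rule pairs.
GOOGLEBOT_RULES = [("disallow", "/"), ("allow", "/google/")]

print(is_allowed(GOOGLEBOT_RULES, "/google/page.html"))
print(is_allowed(GOOGLEBOT_RULES, "/index.html"))
```

Under this assumption the more specific Allow: /google/ beats the blanket Disallow: /, so only the “google” directory stays open.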
Noindex directive: As mentioned above, this directive offers benefits in eliminating snippetless title-less listings from the search results, but it’s limited to Google. Its syntax exactly mirrors Disallow. In the words of Matt Cutts:
“Google allows a NOINDEX directive in robots.txt and it will completely remove all matching site URLs from Google. (That behavior could change based on this policy discussion, of course, which is why we haven’t talked about it much.)”
Sitemap: An XML sitemap file can tell search engines about all the pages on your site and, optionally, provide information about those pages, such as which are most important and how often they change. The Sitemap directive in robots.txt acts as an auto-discovery mechanism for the spider to find the XML sitemap file. You can tell Google and other search engines about your Sitemap by adding the following line to your robots.txt file:
Sitemap: sitemap_location
The sitemap_location should be the complete URL to the Sitemap, such as: http://www.example.com/sitemap.xml. This directive is independent of the user-agent line, so it doesn’t matter where you place it in your file. All major search engines support the Auto-Discovery Sitemap protocol, including Google, Yahoo, Live Search, and Ask.
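To check that the line is picked up wherever it sits in the file, you can read it back with Python’s standard-library urllib.robotparser (the site_maps() method needs Python 3.8 or later). This is just a sketch using the example.com URL from above.

```python
from urllib.robotparser import RobotFileParser

# The Sitemap line is placed first here, outside any user-agent block.
ROBOTS_TXT = """\
Sitemap: http://www.example.com/sitemap.xml

User-agent: *
Disallow: /tmp/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# A list of every Sitemap URL found in the file, or None if there are none.
sitemaps = parser.site_maps()
print(sitemaps)
```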
While auto-discovery provides a way to inform search engines about the sitemap.xml file, it’s also worthwhile verifying and submitting sitemaps directly to the search engines through each of their webmaster consoles (Google Webmaster Central, Yahoo Site Explorer, Live Search Webmaster Center).
More about Google’s bots
Google uses several different bots (user-agents). The bot for web search is Googlebot. Google’s other bots follow rules you set up for Googlebot, but you can set up additional rules for these specific bots as well. Blocking Googlebot blocks all bots that begin with “Googlebot”.
Here’s a list of Google robots:
You can block Googlebot entirely by using:
User-agent: Googlebot
Disallow: /
You can allow Googlebot, but block access to all other bots:
User-agent: *
Disallow: /
User-agent: Googlebot
Disallow:
Issues with robots.txt
Pages you block by using robots.txt disallows may still be in Google’s index and appear in the search results — especially if other sites link to them. Granted, a high ranking is pretty unlikely since Google can’t “see” the page content; it has very little to go on other than the anchor text of inbound and internal links, and the URL (and the ODP title and description if in ODP/DMOZ.) As a result, the URL of the page and, potentially, other publicly available information can appear in search results. However, no content from your pages will be crawled, indexed or displayed.
To entirely prevent a page from being added to a search engine’s index even if other sites link to it, use a “noindex” robots meta tag and ensure that the page is not disallowed in robots.txt. When spiders crawl the page, they will recognize the “noindex” meta tag and drop the URL from the index.
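For reference, the tag in question looks like the one embedded in the sample page below. The sketch shows one way to verify that a page really carries it, using Python’s standard-library html.parser; the class name, function name and sample HTML are hypothetical, and a search engine’s own parsing is certainly more involved.

```python
from html.parser import HTMLParser

# A made-up page whose head carries the robots meta tag.
SAMPLE_PAGE = """\
<html>
<head>
<title>Private page</title>
<meta name="robots" content="noindex, follow">
</head>
<body>Not for the index.</body>
</html>
"""


class RobotsMetaParser(HTMLParser):
    """Collect the directives found in <meta name="robots"> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() != "robots":
            return
        content = attributes.get("content") or ""
        # Directives are comma-separated and case-insensitive.
        for token in content.split(","):
            if token.strip():
                self.directives.add(token.strip().lower())


def has_noindex(html):
    """Return True if the page asks robots not to index it."""
    meta_parser = RobotsMetaParser()
    meta_parser.feed(html)
    return "noindex" in meta_parser.directives


print(has_noindex(SAMPLE_PAGE))
```

Keep in mind that a spider only ever gets to run a check like this on pages it is allowed to fetch, which is exactly why the page must not also be disallowed in robots.txt.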
Robots.txt and robots meta tag conflicts
If the robots.txt file and robots meta tag instructions for a page conflict, bots follow the most restrictive. More specifically:
While the purpose of a robots.txt file is to keep content on a site from being crawled, including one regardless is recommended, as many robotic processes look for it and offering one can only expedite their procedures. Together, robots.txt and robots meta tags give you the flexibility to express complex access policies relatively easily.
Both robots.txt and robots meta tags rely on cooperation from the robots, and are by no means guaranteed to work for every robot. If you need stronger protection from robots and other agents, you should use alternative methods such as password protection.
Opinions expressed in the article are those of the guest author and not necessarily Search Engine Land.
This is probably a formatting issue, but anyone new to robots.txt should bear in mind that each directive should start on a new line, i.e.:
User-agent: Googlebot
Disallow:
If you just copy/paste directives as shown above you’d have incorrect disallow directives that won’t work.
“The rules in the robots.txt file will be in effect immediately once it is uploaded to the site’s root and the bot comes to the site.”
The rules will only be in effect once bots fetch and interpret them. This won’t happen immediately after a change is made, and the delay depends on the search engine: it can easily be days.
I was doing some research on robots.txt and found your very informative post – thank you.
My initial search query surprised me and I ended up writing a blog post about it prompting me to think about how we use, or sometimes forget to use a robots.txt file in the most beneficial way.