A Deeper Look At Robots.txt
The Robots Exclusion Protocol (REP) is not exactly a complicated protocol and its uses are fairly limited, and thus it’s usually given short shrift by SEOs. Yet there’s a lot more to it than you might think. Robots.txt has been with us for over 14 years, but how many of us knew that in addition to the disallow directive there’s a noindex directive that Googlebot obeys? That noindexed pages don’t end up in the index but disallowed pages do, and the latter can show up in the search results (albeit with less information since the spiders can’t see the page content)? That disallowed pages still accumulate PageRank? That robots.txt can accept a limited form of pattern matching? That, because of that last feature, you can selectively disallow not just directories but also particular filetypes (well, file extensions to be more exact)? That a robots.txt disallowed page can’t be accessed by the spiders, so they can’t read and obey a meta robots tag contained within the page?
A robots.txt file provides critical information for search engine spiders that crawl the web. Before these bots (does anyone say the full word “robots” anymore?) access pages of a site, they check to see if a robots.txt file exists. Doing so makes crawling the web more efficient, because the robots.txt file keeps the bots from accessing certain pages that should not be indexed by the search engines.
Having a robots.txt file is a best practice, if only because some metrics programs will interpret the 404 response to the request for a missing robots.txt file as an error, which could result in erroneous performance reporting. But what goes in that robots.txt file? That’s the crux of it.
Both robots.txt and robots meta tags rely on cooperation from the robots, and are by no means guaranteed to work for every bot. If you need stronger protection from unscrupulous robots and other agents, you should use alternative methods such as password protection. Too many times I’ve seen webmasters naively place sensitive URLs such as administrative areas in robots.txt. You’d better believe robots.txt is one of the hacker’s first ports of call, to see where they should break in.
Robots.txt works well for:
- Barring crawlers from non-public parts of your website
- Barring search engines from trying to index scripts, utilities, or other types of code
- Avoiding the indexation of duplicate content on a website, such as “print” versions of HTML pages
- Auto-discovery of XML Sitemaps
At the risk of being Captain Obvious, the robots.txt file must reside in the root of the domain and must be named “robots.txt” (all lowercase). A robots.txt file located in a subdirectory isn’t valid, as bots only check for this file in the root of the domain.
Creating a robots.txt file is easy. You can create one in any text editor. It should be an ASCII-encoded text file, not an HTML file. The file is built from the following directives and syntax rules:
- User-Agent: the robot the following rule applies to (e.g. “Googlebot,” etc.)
- Disallow: the pages you want to block the bots from accessing (as many disallow lines as needed)
- Noindex: the pages you want a search engine to block AND not index (or de-index if previously indexed). Unofficially supported by Google; unsupported by Yahoo and Live Search.
- Each User-agent/Disallow group should be separated by a blank line; however, no blank lines should exist within a group (between the User-agent line and the last Disallow).
- The hash symbol (#) may be used for comments within a robots.txt file; everything after the # on that line will be ignored. A comment may occupy a whole line or the end of a line.
- Directories and filenames are case-sensitive: “private”, “Private”, and “PRIVATE” are all uniquely different to search engines.
Let’s look at an example robots.txt file. The example below includes:
- The robot called “Googlebot” has nothing disallowed and may go anywhere
- The entire site is closed off to the robot called “msnbot”
- All other robots should not visit the /tmp/ directory, or any directory or file whose path begins with /logs (e.g., /logs/ or /logs.php), as explained in the comments
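Assuming the usual one-entry-per-robot layout, the first two entries described above might be written along these lines (the comment wording is illustrative); the catch-all entry for the remaining robots comes right after:

```text
# Allow Googlebot anywhere
User-agent: Googlebot
Disallow:

# Block msnbot from the entire site
User-agent: msnbot
Disallow: /
```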
# Block all robots from tmp and logs directories
User-agent: *
Disallow: /tmp/
Disallow: /logs # for directories and files called logs
What should be listed on the User-Agent line? A user-agent is the name of a specific search engine robot. You can set an entry to apply to a specific bot (by listing the name) or you can set it to apply to all bots (by listing an asterisk, which acts as a wildcard). An entry that applies to all bots looks like this:
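As a minimal sketch, the smallest such entry pairs the wildcard with an empty Disallow, which blocks nothing. The empty rule is only there for illustration; any rules could follow the User-agent line.

```text
User-agent: *
Disallow:
```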
Major robots include: Googlebot (Google), Slurp (Yahoo!), msnbot (MSN), and TEOMA (Ask).
Bear in mind that a block of directives specified for the user-agent of Googlebot will be obeyed by Googlebot; but Googlebot will NOT ALSO obey the directives for the user-agent of * (all bots).
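To make that concrete, here is a small sketch with hypothetical paths. Googlebot reads only its own entry, so in this file it stays out of /logs/ but is free to crawl /tmp/, even though /tmp/ is closed to everyone else:

```text
User-agent: *
Disallow: /tmp/

User-agent: Googlebot
Disallow: /logs/
```

If Googlebot should also stay out of /tmp/, that rule would need to be repeated inside the Googlebot entry.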
What should be listed on the Disallow line? The disallow lists the pages you want to block. You can list a specific URL or a pattern. The entry should begin with a forward slash (/).
- To block the entire site:
- To block a directory and everything in it:
- To block a page:
- To block a page and/or a directory named private:
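The four cases above might be written as follows. The paths are hypothetical placeholders, these are four alternative rules rather than one finished file, and each Disallow line would sit under a User-agent line in practice:

```text
# Block the entire site
Disallow: /
# Block a directory and everything in it
Disallow: /private_directory/
# Block a single page
Disallow: /private_file.html
# Block a page and/or a directory named private
Disallow: /private
```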
If you serve content via both http and https, you’ll need a separate robots.txt file for each of these protocols. For example, to allow robots to index all http pages but no https pages, you’d use the robots.txt files as follows, for your http protocol:
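A sketch of the http version, assuming an empty Disallow is used to permit everything (an Allow: / line would be an alternative for engines that support Allow):

```text
User-agent: *
Disallow:
```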
And for the https protocol:
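A sketch of the https version, which closes the whole site to all robots:

```text
User-agent: *
Disallow: /
```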
Bots check for the robots.txt file each time they come to a website. The rules in the robots.txt file take effect as soon as it is uploaded to the site’s root and the bot next comes to the site. How often the file is accessed depends on how frequently the bots spider the site, which in turn is based on popularity, authority, and how frequently content is updated. Some sites may be crawled several times a day while others may only be crawled a few times a week. Google Webmaster Central provides a way to see when Googlebot last accessed the robots.txt file.
I’d recommend using the robots.txt analysis tool in Google Webmaster Central to check specific URLs to see if your robots.txt file allows or blocks them, see if Googlebot had trouble parsing any lines in your robots.txt file, and test changes to your robots.txt file.
Some advanced techniques
The major search engines have begun working together to advance the functionality of the robots.txt file. As alluded to above, some functions that provide finer control over crawling have been adopted by some, but not necessarily all, of the major engines. Because support may be limited, do exercise caution in their use.
Crawl delay: Some websites may experience high amounts of traffic and would like to slow search engine spiders down to allow for more server resources to meet the demands of regular traffic. Crawl delay is a special directive recognized by Yahoo, Live Search, and Ask that instructs a crawler on the number of seconds to wait between crawling pages:
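For illustration, an entry asking msnbot to pause between requests might look like this. The five-second value is an arbitrary assumption, not a recommendation:

```text
User-agent: msnbot
Crawl-delay: 5
```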
Pattern matching: At this time, pattern matching appears to be usable by the three majors: Google, Yahoo, and Live Search. The value of pattern matching is considerable. Let’s look first at the most basic of pattern matching, using the asterisk wildcard character. To block access to all subdirectories that begin with “private”:
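A sketch of that rule, shown here for Googlebot (the choice of user-agent is an assumption; any engine that supports wildcards would do):

```text
User-agent: Googlebot
Disallow: /private*/
```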
You can match the end of the string using the dollar sign ($). For example, to block URLs that end with .asp:
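Such a rule might look like this, again using Googlebot as an assumed user-agent:

```text
User-agent: Googlebot
Disallow: /*.asp$
```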
Unlike the more advanced pattern matching found in regular expressions in Perl and elsewhere, the question mark does not have special powers. So, to block access to all URLs that include a question mark (?), simply use the question mark (no need to “escape” it or precede it with a backslash):
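A minimal sketch, with Googlebot as an assumed user-agent. The leading /* matches any path, and the ? is taken literally:

```text
User-agent: Googlebot
Disallow: /*?
```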
To block robots from crawling all files of a specific file type (for example, .gif):
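Combining the wildcard and the end-of-string marker, the rule might be written like this (Googlebot is an assumed user-agent):

```text
User-agent: Googlebot
Disallow: /*.gif$
```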
Here’s a more complicated example. Let’s say your site uses the query string part of the URLs (what follows the “?”) solely for session IDs, and you want to exclude all URLs that contain the dynamic parameter to ensure the bots don’t crawl duplicate pages. But you may want to include any URLs that end with a “?”. Here’s how you’d accomplish that:
User-agent: *
Disallow: /*? # block any URL that includes a ?
Allow: /*?$ # allow any URL that ends in a ?
Allow directive: At this time, the Allow directive appears to only be supported by Google, Yahoo, and Ask. Just as it sounds, it works the opposite of the Disallow directive and provides the ability to specifically call out directories or pages that may be crawled. This may be beneficial after large sections of the site, or the entire site, have been disallowed.
To allow Googlebot into only the “google” directory:
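One way this might be written is sketched below. Listing the Allow line first is a cautious choice on my part, in case a parser applies the first matching rule rather than the most specific one:

```text
User-agent: Googlebot
Allow: /google/
Disallow: /
```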
Noindex directive: As mentioned above, this directive offers benefits in eliminating snippetless title-less listings from the search results, but it’s limited to Google. Its syntax exactly mirrors Disallow. In the words of Matt Cutts:
“Google allows a NOINDEX directive in robots.txt and it will completely remove all matching site URLs from Google. (That behavior could change based on this policy discussion, of course, which is why we haven’t talked about it much.)”
Sitemap: An XML sitemap file can tell search engines about all the pages on your site and, optionally, provide information about those pages, such as which are most important and how often they change. The Sitemap directive in robots.txt acts as an auto-discovery mechanism for the spider to find the XML sitemap file. You can tell Google and other search engines about your Sitemap by adding the following line to your robots.txt file:
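The line takes the following form, with sitemap_location standing in as a placeholder for the real URL:

```text
Sitemap: sitemap_location
```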
The sitemap_location should be the complete URL to the Sitemap, such as: https://www.example.com/sitemap.xml. This directive is independent of the user-agent line, so it doesn’t matter where you place it in your file. All major search engines support the Auto-Discovery Sitemap protocol, including Google, Yahoo, Live Search, and Ask.
While auto-discovery provides a way to inform search engines about the sitemap.xml file, it’s also worthwhile verifying and submitting sitemaps directly to the search engines through each of their webmaster consoles (Google Webmaster Central, Yahoo Site Explorer, Live Search Webmaster Center).
More about Google’s bots
Google uses several different bots (user-agents). The bot for web search is Googlebot. Google’s other bots follow rules you set up for Googlebot, but you can set up additional rules for these specific bots as well. Blocking Googlebot blocks all bots that begin with “Googlebot”.
Here’s a list of Google robots:
- Googlebot: crawls pages for the web index and news index
- Googlebot-Mobile: crawls pages for mobile index
- Googlebot-Image: crawls pages for image index
- Mediapartners-Google: crawls pages to determine AdSense content, only crawls sites that show AdSense ads
- Adsbot-Google: crawls to measure AdWords landing page quality, only crawls sites that use Google AdWords to advertise
You can block Googlebot entirely by using:
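A minimal sketch of such an entry:

```text
User-agent: Googlebot
Disallow: /
```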
You can allow Googlebot, but block access to all other bots:
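A sketch of how that might look. Because Googlebot obeys only the entry addressed to it, the empty Disallow leaves it free to crawl everything while the wildcard entry shuts out the rest:

```text
User-agent: *
Disallow: /

User-agent: Googlebot
Disallow:
```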
Issues with robots.txt
Pages you block by using robots.txt disallows may still be in Google’s index and appear in the search results, especially if other sites link to them. Granted, a high ranking is pretty unlikely since Google can’t “see” the page content; it has very little to go on other than the anchor text of inbound and internal links, and the URL (and the ODP title and description, if the page is listed in ODP/DMOZ). As a result, the URL of the page and, potentially, other publicly available information can appear in search results. However, no content from your pages will be crawled, indexed or displayed.
To entirely prevent a page from being added to a search engine’s index even if other sites link to it, use a “noindex” robots meta tag and ensure that the page is not disallowed in robots.txt. When spiders crawl the page, they will recognize the “noindex” meta tag and drop the URL from the index.
Robots.txt and robots meta tag conflicts
If the robots.txt file and robots meta tag instructions for a page conflict, bots follow the most restrictive. More specifically:
- If you block a page with robots.txt, bots will never crawl the page and will never read any robots meta tags on the page.
- If you allow a page with robots.txt but block it from being indexed using a robots meta tag, Googlebot will access the page, read the meta tag, and subsequently not index it.
Although the purpose of a robots.txt file is to protect content on a site from being indexed, including one regardless is recommended, since many robotic processes look for it and offering one can only expedite their procedures. Together, robots.txt and robots meta tags give you the flexibility to express complex access policies relatively easily:
- Removing an entire website or part of a website.
- Avoiding indexation of images in Google Image Search and other image engines.
- Avoiding indexation of duplicate content on a site.
- Removing individual pages on a site using a robots Meta tag.
- Removing cached copies and snippets using a robots Meta tag.
Both robots.txt and robots meta tags rely on cooperation from the robots, and are by no means guaranteed to work for every robot. If you need stronger protection from robots and other agents, you should use alternative methods such as password protection.
Opinions expressed in this article are those of the guest author and not necessarily Search Engine Land.