Today, Microsoft announced changes to its Live Search crawler intended to reduce the bandwidth it consumes while crawling a site. MSNbot (upgraded to version 1.1) now supports both HTTP compression and conditional get. The post on the Live Search Webmaster Center blog describes each feature in detail and includes links to tools you can use to check whether your server supports these features.
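- HTTP compression lets search engine crawlers (and browsers) request a compressed version of a file. A server that supports compression compresses the file before sending it, and the crawler decompresses it after download, so fewer bytes cross the wire.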
- Conditional get lets the crawler ask a server if the page has been changed since the last request (using the If-Modified-Since header). If the content hasn’t changed, a server that supports conditional get returns a 304 response (not modified). When the crawler gets this response, it doesn’t download the page contents (and continues to use the version already downloaded).
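Both features can be sketched from the server's side in a few lines of Python. This is only an illustration of the decision a server makes, not how any particular web server or crawler implements it. The function name `respond`, the lowercase header keys, and the simple substring check for `gzip` are assumptions of this sketch.

```python
import gzip
from datetime import datetime, timezone
from email.utils import format_datetime, parsedate_to_datetime


def respond(request_headers, body, last_modified):
    """Return (status, headers, payload) for a GET request.

    request_headers: dict with lowercase header names (an assumption here).
    body: the page content as bytes.
    last_modified: timezone-aware UTC datetime of the page's last change.
    """
    # HTTP dates have one-second resolution, so drop microseconds first.
    last_modified = last_modified.replace(microsecond=0)
    headers = {"Last-Modified": format_datetime(last_modified, usegmt=True)}

    # Conditional get: if the crawler's copy is at least as new as ours,
    # answer 304 (not modified) and send no page contents.
    since = request_headers.get("if-modified-since")
    if since:
        try:
            if last_modified <= parsedate_to_datetime(since):
                return 304, headers, b""
        except (TypeError, ValueError):
            pass  # Unparseable date: fall through and send the full page.

    # HTTP compression: gzip the body only when the client says it accepts
    # gzip. (Simplified: a real server would also honor q-values.)
    if "gzip" in request_headers.get("accept-encoding", "").lower():
        headers["Content-Encoding"] = "gzip"
        body = gzip.compress(body)

    headers["Content-Length"] = str(len(body))
    return 200, headers, body


# Example: a crawler revisiting a page that has not changed since its last fetch.
page = b"<html>" + b"hello " * 200 + b"</html>"
changed_at = datetime(2008, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
status, _, payload = respond(
    {"if-modified-since": "Tue, 01 Jan 2008 12:00:00 GMT"}, page, changed_at
)
```

In the example call, the crawler already holds the current version, so the server can answer with a 304 and an empty payload instead of resending the page.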
As the blog post notes, the other major search engines support these features as well.
Google
Google overhauled their crawler to reduce bandwidth usage in 2006 as part of the “Bigdaddy” infrastructure change. With this effort, Googlebot increased support for HTTP compression and started using a crawl caching proxy. The Google webmaster help center describes Googlebot’s handling of conditional get, which is similar to MSNbot’s.
Yahoo!
In 2005, Yahoo! announced support of both HTTP compression and conditional get.
Ask
Ask’s webmaster documentation includes information about HTTP compression, although it doesn’t mention conditional get support.
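Conditional get also affects the cache date shown in search results, because a 304 response means the crawler visited the page without downloading it again. In 2006, Google changed how it displays cache dates of pages to reflect the most recent visit to the page, rather than the most recent download of the page. Live Search matches Google’s current functionality, showing the last time MSNbot visited the page as the cache date. Yahoo! doesn’t display a cache date.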
Other Ways To Reduce Search Engine Crawler Bandwidth
If search engine crawlers use too much bandwidth on your site, even once your server has HTTP compression and conditional get turned on, you can use additional methods to reduce bandwidth consumption. However, keep in mind that unlike HTTP compression and conditional get, these other methods could potentially reduce the number of indexed pages.
Crawl Delay
Live Search, Yahoo!, and Ask all support the crawl-delay instruction in robots.txt (Google is the lone holdout). You specify the crawl-delay in seconds, which indicates how long the crawler should wait between page fetches.
A robots.txt file that directs all crawlers to wait five seconds between each page fetch looks as follows:
user-agent: *
crawl-delay: 5
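To check how a crawler-side parser might read this file, Python's standard `urllib.robotparser` module can parse the rules and report the delay. This is a minimal sketch, not the parser any search engine actually uses. The helper name `crawl_delay_for` is an assumption, and `msnbot` is used only as a sample user-agent string.

```python
from urllib.robotparser import RobotFileParser

# The robots.txt shown above, one directive per line.
ROBOTS_TXT = """\
user-agent: *
crawl-delay: 5
"""


def crawl_delay_for(robots_txt, user_agent):
    """Return the crawl-delay in seconds for user_agent, or None if unset."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.crawl_delay(user_agent)


delay = crawl_delay_for(ROBOTS_TXT, "msnbot")
print(delay)
```

Because the rule is declared for `*`, the same delay is reported for any crawler name that has no more specific entry in the file.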
The Live Search webmaster help notes that the news crawler doesn’t follow the crawl-delay instruction:
“Live Search also uses a dedicated crawler to crawl certain types of sites at high frequency. The msnbot-NewsBlogs/1.0 news crawler helps provide current results for our news site. The msnbot-NewsBlogs/1.0 does not adhere to the crawl-delay settings.
If you find that MSNBot is still placing too high a load on your web server, contact Site Owner Support.”
In a recent interview, Matt Cutts of Google explained that Google doesn’t support crawl delay.
“I believe the only robots.txt extension in common use that Google doesn’t support is the crawl-delay. And, the reason that Google doesn’t support crawl-delay is because way too many people accidentally mess it up. For example, they set crawl-delay to a hundred thousand, and, that means you get to crawl one page every other day or something like that.
We have even seen people who set a crawl-delay such that we’d only be allowed to crawl one page per month. What we have done instead is provide throttling ability within Webmaster Central, but crawl-delay is the inverse; it’s saying crawl me once every “n” seconds. In fact what you really want is host-load, which lets you define how many Googlebots are allowed to crawl your site at once. So, a host-load of two would mean, 2 Googlebots are allowed to be crawling the site at once.”
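The figures in the quote are rough, so it can help to work the arithmetic out. The sketch below assumes a single crawler that fetches one page, waits exactly the crawl-delay, and repeats, which gives an upper bound on pages fetched per day. The function name `max_fetches_per_day` is made up for this illustration.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def max_fetches_per_day(crawl_delay_seconds):
    """Upper bound on page fetches per day for one crawler honoring the delay."""
    return SECONDS_PER_DAY / crawl_delay_seconds


for delay in (5, 60, 100_000):
    print(f"crawl-delay {delay}: at most {max_fetches_per_day(delay):.3f} pages/day")
```

Under these assumptions, a crawl-delay of 5 caps the crawler at 17,280 pages a day, while a crawl-delay of 100,000 allows less than one page a day.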
The crawl rate feature in Google’s Webmaster Central provides information about Googlebot’s current bandwidth usage and enables webmasters to request a slower crawl (and in some cases, a faster crawl).
Using robots.txt To Reduce Bandwidth
You can block pages or directories from being crawled to reduce overall bandwidth. If you have large portions of your site that you don’t want (or need) indexed, you can use robots.txt to block search engine crawlers from accessing them. Note that use of robots meta tags will keep the pages out of the index, but won’t achieve the bandwidth reduction goals, as the crawlers have to access the pages to read the meta tags.
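Before deploying new blocking rules, it can be worth checking which URLs they actually cover. The sketch below uses Python's standard `urllib.robotparser` for that. The `/archive/` and `/search` paths are hypothetical examples, not taken from any real site, and `is_crawlable` is a helper invented for this illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: keep all crawlers out of an archive and search results.
ROBOTS_TXT = """\
User-agent: *
Disallow: /archive/
Disallow: /search
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def is_crawlable(path, user_agent="msnbot"):
    """True if the parsed rules allow user_agent to fetch path."""
    return parser.can_fetch(user_agent, path)


for path in ("/products/widget.html", "/archive/2007/post.html", "/search?q=shoes"):
    print(path, "allowed" if is_crawlable(path) else "blocked")
```

A compliant crawler never requests the blocked paths, which is where the bandwidth saving comes from. A robots meta tag, by contrast, is only seen after the page has been fetched.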
Many webmasters assumed MSNbot already supported HTTP compression and conditional get, although some had criticized Live Search for using more bandwidth than other search engine crawlers. With these enhancements, webmasters who have these features enabled on their servers should notice a bandwidth reduction.