MSNbot 1.1: Live Search Implements A More Efficient Crawl

Today, Microsoft announced changes to its Live Search crawler intended to reduce the bandwidth used when crawling a site. MSNbot (now upgraded to version 1.1) supports both HTTP compression and conditional GET. The post on the Live Search Webmaster Center blog describes each feature in detail and includes links to tools you can use to check whether your server supports them.

  • HTTP compression lets a server compress files before sending them to search engine crawlers (and browsers) that request compressed content, reducing the amount of data transferred.
  • Conditional GET lets the crawler ask a server whether a page has changed since the last request (using the If-Modified-Since header). If the content hasn’t changed, a server that supports conditional GET returns a 304 (Not Modified) response. When the crawler gets this response, it doesn’t download the page contents and continues to use the version it already downloaded. A minimal sketch of both features appears after this list.
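
To make the request flow concrete, here is a minimal Python sketch of a client using both features. It relies only on the standard library; the URL and the saved Last-Modified timestamp are hypothetical placeholders, and a real crawler would handle many more cases.

    import gzip
    import urllib.request
    from urllib.error import HTTPError

    # Hypothetical page and the Last-Modified value saved from a previous fetch.
    url = "http://www.example.com/page.html"
    last_fetch_time = "Tue, 12 Feb 2008 08:00:00 GMT"

    request = urllib.request.Request(url, headers={
        "Accept-Encoding": "gzip",             # ask the server for a compressed response
        "If-Modified-Since": last_fetch_time,  # ask for the page only if it has changed
    })

    try:
        with urllib.request.urlopen(request) as response:
            body = response.read()
            # Decompress only if the server actually sent a gzipped response.
            if response.headers.get("Content-Encoding") == "gzip":
                body = gzip.decompress(body)
            print("Page changed; downloaded", len(body), "bytes")
    except HTTPError as err:
        if err.code == 304:
            print("304 Not Modified; reusing the previously downloaded copy")
        else:
            raise

If the server supports both features, the response is either a compressed copy of the page or a tiny 304 with no body at all, which is where the bandwidth savings come from.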

As the blog post notes, the other major search engines support these features as well.

Google: Google overhauled its crawler to reduce bandwidth usage in 2006 as part of the “Bigdaddy” infrastructure change. With that effort, Googlebot increased support for HTTP compression and began using a crawl caching proxy. The Google webmaster help center describes Googlebot’s handling of conditional GET, which is similar to MSNbot’s.

Yahoo!: In 2005, Yahoo! announced support for both HTTP compression and conditional GET.

Ask: Ask’s webmaster documentation includes information about HTTP compression, although it doesn’t mention support for conditional GET.

Cache Dates

In 2006, Google changed the cache date it displays for a page to reflect the most recent visit to the page rather than the most recent download of it. Live Search matches Google’s current behavior, showing the last time MSNbot visited the page as the cache date. Yahoo! doesn’t display a cache date.

Other Ways To Reduce Search Engine Crawler Bandwidth

If search engine crawlers use too much bandwidth on your site even after your server has HTTP compression and conditional GET turned on, you can use additional methods to reduce bandwidth consumption. However, keep in mind that unlike HTTP compression and conditional GET, these other methods can reduce the number of pages that get indexed.

Crawl Delay: Live Search, Yahoo!, and Ask all support the crawl-delay instruction in robots.txt (Google is the lone holdout). You specify the crawl-delay in seconds, which indicates how long the crawler should wait between page fetches.

A robots.txt file that directs all crawlers to wait five seconds between each page fetch looks as follows:

    user-agent: *
    crawl-delay: 5
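
On the crawler side, Python’s standard library can read this instruction directly. The sketch below is only illustrative (the major search engines implement their own fetch scheduling): it assumes a robots.txt like the one above is published on a hypothetical site at www.example.com and uses a made-up user agent name.

    import time
    import urllib.request
    import urllib.robotparser

    # Hypothetical site and user agent; a real crawler tracks delays per host.
    site = "http://www.example.com"
    user_agent = "examplebot"

    robots = urllib.robotparser.RobotFileParser(site + "/robots.txt")
    robots.read()

    # crawl_delay() returns the value from robots.txt, or None if none is set.
    delay = robots.crawl_delay(user_agent) or 0

    for path in ["/", "/about.html", "/contact.html"]:
        if robots.can_fetch(user_agent, site + path):
            with urllib.request.urlopen(site + path) as response:
                print(path, response.status, len(response.read()), "bytes")
        time.sleep(delay)  # wait the requested number of seconds between fetches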

The Live Search webmaster help notes that the news crawler doesn’t follow the crawl-delay instruction:

“Live Search also uses a dedicated crawler to crawl certain types of sites at high frequency. The msnbot-NewsBlogs/1.0 news crawler helps provide current results for our news site. The msnbot-NewsBlogs/1.0 does not adhere to the crawl-delay settings.

If you find that MSNBot is still placing too high a load on your web server, contact Site Owner Support.”

In a recent interview, Matt Cutts of Google explained why Google doesn’t support crawl-delay.

“I believe the only robots.txt extension in common use that Google doesn’t support is the crawl-delay. And, the reason that Google doesn’t support crawl-delay is because way too many people accidentally mess it up. For example, they set crawl-delay to a hundred thousand, and, that means you get to crawl one page every other day or something like that.
We have even seen people who set a crawl-delay such that we’d only be allowed to crawl one page per month. What we have done instead is provide throttling ability within Webmaster Central, but crawl-delay is the inverse; it’s saying crawl me once every “n” seconds. In fact what you really want is host-load, which lets you define how many Googlebots are allowed to crawl your site at once. So, a host-load of two would mean, 2 Googlebots are allowed to be crawling the site at once.”

The crawl rate feature in Google’s Webmaster Central provides information about Googlebot’s current bandwidth usage and enables webmasters to request a slower crawl (and, in some cases, a faster one).

Using robots.txt To Reduce Bandwidth: You can block pages or directories from being crawled to reduce overall bandwidth. If large portions of your site don’t need to be indexed, you can use robots.txt to block search engine crawlers from accessing them. Note that robots meta tags will keep pages out of the index, but they won’t reduce bandwidth, because crawlers have to fetch the pages in order to read the meta tags.
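
As an illustration, a robots.txt file along these lines (the directory names are placeholders for whichever sections of your site don’t need to be indexed) stops compliant crawlers from fetching those URLs at all:

    user-agent: *
    disallow: /search/
    disallow: /print/

Blocking whole directories this way keeps the file short and saves the bandwidth for every URL beneath them, at the cost of those pages no longer being crawled.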

Many webmasters assumed MSNbot already supported HTTP compression and conditional GET, although some had criticized Live Search for using more bandwidth than other search engine crawlers. With these enhancements, webmasters who have these features enabled on their servers should notice a reduction in bandwidth use.

Opinions expressed in the article are those of the guest author and not necessarily Search Engine Land.

About The Author: Vanessa Fox is a Contributing Editor at Search Engine Land. She built Google Webmaster Central and went on to found the software and consulting company Nine By Blue and create Blueprint Search Analytics, which she later sold. Her book, Marketing in the Age of Google (updated edition, May 2012), provides a foundation for incorporating search strategy into organizations of all levels. Follow her on Twitter at @vanessafox.
