MSNbot 1.1: Live Search Implements A More Efficient Crawl


Today, Microsoft announced changes to their Live Search crawler intended to reduce the bandwidth consumed when crawling a site. MSNbot (upgraded to version 1.1) now supports both HTTP compression and conditional get. The post on the Live Search Webmaster Center blog describes each feature in detail and links to tools you can use to check whether your server supports them.

  • HTTP compression lets a server compress files before sending them to the search engine crawlers (and browsers) that request them, so less data crosses the wire.
  • Conditional get lets the crawler ask a server if the page has been changed since the last request (using the If-Modified-Since header). If the content hasn’t changed, a server that supports conditional get returns a 304 response (not modified). When the crawler gets this response, it doesn’t download the page contents (and continues to use the version already downloaded).
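The two features above can be sketched from the server's side in a few lines of Python. This is only an illustration under stated assumptions: `respond` is a hypothetical helper (not part of MSNbot or any web server), the page body and modification date are made up, and a real server handles many more cases.

```python
import gzip
from datetime import datetime, timezone
from email.utils import format_datetime, parsedate_to_datetime


def respond(request_headers, body, last_modified):
    """Return (status, headers, payload) for a GET of one page.

    Hypothetical helper: models only conditional get and gzip compression.
    """
    response_headers = {"Last-Modified": format_datetime(last_modified, usegmt=True)}

    # Conditional get: if the crawler's copy is at least as new, send 304 and no body.
    since = request_headers.get("If-Modified-Since")
    if since is not None:
        try:
            if last_modified <= parsedate_to_datetime(since):
                return 304, response_headers, b""
        except (TypeError, ValueError):
            pass  # Unusable date: fall through and send the full page.

    # HTTP compression: compress only when the client says it accepts gzip.
    if "gzip" in request_headers.get("Accept-Encoding", ""):
        response_headers["Content-Encoding"] = "gzip"
        return 200, response_headers, gzip.compress(body)

    return 200, response_headers, body


# Made-up page and modification time, for illustration only.
page = b"<html>" + b"crawl me " * 500 + b"</html>"
modified = datetime(2008, 2, 1, 12, 0, 0, tzinfo=timezone.utc)

# First visit: the crawler has no copy yet, so it gets the full (compressed) page.
status, headers, payload = respond({"Accept-Encoding": "gzip"}, page, modified)

# Return visit: the crawler sends back the date it saw and gets a 304 with no body.
revisit = {"If-Modified-Since": headers["Last-Modified"], "Accept-Encoding": "gzip"}
status2, _, payload2 = respond(revisit, page, modified)
```

On the first request the payload is the gzip-compressed page; on the return visit the server sends only the 304 status and headers, which is where the bandwidth saving comes from.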

As the blog post notes, the other major search engines support these features as well.

Google: Google overhauled their crawler to reduce bandwidth usage in 2006 as part of the “Bigdaddy” infrastructure change. With this effort, Googlebot increased support for HTTP compression and started using a crawl caching proxy. The Google webmaster help center describes Googlebot’s handling of conditional get, which is similar to MSNbot’s.

Yahoo!: In 2005, Yahoo! announced support of both HTTP compression and conditional get.

Ask: Ask’s webmaster documentation includes information about HTTP compression, although it doesn’t mention conditional get support.

Cache Dates

In 2006, Google changed how it displays cache dates of pages to reflect the most recent visit to the page, rather than the most recent download of the page. Live Search matches Google’s current functionality, showing the last time MSNbot visited the page as the cache date. Yahoo! doesn’t display a cached date.

Other Ways To Reduce Search Engine Crawler Bandwidth

If search engine crawlers use too much bandwidth on your site, even once your server has HTTP compression and conditional get turned on, you can use additional methods to reduce bandwidth consumption. However, keep in mind that unlike HTTP compression and conditional get, these other methods could potentially reduce the number of indexed pages.

Crawl Delay

Live Search, Yahoo!, and Ask all support the crawl-delay instruction in robots.txt (Google is the lone holdout). You specify the crawl-delay in seconds, which indicates how long the crawler should wait between page fetches.

A robots.txt file that directs all crawlers to wait five seconds between each page fetch looks as follows:

user-agent: *
crawl-delay: 5
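As a quick sanity check, you can confirm how a parser reads such a file. The sketch below uses Python's standard `urllib.robotparser` module (its `crawl_delay` method requires Python 3.6 or later). It shows how one parser interprets the file, not necessarily how any particular search engine crawler does.

```python
from urllib.robotparser import RobotFileParser

# The same two-line robots.txt, one directive per line.
robots_txt = """\
user-agent: *
crawl-delay: 5
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# The wildcard entry applies to any crawler name, including "msnbot".
delay = parser.crawl_delay("msnbot")
print(delay)
```

Note that each directive has to sit on its own line; a parser will not read both instructions from a single line.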

The Live Search webmaster help notes that the news crawler doesn’t follow the crawl-delay instruction:

“Live Search also uses a dedicated crawler to crawl certain types of sites at high frequency. The msnbot-NewsBlogs/1.0 news crawler helps provide current results for our news site. The msnbot-NewsBlogs/1.0 does not adhere to the crawl-delay settings.

If you find that MSNBot is still placing too high a load on your web server, contact Site Owner Support.”

In a recent interview, Matt Cutts of Google explained why Google doesn’t support crawl-delay:

“I believe the only robots.txt extension in common use that Google doesn’t support is the crawl-delay. And, the reason that Google doesn’t support crawl-delay is because way too many people accidentally mess it up. For example, they set crawl-delay to a hundred thousand, and, that means you get to crawl one page every other day or something like that.
We have even seen people who set a crawl-delay such that we’d only be allowed to crawl one page per month. What we have done instead is provide throttling ability within Webmaster Central, but crawl-delay is the inverse; it’s saying crawl me once every “n” seconds. In fact what you really want is host-load, which lets you define how many Googlebots are allowed to crawl your site at once. So, a host-load of two would mean, 2 Googlebots are allowed to be crawling the site at once.”
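To put the numbers in that quote in perspective, here is a rough back-of-the-envelope calculation. It assumes a single crawler that fetches pages one after another and ignores the time each fetch takes, so treat the results as upper bounds; the function names are made up for this illustration.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def max_pages_per_day(crawl_delay_seconds):
    """Most pages one crawler could fetch in a day at this crawl-delay."""
    return SECONDS_PER_DAY / crawl_delay_seconds


def days_per_page(crawl_delay_seconds):
    """Days the crawler must wait between two page fetches."""
    return crawl_delay_seconds / SECONDS_PER_DAY


# A five-second delay still allows 17,280 fetches a day.
print(max_pages_per_day(5))

# A delay of 100,000 seconds means waiting more than a day between pages.
print(round(days_per_page(100_000), 2))
```

A crawl-delay of 100,000 works out to a little over one day per page, which is the kind of accidental setting the quote describes.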

The crawl rate feature in Google’s Webmaster Central provides information about Googlebot’s current bandwidth usage and enables webmasters to request a slower crawl (and in some cases, a faster crawl).

Using robots.txt To Reduce Bandwidth

You can block pages or directories from being crawled to reduce overall bandwidth. If you have large portions of your site that you don’t want (or need) indexed, you can use robots.txt to block search engine crawlers from accessing them. Note that use of robots meta tags will keep the pages out of the index, but won’t achieve the bandwidth reduction goals, as the crawlers have to access the pages to read the meta tags.
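As a minimal sketch of blocking a directory, the example below checks a disallow rule with Python's standard `urllib.robotparser` module. The `/archive/` directory and the `www.example.com` URLs are hypothetical, chosen only for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that keeps all crawlers out of one directory.
robots_txt = """\
user-agent: *
disallow: /archive/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# Pages under the blocked directory may not be fetched...
blocked = parser.can_fetch("msnbot", "http://www.example.com/archive/2007/page.html")

# ...while the rest of the site stays crawlable.
allowed = parser.can_fetch("msnbot", "http://www.example.com/products/")

print(blocked, allowed)
```

Because a compliant crawler never requests the blocked URLs, they cost no bandwidth, unlike pages carrying a robots meta tag.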

Many webmasters assumed MSNbot already supported HTTP compression and conditional get, although some had criticized Live Search for using more bandwidth than other search engine crawlers. With these enhancements, webmasters who have these features enabled on their servers should notice a bandwidth reduction.



Vanessa Fox is a Contributing Editor at Search Engine Land. Called a “cyberspace visionary” by Seattle Business Monthly, she is an expert in understanding customer acquisition from organic search. She shares her perspective on how this impacts marketing and user experience at ninebyblue.com and provides authoritative search-friendly design patterns for developers at janeandrobot.com.
