Up Close & Personal With Robots.txt
The Robots.txt Summit at Search Engine Strategies New York 2007 was the latest in a series of special sessions intended to open a dialog between search engine representatives and web site publishers. Past summits featured discussion of comment spam on blogs, indexing issues and redirects. This latest summit focused on the humble but terribly important robots.txt file.
Danny Sullivan moderated, with panelists Keith Hogan, Director of Program Management, Search Technology, Ask.com; Sean Suchter, Director of Yahoo Search Technology, Yahoo Search; Dan Crow, Product Manager, Google; and Eytan Seidman, Senior Program Manager Lead, Live Search. The Robots.txt Summit session was not about how to use the robots.txt file. Rather, as Danny Sullivan explained, “We’re assuming you know how to use it and are frustrated with it. This is about how you want to see it evolve.”
For a potentially dry and technical subject, the panel turned out to be quite interesting, with panelists offering some valuable tips.
All engines on board with Sitemaps.org
One of the announcements made during the week of SES was that Ask.com had joined Google, MSN and Yahoo in supporting Sitemaps autodiscovery. This feature allows webmasters to specify the location of their sitemaps within their robots.txt file. Keith Hogan of Ask.com mentioned this change and its impact in his presentation: it will eliminate the need to submit sitemaps to each engine separately. Essentially, a sitemap is a simple XML file that lists URLs and information about those URLs to help spiders do a better job of crawling a site. See www.sitemaps.org for more details.
Ask.com’s Keith Hogan further explained that robots.txt and sitemaps are related in that both are intended to control how crawlers interact with your site. The sitemap works at a fine-grained level: it identifies pages on your site, how old each page is, how frequently it is updated, which paths are more important than others, and so on. By contrast, the robots.txt simply tells the robot which pages not to index.
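To make the autodiscovery idea concrete, here is a minimal sketch using Python's standard urllib.robotparser module (the site_maps() method requires Python 3.8 or later). The robots.txt content and the example.com URLs are placeholders invented for illustration, not taken from any real site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; example.com is a placeholder domain.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/

Sitemap: https://www.example.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# site_maps() returns the Sitemap URLs declared in the file,
# or None when the file declares none.
sitemaps = parser.site_maps()
print(sitemaps)
```

A crawler that reads the file this way learns where the sitemap lives without the webmaster submitting it to each engine by hand.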
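As a rough illustration of the page-level information a sitemap carries, the sketch below builds a small sitemap with Python's standard xml.etree.ElementTree module. The element names (loc, lastmod, changefreq, priority) and the namespace follow the protocol published at sitemaps.org; the pages and their values are made up for the example.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

# Hypothetical pages; every value here is illustrative only.
PAGES = [
    {"loc": "https://www.example.com/", "lastmod": "2007-04-10",
     "changefreq": "daily", "priority": "1.0"},
    {"loc": "https://www.example.com/archive/", "lastmod": "2006-11-02",
     "changefreq": "yearly", "priority": "0.3"},
]


def build_sitemap(pages):
    """Return a sitemap document as a string, one <url> entry per page."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for page in pages:
        url = ET.SubElement(urlset, "url")
        for tag in ("loc", "lastmod", "changefreq", "priority"):
            ET.SubElement(url, tag).text = page[tag]
    return ET.tostring(urlset, encoding="unicode")


sitemap_xml = build_sitemap(PAGES)
print(sitemap_xml)
```

Each entry describes a single URL, which is the fine-grained counterpart to the path-level rules in robots.txt.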
Misuse of Robots.txt abounds
Sadly, one of the recurring themes of the session was that most webmasters don’t use the file or use it incorrectly.
In his introductory presentation, Keith Hogan provided some quick facts on the robots.txt. He said fewer than 35% of servers have a robots.txt and that the majority of robots.txt files are copied from one found online or are provided by a hosting site.
Keith showed his favorite example of how poorly the file is understood: a robots.txt file containing a comment telling spiders to “Crawl during off peak hours.”
Dan Crow of Google said they frequently get complaints that sites aren’t included in Google’s index, and on investigation Google finds the site has a robots.txt that prohibits indexing. Sean Suchter from Yahoo echoed that problem, saying Yahoo frequently finds accidental exclusion as well.
Eytan Seidman of Live.com showed another example of misuse of a robots.txt file. He opened a live browser window and pulled up the robots.txt for hilton.com. The first two lines are comments directed to the search engines telling them NOT to visit during the day!
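Accidental exclusion of this kind is easy to test for before a file goes live. The following is a minimal sketch using Python's standard urllib.robotparser; the helper name is_blocked, the crawler name ExampleBot and the example.com URL are all hypothetical.

```python
from urllib.robotparser import RobotFileParser


def is_blocked(robots_txt, url, user_agent="ExampleBot"):
    """Return True if the given robots.txt text forbids fetching url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(user_agent, url)


# "Disallow: /" shuts every compliant crawler out of the whole site.
BLOCK_ALL = "User-agent: *\nDisallow: /\n"
# An empty Disallow value blocks nothing.
BLOCK_NOTHING = "User-agent: *\nDisallow:\n"

page = "https://www.example.com/products/"
print(is_blocked(BLOCK_ALL, page))
print(is_blocked(BLOCK_NOTHING, page))
```

The two files differ by a single slash, which is why a copied or hastily edited robots.txt can remove a site from the index without anyone noticing.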
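Why such comments accomplish nothing can be shown with a few lines of Python. This is a simplified sketch of how a parser might reduce a robots.txt file to its directives, not the code any engine actually runs; the file content is invented and is not a copy of any real site's file.

```python
def directives(robots_txt):
    """Return (field, value) pairs, discarding comments and blank lines."""
    pairs = []
    for raw in robots_txt.splitlines():
        # Everything from '#' to the end of the line is a comment.
        line = raw.split("#", 1)[0].strip()
        if not line or ":" not in line:
            continue
        field, value = line.split(":", 1)
        pairs.append((field.strip().lower(), value.strip()))
    return pairs


# Hypothetical file with a request written as a comment.
ROBOTS_TXT = """\
# Please crawl during off-peak hours only
User-agent: *
Disallow: /cgi-bin/
"""

parsed = directives(ROBOTS_TXT)
print(parsed)
```

The comment never reaches the list of directives, so a request written that way is invisible to the crawler.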
Danny showed a humorous use of the robots.txt file by Brett Tabke of Webmaster World, who actually uses his robots.txt as a blog.
Much of the session was open dialog between the engines and webmasters with Danny moderating and steering the session back on track. Here’s a summary of the main discussion points.
Time for Robots.txt to evolve?
The original concept for the Robots.txt file came about in 1994. It was updated in 1996, but there’s been no widespread movement to update the standard since then. The representatives from the engines wanted feedback on whether it’s time to change the format into XML or HTML.
Keith Hogan of Ask.com mentioned that it might improve accuracy, control and understanding of the file. He envisioned a file where a webmaster could define crawler-allow groups, disallow groups and disallow paths, so that crawler actions would make better sense.
Some audience members thought moving to XML had merit and went on to ask if it should be part of the sitemaps standard.
Audience member Dave Naylor posed an interesting question: if you have both XML and TXT files, which gets priority? Dan Crow of Google replied that he was wary of moving to an XML format because of this issue. He also mentioned that people have a harder time writing an XML file and run into more problems than with a TXT file. He has seen a number of invalid robots files, including some that contain JPEG graphics. He is concerned that the risk of malformed data is higher with XML files than with simple text files.
Should there be an authenticated crawl?
One member of the audience wanted to develop a way to verify that a robot is who it says it is, and the ability to authenticate a robot before allowing it to crawl a site. Danny asked the audience how many wanted authenticated crawling. Only a few hands went up.
Dan Crow said that any authentication method must be designed carefully. Whereas all the search engines’ spiders obey the robots.txt, there are many spiders that will choose to ignore the file and crawl your site anyway.
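The panel did not describe a specific mechanism, but one check that needs no change to robots.txt is a reverse DNS lookup followed by a forward lookup: the visitor's IP address should resolve to a hostname under the crawler operator's domain, and that hostname should resolve back to the same IP. The sketch below assumes that approach. The function name verify_crawler, the domain suffix and the IP addresses are all hypothetical, and the stand-in resolvers exist only so the example runs without network access.

```python
import socket


def verify_crawler(ip, allowed_suffixes,
                   reverse=socket.gethostbyaddr,
                   forward=socket.gethostbyname_ex):
    """Return True if ip reverse-resolves to an allowed host that maps back to ip."""
    try:
        hostname = reverse(ip)[0]
    except OSError:
        return False
    # The hostname must belong to a domain the crawler operator controls.
    if not hostname.endswith(tuple(allowed_suffixes)):
        return False
    try:
        addresses = forward(hostname)[2]
    except OSError:
        return False
    # The forward lookup guards against a spoofed reverse DNS record.
    return ip in addresses


# Offline demonstration with stand-in resolvers (hypothetical data).
def fake_reverse(ip):
    return ("crawl-1.bot.example.com", [], [ip])


def fake_forward(hostname):
    return (hostname, [], ["192.0.2.10"])


SUFFIXES = [".bot.example.com"]
print(verify_crawler("192.0.2.10", SUFFIXES, fake_reverse, fake_forward))
print(verify_crawler("192.0.2.99", SUFFIXES, fake_reverse, fake_forward))
```

With the default arguments the function performs real DNS queries, so a site would normally cache the results rather than look up every request.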
Should we add controls that correspond to user patterns?
Some sites have peaks in the middle of the day, followed by lulls at night. Keith Hogan mentioned that they are actually seeing evidence that webmasters are changing out the robots.txt file during the day in an effort to control crawling and crawl rate. Danny asked how many wanted time zone control for crawlers visiting a site. A few hands went up.
One audience member mentioned that Google supports crawl rate control through sitemaps. Dan Crow mentioned that Google already adjusts the rate of crawl based on a web server’s response time.
Sean Suchter of Yahoo requested feedback on rate control. He explained how Yahoo currently supports crawl delay, but it is frequently misused to the detriment of the site. He gave an example where a news site listed a crawl delay of 40. Sean explained that the 40 means a 40-second wait between requests, which leaves too little time to get through a large news site, so the result is that Yahoo can never actually crawl the whole site.
Sean went on to say he would like to replace the existing method with a different implementation that accomplishes the same goal for webmasters but is less error-prone. To develop this he needs to know what site owners are trying to control. Bandwidth reduction? GET reduction? Database load reduction? Keeping overall server load down?
Danny asked whether it would be better to express the limit in megabits per day. The audience didn’t like that option. One audience member mentioned that server load is the concern, but it is hard to express server load. Another member suggested using the number of open connections as a guide. Another suggested identifying a specific domain server for crawling and outlawing others.
Danny asked how many in the audience have had problems with the big engines shutting them down due to excessive server load. Only about four hands went up.
Danny asked how many wanted some time delay for control, and how many wanted to get rid of time delay completely to override stupidity. There were a few hands and chuckles in response. In the end it was estimated that only 10% of the audience said they needed time-of-day control.
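The arithmetic behind Sean's example is worth spelling out. The sketch below reads a Crawl-delay value with Python's standard urllib.robotparser (the crawl_delay() method requires Python 3.6 or later) and works out the daily fetch ceiling, assuming the delay is the wait between successive requests from one crawler. The crawler name ExampleBot and the 100,000-page site size are hypothetical.

```python
import math
from urllib.robotparser import RobotFileParser

SECONDS_PER_DAY = 24 * 60 * 60

# Hypothetical robots.txt using the delay value from the example above.
ROBOTS_TXT = """\
User-agent: ExampleBot
Crawl-delay: 40
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

delay = parser.crawl_delay("ExampleBot")  # seconds between requests
max_fetches_per_day = SECONDS_PER_DAY // delay

# Illustrative site size, not a figure from the session.
site_pages = 100_000
days_for_one_pass = math.ceil(site_pages / max_fetches_per_day)

print(delay, max_fetches_per_day, days_for_one_pass)
```

At one request every 40 seconds a crawler gets 2,160 fetches a day, so under these assumptions a single pass over the site takes about a month and a half, by which time a news site's content has long since changed.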
Should the standardization of the protocol be revived?
Dan Crow of Google explained how the robots.txt was created in June 1994 and had become a de facto standard. He suggested it may be time to revive the standardization effort in order to develop a consistent syntax and an improved common set of core features.
Partial page controls needed?
Historically the main purpose of robots.txt was to identify entire pages that shouldn’t be indexed. But now that pages have become increasingly complex, there can be many components of a page that you may not want indexed. Danny and several of the engine representatives asked whether it was important to have the ability to tell the robots not to index specific sections of pages.
For example, should content in navigation be excluded? Many hands were raised. Another audience member volunteered that his firm ran a financial site and is required to include legal verbiage on every page. They would like to be able to tell the engines not to index that part of their pages.
There appeared to be overwhelming support for developing a way to block certain parts of a page from indexing.
One audience member asked: if this were adopted and a webmaster marked their navigation to be ignored, would the engines ignore only the navigation content and still follow the links? Sean Suchter said that even if navigation links were not indexed, crawlers would still follow them.
A side note worth mentioning was the discussion of where it might be appropriate to tell the engines not to index duplicate content areas or possible spider traps like session IDs and affiliate IDs. However, it may not be appropriate to disallow engines from indexing the style sheets used to format a web page. Dan Crow explained that blocking access to the CSS might give the appearance that you were abusing it, so he did not recommend disallowing engines from indexing your CSS files.
After the session, I wondered if this method of giving a webmaster the ability to tell the engine to ignore parts of the page could be abused and turned into a legal form of cloaking.
Is it optional to follow robots.txt?
This question came from an audience member wondering if the search engines were bound to follow the robots.txt indexing requests. Dan Crow responded that all the search engines follow the robots.txt protocol, although it is possible that other spiders do not.
If a page isn’t listed in the sitemaps can it still be indexed?
One member of the audience wanted clarification on whether a page will be crawled if it is not listed in a sitemap.
Dan Crow of Google explained that even if your sitemap only lists part of the pages on the site, the engine will still continue to crawl your site as part of the standard process. Sitemaps are just extra information and do not replace crawling for discovering URLs.
Privacy and Robots.txt
One member of the audience voiced privacy concerns in the situation where the engines don’t index a page but nevertheless list its URL in search results. The member wanted the robots file to be private.
The engines responded as follows: Microsoft said it is looking at how to handle partial indexing. Yahoo said you can delete URLs and paths in Site Explorer. Google allows you to remove URLs. Google’s Vanessa Fox, who was in the audience, offered up another solution: use the meta NOINDEX tag and Google won’t index the URL.
Danny then took the question a step further and asked how many don’t want the URL listed at all. He gave the example of the Library of Congress, which had a robots.txt that kept the engines from indexing the site. Danny asked how many think the site owner is always right. Most hands went up.
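For readers who want to audit their own pages for the tag Vanessa Fox mentioned, here is a minimal sketch that detects a robots meta tag containing noindex, using Python's standard html.parser module. The class and function names and the sample HTML are invented for illustration, and the matching is deliberately simple.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() != "robots":
            return
        content = attr.get("content") or ""
        # Directives are comma-separated and case-insensitive.
        self.directives.update(
            part.strip().lower() for part in content.split(",") if part.strip()
        )


def has_noindex(html):
    """Return True if the page carries a robots meta tag with noindex."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return "noindex" in parser.directives


# Hypothetical pages for demonstration.
PRIVATE_PAGE = '<html><head><meta name="robots" content="NOINDEX, follow"></head></html>'
PUBLIC_PAGE = "<html><head><title>Welcome</title></head></html>"

print(has_noindex(PRIVATE_PAGE))
print(has_noindex(PUBLIC_PAGE))
```

Keep in mind that a crawler can only see this tag on pages it is allowed to fetch, so the tag has no effect on a page that robots.txt already blocks.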
The biggest takeaway of this session is that these “summits” work: they are an opportunity for web publishers to conduct open discussion with representatives of the engines. For an industry cloaked in secrecy and mistrust, this is a welcome change and another indication our industry is starting to mature. See you next summit.
Opinions expressed in this article are those of the guest author and not necessarily Search Engine Land.