Google Webmaster Tools Revamps Crawl Errors, But Is It For The Better?
Google has just revamped the crawl errors data available in webmaster tools. Crawl errors are issues Googlebot encountered while crawling your site, so useful stuff! I originally started this article by writing that in most cases, these changes are for the better and in only a few (really maddening) cases, useful functionality has been removed. But now that I’ve gone through the changes, I unfortunately need to revise my summary. This update is mostly about removing super useful data, masked by a few user interface changes. (And I hate to write that, because webmaster tools is near and dear to my heart.)
Update 3/17/12: After talking with Google, I’ve learned that most of the detail I was disappointed to find removed, and that I consider useful for power users, is in fact still available through the API! I’ve dug into the details and have written up my findings. I’ve also updated this story with additional details from Google:
- Access denied errors include 401, 403, and 407. That some of these were showing up as “other” was a bug that has since been fixed.
- Not followed errors are indeed URLs that returned either a 301 or 302 where Googlebot then had trouble crawling the redirect.
So what’s changed?
Site vs. URL Errors
Site errors are categorized as:
- DNS – These errors include things like DNS lookup timeout, domain name not found, and DNS error. (Although these specifics are no longer listed, as described more below.)
- Server Connectivity – These errors include things like network unreachable, no response, connection refused, and connection reset. (These specifics are also no longer listed.)
- Robots.txt Fetch – These errors are specific to the robots.txt file. If Googlebot receives a server error when trying to access this file, they have no way of knowing if a robots.txt file exists, and if so, what pages it blocks, so they stop the crawl until they no longer get an error when attempting to fetch it.
URL errors are categorized as:
- Server error – These are 5xx errors (such as 503 for server maintenance)
- Soft 404 – These are URLs that are detected as returning an error page but don’t return a 404 response code (they typically have a response code of 200 or 301/302). Error pages that don’t return a 404 can hurt crawl efficiency as Googlebot can end up crawling these pages instead of valid pages you want indexed. In addition, these pages can end up in search results, which is not an ideal searcher experience.
- Access denied – These are URLs that returned a 401, 403, or 407 response code. Often this simply means that the URLs prompt for a login, which is likely not an error. You may, however, want to block these URLs from crawling to improve crawl efficiency.
- Not found – Typically, these are URLs that return a 404 or 410.
- Not followed – (updated) These are URLs that triggered redirects that Googlebot had trouble crawling (for instance, because of a redirect loop). The UI lists whether the URL initially returned a 301 or 302, but doesn’t provide the details of the redirect error.
- Other – This is a catch-all that includes all other errors.
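The status-code mapping in the list above can be sketched in a few lines of Python. This is only an illustration of the categories as described in this article, not Google's actual logic. The function name and the two boolean flags (`redirect_failed` and `looks_like_error_page`) are hypothetical stand-ins for checks a crawler would have to make on its own.

```python
def classify_url_error(status_code, redirect_failed=False, looks_like_error_page=False):
    """Map one crawl result to a URL error category from the list above.

    Assumptions (not from Google): `redirect_failed` means the crawler could
    not complete a redirect (for example, a loop), and `looks_like_error_page`
    means the page content was judged to be an error page.
    Returns None when the result is not an error.
    """
    # A 301/302 whose redirect could not be crawled.
    if redirect_failed and status_code in (301, 302):
        return "Not followed"
    # An error page that answers with 200 or a redirect instead of a 404.
    if looks_like_error_page and status_code in (200, 301, 302):
        return "Soft 404"
    if 500 <= status_code <= 599:
        return "Server error"
    if status_code in (401, 403, 407):
        return "Access denied"
    if status_code in (404, 410):
        return "Not found"
    # Ordinary successes and cleanly followed redirects are not errors.
    if 200 <= status_code <= 399:
        return None
    # Catch-all for everything else.
    return "Other"
```

For example, `classify_url_error(503)` falls into the server error bucket, while `classify_url_error(200, looks_like_error_page=True)` is treated as a soft 404.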
Trends Over Time
Google now shows trends over the last 90 days for each error type. The daily count seems to be the aggregate count of how many URLs with that error type Google knows about, not the number crawled that particular day. As Google recrawls a URL and no longer gets the error, it’s removed from the list (and the count). In addition, Google still lists the date Googlebot first encountered the error, but now when you click the URL to see the details, you can see the last time Googlebot tried to access the URL as well.
Priorities and Fixed Status
Google says they are now listing URLs in priority order, based on a “multitude” of factors, including whether or not you can fix the problem, if the URL is listed in your Sitemap, if it gets a lot of traffic, and how many links it has. You can mark a URL as fixed and remove it from the list. However, once Google recrawls that page, if the error still exists, it will return to the list.
Google suggests using the Fetch as Googlebot feature to test your fix (and now has a button right on the details page to do so). But you are allowed only 500 fetches per account (not per site) each week, which I believe is an increase from the previous limit, so you should use this functionality judiciously.
What’s Gone Missing?
Unfortunately, several pieces of important functionality have been lost with this change.
- Ability to download all crawl error sources. Previously, you could download a CSV file that listed URLs that returned an error along with the pages that linked to those URLs. You could then sort that CSV by linking source to find broken links within your site and to get an easy list of sites to contact to fix links to important pages of your site. Now, the only way to access this information is to click on an individual URL to view its details, then click the Linked From tab. There seems to be no way to download this data, even at the individual URL level. (Update 3/17/12: This detail is still available from the API-based crawl errors feed.)
- 100K URLs of each type. Previously, you could download up to 100,000 URLs with each type of error. Now, both the display and download are limited to 1,000. Google says “less is more” and “there was no realistic way to view all 100,000 errors—no way to sort, search, or mark your progress.” Google is wrong. There were absolutely realistic ways to view, sort, search, and mark your progress. The CSV download made all of this easy using Excel. And more data is always better to see patterns, especially for large scale sites with multiple servers, content management systems, and page templates. A lot has been lost here. (Update 3/17/12: 100k URLs for each error is still available from the API-based crawl errors feed and API-based CSV download.)
- Redirect errors – Inexplicably, the “not followed” errors no longer seem to list errors like redirect loop and too many redirects. Instead it simply lists the response code returned (301 or 302). This seems weird to me (not to mention extraordinarily less useful) as 301s are followed just fine and typically aren’t an error at all (and 302s are only sometimes problematic), but all the redirect errors that used to be listed are critical to know about and fix. Listing URLs that return a 301 status code as “not followed” is misleading and alarming for no reason. And if this list of URLs is actually those with redirect errors, then omitting what that error is (such as too many redirects) makes this data incredibly non-useful. (Update 3/17/12: Confirmed with Google that this is a list of URLs that return either a 301 or 302 that Googlebot is subsequently unable to crawl. The specific issue is still available from the API-based crawl errors feed and API-based CSV download.)
- Specifics about soft 404s. The soft 404 report used to specify whether the URLs listed returned a 200 status code or redirected to an error page. But the status code column appears to be empty now. (Update 3/17/12: This detail is still available from the API-based crawl errors feed and API-based CSV download.)
- URLs blocked by robots.txt. Google says they removed this report because “while these can sometimes be useful for diagnosing a problem with your robots.txt file, they are frequently pages you intentionally blocked”. They say that similar information will soon be available in the crawler access section of webmaster tools. Why remove data you’re planning to replace before replacing it? Couldn’t they have just moved this report to the crawler access section? I get the feeling that they won’t be replacing this report as is, but providing less granular data in its place. While it’s true that this report didn’t list errors necessarily, it was very useful. You could skim the CSV to see if any sections of pages you expected to be indexed were blocked. And it was critical for diagnosis. Why aren’t certain pages indexed? You could check this report before spending extensive time debugging the issue. But now you can’t do either of those things. (Update 3/17/12: This report is still available from the API-based crawl errors feed and API-based CSV download.)
- Specifics about site level errors. The previous version of these reports listed the specific problem (such as DNS lookup timeout or domain name not found). That was very helpful in digging into what was going on. Now, you only get the count for the general category, not the specifics of what kind of error it was within that category. (Update 3/17/12: This detail is still available from the API-based crawl errors feed and API-based CSV download.)
- Specific URLs with “site” level errors. Google says you don’t need to know the URL if the issue was at the site level. Mostly, this is likely true. But I’ve definitely encountered cases, particularly with DNS errors, where the error only happened with specific URLs, not the entire site. Knowing the URL that triggered the error would help track down issues in these cases. (Update 3/17/12: This detail is still available from the API-based crawl errors feed and API-based CSV download.)
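To make the CSV workflow from the first item above concrete, here is a minimal Python sketch of grouping error URLs by the page that links to them, then separating internal sources (broken links you can fix yourself) from external ones (sites you would need to contact). The column names `URL` and `Linked From`, the sample rows, and both function names are assumptions for illustration. They are not the actual format of the Webmaster Tools download.

```python
import csv
import io
from collections import defaultdict
from urllib.parse import urlparse


def group_errors_by_source(csv_text, url_col="URL", source_col="Linked From"):
    """Return {linking page: [error URLs it links to]} from CSV text.

    The column names are hypothetical; adjust them to the real export.
    """
    by_source = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        source = (row.get(source_col) or "").strip()
        if source:  # skip error URLs with no known linking page
            by_source[source].append(row[url_col].strip())
    return dict(by_source)


def split_internal_external(by_source, site_host):
    """Separate linking pages on your own host from pages on other sites."""
    internal, external = {}, {}
    for source, urls in by_source.items():
        host = urlparse(source).netloc.lower()
        target = internal if host == site_host.lower() else external
        target[source] = urls
    return internal, external


# Made-up sample data standing in for a downloaded crawl errors CSV.
SAMPLE = """URL,Linked From
http://example.com/old-page,http://example.com/index.html
http://example.com/missing,http://example.com/index.html
http://example.com/old-page,http://other.example.org/links.html
"""

grouped = group_errors_by_source(SAMPLE)
internal, external = split_internal_external(grouped, "example.com")
```

With the sample data, `internal` holds the one page on example.com that links to two broken URLs, and `external` holds the single outside page whose owner you might contact.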
(Update 3/17/12: I got lots of additional detail from Google and, as noted above, am happy to report that I was at least partially wrong: most of this data is still available through the API. Power users who want this level of detail are likely to prefer the API anyway, so my disappointment has lessened.)
As for my comment in the earlier version of this story where I said that “I get the sense that many of these recent changes are designed to make the data easier for small site owners to use, and don’t really have the large enterprise-level site (or agency) in mind. For these latter organizations, more data is better, as we have systems to parse and crunch the data”, Google has told me:
“Our strategy for Webmaster Tools is to improve the web interface and provide important, actionable, and useful information. Our changes are designed to improve the experience for all of our users, including power users. For example, we made changes to have crawl errors going back 90 days and to show the full aggregate count of URL errors instead of just the previous 100,000 cap. Power users can still access the firehose of data through our original GData API. One of the improvements we made is to now display the full count of URL errors, and that should help give more accurate data to larger sites. For example, previously if one site has over 35 million Not Found errors, that number would have been capped and shown as 100,000 errors. Now, that site can see the new number and even see where the increase happened in the historical data. We think that’s a big improvement.”
The point about the total number of errors shown is certainly a good one. Very large sites are likely to have more than 100k errors, and knowing the significance of the problem is helpful in prioritizing.
Of course, in part, I’m sad to see features that I worked hard on launching when I was product manager for webmaster central be dismantled and made less useful. But mostly, as a frequent user of the product, I don’t want to lose useful functionality. Update 3/17/12: As noted above, I’m happy that a lot of this functionality is still available through the API. Read on for my dive into how to access these details through the API.
Opinions expressed in this article are those of the guest author and not necessarily Search Engine Land. Staff authors are listed here.