Google Webmaster Tools Revamps Crawl Errors, But Is It For The Better?

Google has just revamped the crawl errors data available in webmaster tools. Crawl errors are issues Googlebot encountered while crawling your site, so useful stuff! I originally started this article by writing that in most cases, these changes are for the better and in only a few (really maddening) cases, useful functionality has been removed. But now that I’ve gone through the changes, I unfortunately need to revise my summary. This update is mostly about removing super useful data, masked by a few user interface changes. (And I hate to write that, because webmaster tools is near and dear to my heart.)

Update 3/17/12: After talking with Google, I’ve learned that most of what I was disappointed to find removed (detail I feel is useful for power users) is in fact still available through the API! I’ve dug into the details and have written up my findings. I’ve also updated this story with additional details from Google:

  • Access denied errors include 401, 403, and 407. That some of these were showing up as “other” was a bug that has since been fixed.
  • Not followed errors are indeed URLs that returned either a 301 or 302 where Googlebot then had trouble crawling the redirect.

So what’s changed?

Site vs. URL Errors

Crawl errors have been organized into two categories: site errors and URL errors. Site errors are those that are likely site-wide, as opposed to URL-specific.

[Screenshot: Google site errors]

Site errors are categorized as:

  • DNS – These errors include things like DNS lookup timeout, domain name not found, and DNS error. (Although these specifics are no longer listed, as described in more detail below.)
  • Server Connectivity – These errors include things like network unreachable, no response, connection refused, and connection reset. (These specifics are also no longer listed.)
  • Robots.txt Fetch – These errors are specific to the robots.txt file. If Googlebot receives a server error when trying to access this file, it has no way of knowing whether a robots.txt file exists and, if so, which pages it blocks, so it stops the crawl until it no longer gets an error when attempting to fetch the file. (See the sketch after this list for how a crawler might handle each of these outcomes.)
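
To make the robots.txt behavior concrete, here is a minimal sketch (in Python, and emphatically not Google’s actual implementation) of how a crawler might react to the three outcomes described above: a successful fetch, a missing file, and a server or connection error.

```python
import urllib.request
import urllib.error


def robots_txt_outcome(site):
    """Fetch robots.txt and decide how a crawler might proceed."""
    url = site.rstrip("/") + "/robots.txt"
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            rules = resp.read().decode("utf-8", errors="replace")
            return "crawl", rules    # 200: parse the rules, then crawl what's allowed
    except urllib.error.HTTPError as e:
        if e.code == 404:
            return "crawl", ""       # no robots.txt at all: treat nothing as blocked
        return "pause", ""           # 5xx etc.: can't tell what's blocked, so stop crawling
    except urllib.error.URLError:
        return "pause", ""           # DNS or connection failure: also stop


decision, robots_rules = robots_txt_outcome("https://example.com")
print(decision)
```
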
URL errors are page-specific.

[Screenshot: Google page-level errors]

URL errors are categorized as:
  • Server error – These are 5xx errors (such as a 503 returned during server maintenance).
  • Soft 404 – These are URLs that are detected as returning an error page but don’t return a 404 response code (they typically have a response code of 200 or 301/302). Error pages that don’t return a 404 can hurt crawl efficiency, as Googlebot can end up crawling these pages instead of valid pages you want indexed. In addition, these pages can end up in search results, which is not an ideal searcher experience. (A rough detection sketch follows this list.)
  • Access denied – These are URLs that returned a 401, 403, or 407 response code. Often this simply means that the URLs prompt for a login, which is likely not an error. You may, however, want to block these URLs from crawling to improve crawl efficiency.
  • Not found – Typically, these are URLs that return a 404 or 410.
  • Not followed – (updated) These are URLs that triggered redirects that Googlebot had trouble crawling (for instance, because of a redirect loop). The UI lists whether the URL initially returned a 301 or 302, but doesn’t provide the details of the redirect error.
  • Other – This is a catch-all that includes all other errors.
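
As a rough illustration of the soft 404 idea (not Google’s detection logic), the sketch below fetches a URL and flags it when the final response code is 200 but the content reads like an error page. The hint phrases are just examples made up for the sketch.

```python
import urllib.request

# Example phrases only; real soft-404 detection is far more sophisticated.
ERROR_HINTS = ("page not found", "no longer available", "doesn't exist", "error 404")


def looks_like_soft_404(url):
    req = urllib.request.Request(url, headers={"User-Agent": "soft404-check/0.1"})
    with urllib.request.urlopen(req, timeout=10) as resp:
        status = resp.getcode()    # status of the final response, after any redirects
        body = resp.read().decode("utf-8", errors="replace").lower()
    # A real 404/410 is fine; a 200 whose content reads like an error page is the problem.
    return status == 200 and any(hint in body for hint in ERROR_HINTS)


print(looks_like_soft_404("https://example.com/some-deleted-page"))
```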

Trends Over Time

Google now shows trends over the last 90 days for each error type. The daily count seems to be the aggregate count of how many URLs with that error type Google knows about, not the number crawled that particular day. As Google recrawls a URL and no longer gets the error, it’s removed from the list (and the count). In addition, Google still lists the date Googlebot first encountered the error, but now when you click the URL to see the details, you can see the last time Googlebot tried to access the URL as well.
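
To illustrate that reading of the daily count (my interpretation of the UI, not a documented formula), the sketch below computes an “open errors” number per day from made-up first-detected and resolved dates; a URL keeps contributing to each day’s count until Googlebot recrawls it without the error.

```python
from datetime import date, timedelta

# (url, first detected, resolved on a clean recrawl, or None if still failing) -- made-up data
errors = [
    ("/old-page",    date(2012, 1, 10), date(2012, 2, 20)),
    ("/broken-link", date(2012, 2, 1),  None),
    ("/moved",       date(2012, 2, 15), None),
]

start = date(2012, 2, 18)
for offset in range(5):    # a few sample days of the 90-day trend line
    day = start + timedelta(days=offset)
    open_count = sum(
        1 for _, first_seen, resolved in errors
        if first_seen <= day and (resolved is None or day < resolved)
    )
    print(day, open_count)  # drops from 3 to 2 once /old-page is recrawled cleanly
```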

Priorities and Fixed Status

Google says they are now listing URLs in priority order, based on a “multitude” of factors, including whether or not you can fix the problem, if the URL is listed in your Sitemap, if it gets a lot of traffic, and how many links it has. You can mark a URL as fixed and remove it from the list. However, once Google recrawls that page, if the error still exists, it will return to the list.

Google suggests using the Fetch as Googlebot feature to test your fix (and in fact now has a button right on the details page to do so), but since you are allowed only 500 fetches per account (not per site) each week (which I believe has increased from the previous limit), you should use this functionality judiciously.
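
Given that quota, a purely illustrative way to ration it (the fields and ordering below are my assumptions, not Google’s priority factors) is to verify only the highest-priority URLs you’ve marked as fixed each week:

```python
# Hypothetical priority signals for deciding which fixed URLs to verify first.
WEEKLY_FETCH_BUDGET = 500    # per account, not per site, at the time of writing

fixed_urls = [
    {"url": "/products/widget", "in_sitemap": True,  "monthly_visits": 12000, "inlinks": 85},
    {"url": "/blog/old-post",   "in_sitemap": False, "monthly_visits": 40,    "inlinks": 2},
    {"url": "/category/sale",   "in_sitemap": True,  "monthly_visits": 3000,  "inlinks": 30},
]


def priority(entry):
    # Sort by: listed in the Sitemap, then traffic, then inbound links.
    return (entry["in_sitemap"], entry["monthly_visits"], entry["inlinks"])


for entry in sorted(fixed_urls, key=priority, reverse=True)[:WEEKLY_FETCH_BUDGET]:
    print("verify with Fetch as Googlebot:", entry["url"])
```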

What’s Gone Missing?

Unfortunately, several pieces of important functionality have been lost with this change.

  • Ability to download all crawl error sources. Previously, you could download a CSV file that listed URLs that returned an error along with the pages that linked to those URLs. You could then sort that CSV by linking source to find broken links within your site, and you had a ready-made list of sites to contact about fixing links to important pages of your site. Now, the only way to access this information is to click on an individual URL to view its details, then click the Linked From tab. There seems to be no way to download this data, even at the individual URL level. (See the sketch after this list for the kind of processing the old download made easy.) (Update 3/17/12: This detail is still available from the API-based crawl errors feed.)
  • 100K URLs of each type. Previously, you could download up to 100,000 URLs with each type of error. Now, both the display and download are limited to 1,000. Google says “less is more” and “there was no realistic way to view all 100,000 errors—no way to sort, search, or mark your progress.” Google is wrong. There were absolutely realistic ways to view, sort, search, and mark your progress. The CSV download made all of this easy using Excel. And more data is always better for seeing patterns, especially for large-scale sites with multiple servers, content management systems, and page templates. A lot has been lost here. (Update 3/17/12: 100k URLs for each error type are still available from the API-based crawl errors feed and API-based CSV download.)
  • Redirect errors – Inexplicably, the “not followed” errors no longer seem to list errors like redirect loop and too many redirects. Instead, the report simply lists the response code returned (301 or 302). This seems weird to me (not to mention considerably less useful), as 301s are followed just fine and typically aren’t an error at all (and 302s are only sometimes problematic), but all the redirect errors that used to be listed are critical to know about and fix. Listing URLs that return a 301 status code as “not followed” is misleading and alarming for no reason. And if this list of URLs is actually those with redirect errors, then omitting what that error is (such as too many redirects) makes this data far less useful. (Update 3/17/12: Confirmed with Google that this is a list of URLs that return either a 301 or 302 that Googlebot is subsequently unable to crawl. The specific issue is still available from the API-based crawl errors feed and API-based CSV download.)
  • Specifics about soft 404s. The soft 404 report used to specify whether the URLs listed returned a 200 status code or redirected to an error page. But the status code column appears to be empty now.  (Update 3/17/12: This detail is still available from the API-based crawl errors feed and API-based CSV download.)
  • URLs blocked by robots.txt. Google says they removed this report because “while these can sometimes be useful for diagnosing a problem with your robots.txt file, they are frequently pages you intentionally blocked”. They say that similar information will soon be available in the crawler access section of webmaster tools. Why remove data you’re planning to replace before replacing it? Couldn’t they have just moved this report to the crawler access section? I get the feeling that they won’t be replacing this report as is, but providing less granular data in its place. While it’s true that this report didn’t necessarily list errors, it was very useful. You could skim the CSV to see if any sections of pages you expected to be indexed were blocked. And it was critical for diagnosis. Why aren’t certain pages indexed? You could check this report before spending extensive time debugging the issue. But now you can’t do either of those things. (Update 3/17/12: This report is still available from the API-based crawl errors feed and API-based CSV download.)
  • Specifics about site level errors. The previous version of these reports listed the specific problem (such as DNS lookup timeout or domain name not found). That was very helpful in digging into what was going on. Now, you only get the count for the general category, not the specifics of what kind of error it was within that category. (Update 3/17/12: This detail is still available from the API-based crawl errors feed and API-based CSV download.)
  • Specific URLs with “site” level errors. Google says you don’t need to know the URL if the issue was at the site level. Mostly, this is likely true. But I’ve definitely encountered cases, particularly with DNS errors, where the error only happened with specific URLs, not the entire site. Knowing the URL that triggered the error would help track down issues in these cases. (Update 3/17/12: This detail is still available from the API-based crawl errors feed and API-based CSV download.)
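
As an example of the kind of processing the old CSV download (and, it turns out, the API-based CSV) makes possible, here is a minimal sketch that groups error URLs by the page linking to them, so you can work through broken links source by source. The file name and column names are assumptions about the export format, not a documented schema.

```python
import csv
from collections import defaultdict

broken_by_source = defaultdict(list)

# "crawl-errors.csv" and its columns ("URL", "Linked from") are placeholders --
# adjust them to match whatever the export actually contains.
with open("crawl-errors.csv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        error_url = row["URL"]
        for source in row.get("Linked from", "").split():
            broken_by_source[source].append(error_url)

# Pages linking to the most error URLs come first: your own pages reveal internal
# broken links to fix; external pages give you a contact list for important URLs.
for source, targets in sorted(broken_by_source.items(), key=lambda kv: -len(kv[1])):
    print(source, len(targets))
```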

(Update 3/17/12: I got lots of additional detail from Google and, as noted above, am happy to report that I was at least partially wrong — most of this data is still available through the API. Power users who want this level of detail are likely to prefer the API anyway, so my disappointment has lessened.)

As for my comment in the earlier version of this story where I said that “I get the sense that many of these recent changes are designed to make the data easier for small site owners to use, and don’t really have the large enterprise-level site (or agency) in mind. For these latter organizations, more data is better, as we have systems to parse and crunch the data”, Google has told me:

“Our strategy for Webmaster Tools is to improve the web interface and provide important, actionable, and useful information. Our changes are designed to improve the experience for all of our users, including power users. For example, we made changes to have crawl errors going back 90 days and to show the full aggregate count of URL errors instead of just the previous 100,000 cap. Power users can still access the firehose of data through our original GData API. One of the improvements we made is to now display the full count of URL errors, and that should help give more accurate data to larger sites. For example, previously if one site has over 35 million Not Found errors, that number would have been capped and shown as 100,000 errors. Now, that site can see the new number and even see where the increase happened in the historical data. We think that’s a big improvement.”

The point about the total number of errors shown is certainly a good one. Very large sites are likely to have more than 100k errors, and knowing the significance of the problem is helpful in prioritizing.

Of course, in part, I’m sad to see features that I worked hard on launching when I was product manager for webmaster central being dismantled and made less useful. But mostly, as a frequent user of the product, I don’t want to lose useful functionality. Update 3/17/12: As noted above, I’m happy that a lot of this functionality is still available through the API. Read on for my dive into how to access these details through the API.

Opinions expressed in the article are those of the guest author and not necessarily Search Engine Land.

About The Author: Vanessa Fox is a Contributing Editor at Search Engine Land. She built Google Webmaster Central and went on to found software and consulting company Nine By Blue and create Blueprint Search Analytics, which she later sold. Her book, Marketing in the Age of Google (updated edition, May 2012), provides a foundation for incorporating search strategy into organizations of all levels. Follow her on Twitter at @vanessafox.

  • http://www.ihsekat.com/ Takeshi

    I’m loving the ability to mark errors as “fixed”. It’s nice to clear out a page of errors without having to wait for Google to re-crawl everything. I also agree with their decision to move robots.txt stuff out of the errors tab– I’ve personally never found it useful, and was always confused why pages I had purposely blocked out were showing as “errors”.

    The loss of ability to download all the errors is a shame, though. It seems like a lot of these changes make things less convenient to manage for larger sites, while making things simpler for smaller sites.

  • http://scalefigure.com Jason Meininger

    Good overview Vanessa. You of all people are the ideal one to call out the problems with the new changes.

    I’m seeing a big upswing in “soft” 404s, and on investigation many of them appear to be valid 301s or indeed still-functioning pages with no clear information about *why* they’ve shown up in this list. It makes me lose confidence in the report.

    I think the ‘mark as fixed’ function is pretty useless if Google doesn’t agree things are fixed, and on a large site I’m not remotely likely to bother going through ticking boxes. They can determine whether it’s fixed when they crawl the site. I still find it frustrating that once reported, errors take aaages to go away on their own, even if the error did not happen the next time Google crawled. Knowing something broke for a little while is far less useful than knowing what is broken *right now*, but they all seem lumped together.

    I also agree losing the ability to download the errors is a major fault – it means we’ve lost a whole lot of data on an enterprise-level site and has made troubleshooting a lot more foggy.

    I wonder how many of these changes were based on actual user feedback?

  • http://www.jlh-marketing.com Jenny Halasz

    How disappointing to see a company that is all about user experience make their tool less functional. Thanks for taking the time to go through all this for us, Vanessa; I was wondering if I was just missing something. I hope your considerable clout at Google can get this fixed for all of us who rely on webmaster tools for important data.

  • Chas

    Hard 404~ Closed all Google Accounts~ Error Fixed.

  • http://www.treeeye.com Lnoto

    It’s inevitable to produce some errors. It is good that they revamped it.

  • http://ides.com/nathanpotter pottern

    Thanks Vanessa – great coverage of the new Google Webmaster Central update – I too was disappointed with the first two items you covered (Ability to download all crawl error sources and
    100K URLs of each type). We host a large site and these two features were incredibly useful for us – hoping they add them back into the new design, which I agree is a nice update. Thanks again!

  • http://www.molotov-peacock.co.uk Kat Wesley

    Great roundup.

    I’ve noticed on the sites I manage that Not Followed is indeed a list of redirects that don’t work for some reason, rather than all the 301s and 302s that exist on the website.

    While clicking the URL will bring up more information about the error and the option to fetch the page as Googlebot, similar to the other new panels, I don’t have enough different URLs listed under Not Followed to work out whether the message (“There was a problem with active content or redirects”) will change based on what triggered the error or is just a generic hint about the reason the URL is listed in the Not Followed tab.

    If we really can’t drill down to the cause of the error anymore, I’ll be disappointed, but bringing the option to fetch the URL as Googlebot into this section does at least make it a little easier to work out what might have gone wrong.

  • Joey Garcia

    Using this new tool I discovered that I had external websites pointing to fake links on my site. I don’t want to add a page just so I don’t get a 404, so is there a way to highlight those links and indicate to Googlebot not to crawl them? The external site that made them just generated the URLs, and when you click on one it redirects to some escort service page.

    Is there anything I can do about this?
