Google “Reveals Index Secrets”: Charts Indexing of Your Site Over Time

Yesterday, Google Webmaster Tools launched Index Status (available under Health), which charts the number of indexed pages for your site over the last year.

Google Index Status

Total Indexed Count

Google says that this count is accurate (unlike the site: search operator) and is post-canonicalization. In other words, if your site includes a lot of duplicate URLs (due to things like tracking parameters) and the pages include the canonical attribute or Google has otherwise identified and clustered those duplicate URLs, this count only includes that canonical version and not the duplicates. You can also get this data by submitting XML Sitemaps but you’ll only see complete indexing numbers if your Sitemaps are comprehensive.
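
To make "post-canonicalization" concrete, here is a minimal Python sketch that collapses URL variations into one canonical form and counts the result. It is an illustration only, not Google's clustering method; the parameter names in `TRACKING_PARAMS` and both helper functions are assumptions made for this example.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical tracking parameters; a real site would have its own list.
TRACKING_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "sessionid"}


def canonicalize(url):
    """Drop tracking parameters and normalize host case and parameter order."""
    parts = urlsplit(url)
    kept = [
        (name, value)
        for name, value in parse_qsl(parts.query, keep_blank_values=True)
        if name not in TRACKING_PARAMS
    ]
    kept.sort()
    return urlunsplit(
        (parts.scheme, parts.netloc.lower(), parts.path, urlencode(kept), "")
    )


def post_canonical_count(urls):
    """Count unique URLs after duplicates are collapsed to one canonical form."""
    return len({canonicalize(url) for url in urls})


crawled = [
    "http://example.com/shoes",
    "http://example.com/shoes?utm_source=feed",
    "http://Example.com/shoes?utm_medium=email&utm_source=x",
    "http://example.com/hats?color=red",
]
print(post_canonical_count(crawled))  # 2: the three /shoes variants count once
```

A count like this only matches the report if the rules used to collapse duplicates resemble what Google actually does, which is not something you can see from the outside.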

Google also charts this data over time for the past year.

Edited to add: Google has told me that the data may have a lag time of a couple of weeks, which makes it more useful for trends than for real-time action. Also, if you look at domain.com, you’ll see stats for all subdomains, and if you look at www.domain.com, you’ll see stats for only the www subdomain (of course this means that if you don’t use www for your site as with searchengineland.com, there’s no easy way to see this data with subdomain information excluded.)

Advanced Status: How This Data Is Useful and Actionable

The Advanced option provides additional details:

Google Index Status Advanced

Great, right? More data is always good! Well, maybe. The key is what you take away from the data and how you can use it. To make sense of this data, the best approach is to exclude the Ever Crawled number and look at it separately (more on that in a moment). So, you’re left with:

  • total indexed
  • not selected
  • blocked by robots

The sum of these three numbers tells you the number of URLs Google is currently considering. In the example above, Google is looking at 252,252 URLs. 22,482 of those are blocked by robots.txt, which is fairly straightforward. This mostly matches the number of URLs reported as blocked under Blocked URLs (22,346). Unfortunately, it’s become difficult to look at the list of what those URLs are. The blocked URLs report is no longer available in the UI, although it is available through the API. That leaves 229,770 URLs, of which 31,480 are indexed. Which means 86% of the URLs weren’t selected for the index. Why not? Is this bad? The trouble with looking at these numbers without context is that it’s difficult to tell.
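
The arithmetic above can be sketched in a few lines of Python. The total and blocked figures come from the example; the indexed figure of 31,480 is the one used in the scenarios that follow, and the not selected figure is derived by subtraction rather than read from the report, so treat it as an assumption. The `IndexStatus` class and its field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class IndexStatus:
    """Counts from the Advanced view (field names are hypothetical)."""

    total_indexed: int
    not_selected: int
    blocked_by_robots: int

    @property
    def considered(self) -> int:
        # URLs Google is currently considering: the sum of the three counts.
        return self.total_indexed + self.not_selected + self.blocked_by_robots

    @property
    def unblocked(self) -> int:
        # What is left once robots.txt-blocked URLs are set aside.
        return self.considered - self.blocked_by_robots

    @property
    def not_selected_share(self) -> float:
        # Share of unblocked URLs that were not selected for the index.
        return self.not_selected / self.unblocked


status = IndexStatus(
    total_indexed=31_480,
    not_selected=198_290,  # derived: 229,770 - 31,480
    blocked_by_robots=22_482,
)
print(status.considered)  # 252252
print(status.unblocked)   # 229770
print(f"{status.not_selected_share:.0%}")
```

Keeping the ever crawled number out of this structure mirrors the advice above to look at it separately.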

Let’s say we’re looking at a site with 50,000 indexable pages. Has Google crawled only 31,480 unique pages and indexed all of them? (In this case, all of the not selected would be non-canonical URL variations with tracking codes and the like.) Or has Google crawled all 50,000 (plus non-canonical variations) but has decided only 31,480 of the 50,000 were valuable enough to index? Or maybe only 10,000 of those URLs indexed are unique, and due to problems with canonicalization, a lot of duplicates are indexed as well.

This problem is difficult to solve without a lot of other data points to provide context. Google told me that:

“A URL can be not selected for indexing for many reasons including:

  • It redirects to another page
  • It has a rel=”canonical” to another page
  • Our algorithms have detected that its contents are substantially similar to another URL and picked the other URL to represent the content.”

If the not selected count is solely showing the number of non-canonical URLs, then we can generally extrapolate that for our example, Google has seen 31,480 unique pages from our 50,000-page site and has crawled a lot of non-canonical versions of those pages as well. If the not selected count also includes pages that Google has decided aren’t valuable enough to index (because they are blank, boilerplate only, or spammy), then things are less clear. (Edited to add: Google has further clarified that “not selected” includes any URLs flagged as non-canonical (the third bullet above could include blank, boilerplate, or duplicate pages), URLs with meta robots noindex tags, and URLs that redirect, and that the count is not based on page quality.)
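
The reasons Google gave can be written down as a small classifier, which is handy if you want to estimate your own not selected count from crawl data you already have. This is a sketch under assumptions: the `CrawledUrl` record and its fields are hypothetical, and the order of the checks is a choice made for this example, not something Google has described.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CrawledUrl:
    """One URL from your own crawl data (all fields are hypothetical)."""

    url: str
    redirect_target: Optional[str] = None
    rel_canonical: Optional[str] = None
    meta_noindex: bool = False
    duplicate_of: Optional[str] = None


def not_selected_reason(page: CrawledUrl) -> Optional[str]:
    """Return why a URL would likely be 'not selected', or None if it is indexable."""
    if page.redirect_target:
        return "redirects to another page"
    if page.meta_noindex:
        return "meta robots noindex"
    if page.rel_canonical and page.rel_canonical != page.url:
        return "rel=canonical to another page"
    if page.duplicate_of:
        return "substantially similar to another URL"
    return None


pages = [
    CrawledUrl("http://example.com/a"),
    CrawledUrl("http://example.com/old", redirect_target="http://example.com/a"),
    CrawledUrl("http://example.com/a?ref=x", rel_canonical="http://example.com/a"),
    CrawledUrl("http://example.com/tag/a", meta_noindex=True),
]
for page in pages:
    print(page.url, "->", not_selected_reason(page))
```

A self-referencing rel=canonical is treated as indexable here, since it points at the page itself rather than at another URL.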

If 86% of Google’s crawl is of non-canonical URLs that aren’t indexed and redirects, is that a bad thing? Not necessarily. But it’s worth taking a look at your URL structure. Non-canonical URLs are unavoidable: tracking parameters, sort orders, and the like. But can you make the crawl more efficient so that Google can get to all 50,000 of those unique URLs? Google’s Maile Ohye has some good tips for ecommerce sites on her blog. Make sure you’re making full use of Google’s parameter handling features to indicate which parameters shouldn’t be crawled at all. For very large sites, crawl efficiency can make a substantial difference in long tail traffic. More pages crawled = more pages indexed = more search traffic.
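
One way to decide which parameters deserve attention is to count how often each one shows up in the URLs Googlebot actually requests. The Python sketch below assumes you have already pulled those URLs out of your server logs; the function name and the sample URLs are made up for illustration.

```python
from collections import Counter
from urllib.parse import parse_qsl, urlsplit


def parameter_crawl_counts(crawled_urls):
    """Count how many crawled URLs carry each query parameter name."""
    counts = Counter()
    for url in crawled_urls:
        query = urlsplit(url).query
        # A set, so a parameter repeated within one URL is counted once.
        names = {name for name, _ in parse_qsl(query, keep_blank_values=True)}
        counts.update(names)
    return counts


# Hypothetical sample of URLs requested by Googlebot.
googlebot_urls = [
    "http://example.com/list?sort=price&page=2",
    "http://example.com/list?sort=name",
    "http://example.com/item?id=7&utm_source=feed",
    "http://example.com/about",
]
for name, hits in parameter_crawl_counts(googlebot_urls).most_common():
    print(f"{name}: {hits} crawled URLs")
```

Parameters that appear in a large share of crawled URLs but never change the page content are the likeliest candidates for parameter handling.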

Ever Crawled

What about the ever crawled number? This data point should be looked at separately from the rest as it’s an aggregate number from all time. In our example, 1.5 million URLs have been crawled. But Google is currently considering only 252,252 URLs. What’s up with the other 1.2 million? This number includes things like 404s, but for this same site, Google is reporting only 5,000 of those, so that doesn’t account for everything. Since this count is “ever” rather than “current”, things like 404s have surely piled up over time. Edited to add: Google has clarified that all numbers are for HTML files only, and not for filetypes like images, CSS files or JavaScript files.
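
A rough sketch of that gap in Python. The 1.5 million and 5,000 figures are the rounded numbers quoted above, so the result is only approximate, and the variable names are illustrative.

```python
ever_crawled = 1_500_000        # "1.5 million", rounded in the text above
currently_considered = 252_252  # indexed + not selected + blocked by robots
reported_404s = 5_000           # "only 5,000", also rounded

# URLs crawled at some point that Google is no longer considering.
no_longer_considered = ever_crawled - currently_considered
# What remains after subtracting the 404s currently being reported.
unexplained = no_longer_considered - reported_404s

print(f"{no_longer_considered:,} URLs crawled but not currently considered")
print(f"{unexplained:,} of those not explained by reported 404s")
```

Because the 404 report is current while ever crawled is cumulative, the unexplained remainder is an upper bound rather than a sign that something is wrong.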

In any case, I think this number is much more difficult to gain actionable insight from. If the ever crawled number is substantially smaller than the size of your site, then this number is very useful indeed as some problem definitely exists that you should dive into. But for the sites I’ve looked at so far, the ever crawled number is substantially higher than the site size.

Site size can be difficult to pin down, but for those of you who have a good sense of that, are you finding that most of your pages are indexed?

Opinions expressed in the article are those of the guest author and not necessarily Search Engine Land.

About The Author: Vanessa Fox is a Contributing Editor at Search Engine Land. She built Google Webmaster Central and went on to found software and consulting company Nine By Blue and create Blueprint Search Analytics, which she later sold. Her book, Marketing in the Age of Google (updated edition, May 2012), provides a foundation for incorporating search strategy into organizations of all levels. Follow her on Twitter at @vanessafox.

  • http://www.facebook.com/drew.pokoj Drew Pokoj

    I wonder what this data is based on. I have a BRAND new site, launched about 2 weeks ago, and got it in the index by the next day. Currently it has 2,000-some pages that show up, and I am already starting to get a few hits per day via the SERPs, yet this new chart shows 0 indexed pages for my site… odd.

    oh, and yes.. there is crawl data and search queries + clicks already in my webmaster tools

  • http://twitter.com/roseberry9 Tom Roseberry

    any idea if this includes all subdomains. You have to set up accounts for each separately in GWMT (though not Bing, which i like much better), but these index numbers look to be too high not to be inclusive of all subs.

  • http://twitter.com/LoginRadius LoginRadius

    Interesting and useful … let me check it for my startup LoginRadius, which offers social infrastructure to businesses! Btw, they should come up with something like this based on social networks, what do you think?

  • http://top5ives.blogspot.com/ Majid Ali

    Index Status is useful and interesting. I will check if it works for me.

  • Mike Miller

    I am seeing the same thing.  I did a site:www.sitename.com and saw 164,000 pages indexed, but according to GWT, I’m seeing 1.37mil.  I’m assuming subdomains are factored into this number, which almost makes this report useless

  • http://profiles.google.com/singh8954 singh 09

    Good news: now we can all find out how much Google is indexing. What happens when someone revamps their website? Will redirected URLs be counted?

  • http://www.way2earning.com/ Suresh

    This is a great move by Google. I opine Index status helps webmasters to identify the redirected and 
    canonical pages. 

  • http://twitter.com/bsdeshmukh Babarao Deshmukh

    Gtalk is down… make a post on it… thanks

  • http://twitter.com/HP2Z23 Corina C.Ramirez

    Yes same here – its down from last 1 hour

  • http://twitter.com/roseberry9 Tom Roseberry

     Yeah, I could see that. I’m usually concerned with the site as a whole and was always annoyed that G WMT didn’t allow all subdomains to roll up to one “site” so i like that they’d show indexing across the entire domain. Though they should be consistent and an option of which way you’d prefer to configure would be nice too.

  • http://guymanningham.com/ Guy Manningham

    Great info. Google, you elusive temptress! You always seem to change stuff just as I get up to speed.

  • http://www.ninebyblue.com Vanessa Fox

    See edit in the article. I asked Google about subdomains and they said those numbers are only included if you’re looking at sitename.com (not if you’re looking at http://www.sitename.com). 

    Note that site: search numbers are notoriously inaccurate. How many pages does your site actually have?

  • http://www.ninebyblue.com Vanessa Fox

    See update in article: Google has told me that there is a lag in this data.

  • http://www.ninebyblue.com Vanessa Fox

    Redirects are included in the “not selected” number.

  • http://twitter.com/roseberry9 Tom Roseberry

     Thanks – Vanessa. it’s webmd.com. And i don’t really know. several million. however most of the bulk that’s not as easy to know for sure is on subdomains like forums.webmd.com. Just looking at the www. version in WMT is more URLs than we have, at least as i think of it. But now that i consider it more this is probably including all paginated URLs (page=2, page=3, etc.) so that would make sense.

    Anyway, greatly appreciate you asking the follow up and reposting.

    Tom

  • http://www.brickmarketing.com/ Nick Stamoulis

    It’s always interesting to see your site through the eyes of Google. It may not be 100% accurate, but if you’re fairly confident your site has 10,000 pages and Google is only indexing 5,000 of them you know something is up.

  • http://www.devonwebdesigners.com/ Elizabeth Jamieson

    I wish they’d identify which pages are the ones included in the not selected category.

  • Mahendra Varma

    It was quite interesting and very useful information; now we can see the site status very clearly.

  • Michael Carlin

    So how do we lower the not selected count? I can’t remove redirects or I’ll lose that link juice, I have no dupe content, and you say that even if I no-index tag pages and archives on WordPress that they will still be part of this number….
