Google “Knows” About 1 Trillion Web Items

How big is Google’s web index? Google hasn’t said for years and still isn’t saying, but it is blogging today about how it "knows" of 1 trillion web-based items. Joy. We haven’t had a search engine confuse things by talking about pages it "knows" about since the Lycos days. Glad to have Google firing the first shot to take us back into the search engine size wars of old.

As a short refresher, Google used to list the number of pages it had indexed on its home page. It dropped that count back in September 2005, after Yahoo (for a short period) had claimed to have indexed more. Both search engines swapped PR blows over who was bigger, then we got detente when the count went away.

That was good. Very good. This is because search engine size has long been used as a substitute for a relevancy metric that doesn’t exist. If a search engine wants to seem twice as good as a competitor, it need only trot out a bar chart showing it has twice as many documents as that competitor. Plus, you toss in the famous haystack metaphor. You can’t find the needle in the haystack if you’re searching only half the haystack!

But more documents doesn’t mean better relevancy. Indeed, more documents can make a search engine worse, if it doesn’t have good relevancy. My twist on the haystack metaphor has always been to ask: if I dump the entire haystack on your head, does that help you find the needle? Chances are, you just get overwhelmed by a bunch of hay.

Still, size has long been an appealing stat for the search engines to reach for, which in turn has led them to find ways to inflate the size figures they report. Way back in 1996, Lycos talked about the number of pages it "knew" about, even though these weren’t actually indexed or made accessible to searchers. Excite was so annoyed that it pushed back with a page on how to count URLs, as you can see archived here.

Now we’ve got Google talking about "knowing" 1 trillion items of content out there:

Recently, even our search engineers stopped in awe about just how big the web is these days — when our systems that process links on the web to find new content hit a milestone: 1 trillion (as in 1,000,000,000,000) unique URLs on the web at once!

It’s easy to come away with the idea that Google lets you search against 1 trillion documents. That’s not the case, as the post does explain:

We don’t index every one of those trillion pages — many of them are similar to each other, or represent auto-generated content similar to the calendar example that isn’t very useful to searchers. But we’re proud to have the most comprehensive index of any search engine, and our goal always has been to index all the world’s data.

All I really want is the last part — that Google has what it believes to be a comprehensive index of the web. I don’t even want them or anyone saying that they have the "most" comprehensive, given that this is so difficult to verify. Indeed, consider this from Google’s own post:

So how many unique pages does the web really contain? We don’t know; we don’t have time to look at them all! :-) Strictly speaking, the number of pages out there is infinite — for example, web calendars may have a "next day" link, and we could follow that link forever, each time finding a "new" page. We’re not doing that, obviously, since there would be little benefit to you. But this example shows that the size of the web really depends on your definition of what’s a useful page, and there is no exact answer.

Right, there’s no exact answer to what’s a useful page — and so in turn, there’s no one exact answer to who has the "most" of them collected. Tell me you have a good chunk of the web, and I’m fine. But when Google or any search engine starts making size claims, my hackles go way up. There are better things to focus on.

For more, see related discussion on Techmeme.



About The Author: Danny Sullivan is a Founding Editor of Search Engine Land. He’s a widely cited authority on search engines and search marketing issues who has covered the space since 1996. Danny also serves as Chief Content Officer for Third Door Media, which publishes Search Engine Land and produces the SMX: Search Marketing Expo conference series. He has a personal blog called Daggle (and keeps his disclosures page there). He can be found on Facebook and Google+, and microblogs on Twitter as @dannysullivan.
