Google “Knows” About 1 Trillion Web Items

How big is Google’s web index? Google hasn’t said for years and still isn’t saying — but it is blogging today about how it "knows" of 1 trillion web-based items. Joy: we haven’t had a search engine confuse things by talking about pages it merely "knows" about since the Lycos days. Glad to have Google firing the first shot to take us back into the search engine size wars of old.

As a short refresher, Google used to list the number of pages it had indexed on its home page. It dropped that count back in September 2005, after Yahoo (for a short period) had claimed to have indexed more. Both search engines swapped PR blows over who was bigger; then we got détente when the count went away.

That was good. Very good. This is because search engine size has long been used as a substitute for a relevancy metric that doesn’t exist. If a search engine wants to seem twice as good as a competitor, it need only trot out a bar chart showing it has twice as many documents as that competitor. Plus, there’s the famous haystack metaphor to toss in: you can’t find the needle in the haystack if you’re searching only half the haystack!

But more documents don’t mean better relevancy. Indeed, a bigger index can make a search engine worse, if it doesn’t have good relevancy. My spin on the haystack metaphor has always been this: if I dump the entire haystack on your head, does that help you find the needle? Chances are, you just get overwhelmed by a bunch of hay.

Still, size has long been an appealing stat for the search engines to fall back on — which in turn would lead them to find ways to inflate the size figures they’d report. Way back in 1996, Lycos talked about the number of pages it "knew" about, even though those weren’t actually indexed or made accessible to searchers. Excite was so annoyed that it pushed back with a page on how to count URLs, as you can see archived here.

Now we’ve got Google talking about "knowing" 1 trillion items of content out there:

Recently, even our search engineers stopped in awe about just how big the web is these days — when our systems that process links on the web to find new content hit a milestone: 1 trillion (as in 1,000,000,000,000) unique URLs on the web at once!

It’s easy to come away with the idea that Google lets you search against 1 trillion documents. That’s not the case, as the post does explain:

We don’t index every one of those trillion pages — many of them are similar to each other, or represent auto-generated content similar to the calendar example that isn’t very useful to searchers. But we’re proud to have the most comprehensive index of any search engine, and our goal always has been to index all the world’s data.

All I really want is the last part — that Google has what it believes to be a comprehensive index of the web. I don’t even want them or anyone saying that they have the "most" comprehensive, given that this is so difficult to verify. Indeed, consider this from Google’s own post:

So how many unique pages does the web really contain? We don’t know; we don’t have time to look at them all! :-) Strictly speaking, the number of pages out there is infinite — for example, web calendars may have a "next day" link, and we could follow that link forever, each time finding a "new" page. We’re not doing that, obviously, since there would be little benefit to you. But this example shows that the size of the web really depends on your definition of what’s a useful page, and there is no exact answer.

Right, there’s no exact answer to what’s a useful page — and so in turn, there’s no one exact answer to who has the "most" of them collected. Tell me you have a good chunk of the web, and I’m fine. But when Google or any search engine starts making size claims, my hackles go way up. There are better things to focus on.

For more, see related discussion on Techmeme.

